
Srinivas Gorur-Shandilya - August 19, 2024

Deep Origin Data Hub: Building Frictionless Machines for the Life Sciences

Successful R&D programs are all alike; every biotech company thinks it faces its own unique challenges.

With apologies to Tolstoy, I will make the case that biotechnology companies face common pitfalls as they operate their research and development programs, and that most companies can maximize their chance of success by adopting common tools and best practices. In this post I will describe some of the primary challenges facing R&D teams in biotech companies, and how we at Deep Origin believe we can tackle these problems to make you more productive and successful.

Friction is the success killer

If there is one word that captures the challenge experimental and computational scientists face, both in industry and in academia, it is this: friction.

Friction raises its clammy fingers to stymie you, the experimenter, who has spent months getting ready to generate data. You are raring to go, but you now have to pause to address the challenges of organizing, storing, and backing up the data you want to generate. Experimental scientists want to iterate quickly over a few dozen protocols or conditions, but each iteration takes much longer than it needs to (and is more painful!) because information is frustratingly siloed: your system for writing down evolving protocols isn’t the same system you use for collecting data, which is different from your system for sharing data with your computational colleagues.

Friction bogs down computational work too: it makes you process your freshly collected data sequentially when a cluster could process it in parallel, and it keeps you from quickly exploring your data by making you spend the entire day trying to install GPU drivers.

Friction may not routinely be named as a leading threat to R&D, but it is at the heart of many of the biggest barriers scientists face. It stifles creativity and slows the pace of research by making every iteration more expensive. The daily challenge scientists face is not a lack of powerful hardware or brilliant algorithms; rather, it lies in the interstices between processes and components: the struggle of gluing things together and the work required to grease the moving parts so that data keeps moving smoothly.

Modern science runs on flexible teamwork

No scientist is an island; gone are the days of the sole genius laboring in solitude, only to emerge years later with deep insight. Science today is done by interdisciplinary teams composed of individuals with diverse expertise. For example, in a pre-clinical drug discovery program, one person might be responsible for animal care and husbandry. Another might lead genetic engineering. A third might focus on surgeries. This specialization means that the scientists who generate data often hand off that data to other colleagues for analysis and visualization.

Such a division of labor is now commonplace, and it is necessary because each step is extremely challenging and requires years of training. For teams to work effectively on the same project, toward a shared outcome, there has to be robust infrastructure that ensures data is handed off smoothly and that data and metadata (and intent!) are communicated clearly across team members.

Each person in this team is dealing with friction in their individual work; a system that makes individuals and the team successful will also have to minimize friction as each person generates resources or data that must be consumed by colleagues.

How can R&D teams reduce this friction? In our pre-clinical example, a technician could use a database to keep track of each animal, where each row records a single animal together with attributes such as its age, genotype, diet, and other metadata. The experimenter who uses these animals could use a second database to record their measurements. Each row in their database could capture an experimental session with an animal, and would include a cross-reference to the animal database together with metadata about the drug regimen and experimental parameters that were tested in each session. Finally, the computational scientist who analyzes this data could use a third database to capture their analysis of each animal, with cross-references to the animal and its experimental sessions.
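As a minimal sketch of this structure (using hypothetical table and field names, not the data hub's actual schema), the three cross-referenced databases might look like this:

```python
# Illustrative sketch of the three cross-referenced databases described above.
# All names and fields are hypothetical; they are not the data hub's actual schema.
from dataclasses import dataclass, field


@dataclass
class Animal:                      # maintained by the animal-care technician
    animal_id: str
    age_weeks: float
    genotype: str
    diet: str


@dataclass
class ExperimentalSession:         # maintained by the experimenter
    session_id: str
    animal_id: str                 # cross-reference to the Animal database
    drug: str
    dose_mg_per_kg: float
    raw_data_file: str             # reference to the measurement file


@dataclass
class Analysis:                    # maintained by the computational scientist
    analysis_id: str
    animal_id: str                 # cross-reference to the Animal database
    session_ids: list[str] = field(default_factory=list)  # sessions analyzed
    result_summary: str = ""
```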

The creation, management and administration of such a database system is no trivial feat, especially when access controls are needed, for example, to provide each team with write access to their respective databases but read-only access to all other databases. This data engineering, while incredibly useful to R&D teams, is often either omitted or relegated to computational scientists, where it competes for priority with data analysis.
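The access-control pattern described here can be summarized as a simple policy matrix. The sketch below is purely illustrative; the team and database names are hypothetical, and this is not Deep Origin's actual access-control syntax:

```python
# Hypothetical access-control matrix for the three databases above: each team
# can write to its own database and read the others. Names are placeholders.
ACCESS_POLICY = {
    "animals":  {"animal-care": "write", "experimenters": "read",  "computational": "read"},
    "sessions": {"animal-care": "read",  "experimenters": "write", "computational": "read"},
    "analyses": {"animal-care": "read",  "experimenters": "read",  "computational": "write"},
}


def can_write(team: str, database: str) -> bool:
    """Return True if the given team has write access to the given database."""
    return ACCESS_POLICY.get(database, {}).get(team) == "write"
```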

At Deep Origin, we want to eliminate this source of friction. Our vision is a central platform for planning, recording, and analyzing experiments that enables interdisciplinary teams to work together seamlessly. Last November, we released the first piece of our vision: powerful workstations with pre-installed software for life science. Our workstations make it easy for computational scientists to perform computationally intensive analyses, free from the headaches of Linux systems administration and DevOps. This week, we released the second piece of our vision: the data hub, the central place for your measurement files, sample metadata, and experimental procedures. Our data hub makes it easy for experimental scientists to organize and share data. Together, our data hub and workstations enable research teams to quickly turn data into insights. Over the coming months, we will launch additional modules to streamline other parts of the R&D lifecycle.

Data is nothing without metadata

Measurement data is only valuable when it is tightly coupled to its metadata. For example, a large, multi-terabyte dataset collected by an experimentalist in our pre-clinical drug discovery example is only valuable in combination with metadata such as the genotype of each animal or the drug dose that each animal received. As a corollary, those terabytes of data are useless (or worse, a pollutant, because we still have to store them) if that metadata is lost, incorrect, or siloed from the scientists responsible for its analysis. It is not uncommon for data to exist as a heap of files on a computer connected to an instrument, with metadata tracked manually in a spreadsheet on that computer or in an isolated electronic laboratory notebook. Data munging is no one’s idea of fun.

A system for managing data and metadata at scale must navigate three challenges. First, the system must be capable of capturing measurement data files (of any type and size), structured metadata (typically columns), and procedural documents. Second, the system must be capable of helping researchers organize their data consistently. For example, research teams need to be able to use strongly typed schemas to validate their data. Third, because every research program is unique, the system must be customizable.

Our data hub allows you to create customized databases without writing a single line of code. Files can be uploaded directly to cells in databases, and other columns can be configured to capture a variety of metadata. Columns can be typed, and custom validators can help researchers quickly identify missing, malformed, and incorrect metadata. Furthermore, each research team can tailor their data hub to their specific needs, with custom columns and validations.
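To make the idea of typed columns and custom validators concrete, here is a minimal sketch in Python. The column definitions and validation rules are hypothetical examples, not the data hub's built-in schema language:

```python
# A sketch of typed columns with custom validators, illustrating the kind of
# checks described above; column names and rules are hypothetical.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Column:
    name: str
    dtype: type
    validator: Callable[[Any], bool] = lambda value: True  # accept anything by default


columns = [
    Column("genotype", str, validator=lambda v: v in {"WT", "KO", "HET"}),
    Column("dose_mg_per_kg", float, validator=lambda v: v >= 0),
]


def validate_row(row: dict) -> list[str]:
    """Return a list of problems in one row: missing, mistyped, or invalid values."""
    problems = []
    for col in columns:
        if col.name not in row:
            problems.append(f"missing: {col.name}")
        elif not isinstance(row[col.name], col.dtype):
            problems.append(f"wrong type: {col.name}")
        elif not col.validator(row[col.name]):
            problems.append(f"invalid value: {col.name}")
    return problems
```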

GUI or API? Why not both?

One of the challenges with using a single system across a scientific team is that what’s easy for the experimenter is not necessarily what’s preferred by the computational biologist. For example, data collectors and experimentalists may prefer graphical user interfaces, such as tables, that enable them to click and modify individual rows. In contrast, a computational biologist may want to batch ingest entire tables via code.

To streamline the path from data generation to analysis, the Deep Origin data hub supports both paradigms: individual edits and batch operations, GUIs and APIs. For example, experimental scientists can use our web application to view and edit databases and files. At the same time, computational scientists can use our Python API to programmatically fetch data and write analyses back to the data hub.
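A computational scientist's round trip might look like the sketch below. The client calls are hypothetical placeholders standing in for the data hub's Python API, and the database and column names come from the pre-clinical example above:

```python
# Sketch of the API-driven half of the workflow; the client calls are
# hypothetical placeholders, not the exact functions in Deep Origin's Python API.
import pandas as pd


def summarize_sessions(client) -> pd.DataFrame:
    """Fetch the sessions database, compute a per-animal summary, and write it back."""
    sessions = client.get_dataframe("experimental-sessions")    # hypothetical call
    summary = (
        sessions.groupby("animal_id")["dose_mg_per_kg"]
        .agg(["count", "mean"])
        .rename(columns={"count": "n_sessions", "mean": "mean_dose_mg_per_kg"})
        .reset_index()
    )
    client.write_dataframe("analysis-summaries", summary)       # hypothetical call
    return summary
```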

Let there be visualization

One of the most crucial and under-appreciated parts of data analysis is visualization. Visualization is often key to quickly building actionable insights, especially for experimentalists, who may need to tweak an experimental parameter or protocol on the fly. However, the experimental scientists who most need this immediate feedback often have the most limited visualization capabilities, because many visualizations are only available to those who code. As a result, experimental scientists often rely on computational biologists for analysis and visualization. This reliance introduces delays, which slows iteration.

There are several challenges to short-circuiting this loop. First, scientific data is collected in a bewildering array of bespoke formats. Specialized scientific software is often required just to view files generated by instruments, and no experimenter wants to wrestle with broken Python environments in the middle of a data collection session just to view the data they just collected. Second, because research is often at the bleeding edge of what is possible with existing software packages, scripts often must be written to load, process, and visualize data in the manner that makes sense for the experiment.

The Deep Origin data hub addresses both problems and provides all scientists with multiple ways to visualize data. First, in many cases, data can be visualized directly in our web application, using rich interactive visualizations for multiple data types and file formats; for example, scientists can interactively explore GenBank files for plasmids uploaded to our data hub. Second, a context-aware AI assistant in our data hub enables researchers to generate advanced visualizations without writing code. We have designed our AI assistant so that experimenters can ask it to perform custom analyses on specific columns or files in a database, write the code to visualize the results, and run that code, all without the experimenter writing a single line of code. In addition to giving experimentalists this capability right at the moment of data collection, the assistant greatly reduces the workload on the team's computational scientists by automating many routine visualization and analysis tasks.
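To illustrate, the snippet below is the kind of plotting script such an assistant could generate and run on a researcher's behalf; it is a hypothetical example, and the column names are placeholders borrowed from the pre-clinical example:

```python
# Illustrative example of an assistant-generated visualization; column names
# are hypothetical placeholders, not output from the actual AI assistant.
import matplotlib.pyplot as plt
import pandas as pd


def plot_dose_response(sessions: pd.DataFrame) -> None:
    """Scatter the measured response against drug dose, one point per session."""
    fig, ax = plt.subplots()
    ax.scatter(sessions["dose_mg_per_kg"], sessions["response"], alpha=0.7)
    ax.set_xlabel("Dose (mg/kg)")
    ax.set_ylabel("Response")
    ax.set_title("Dose-response across experimental sessions")
    plt.show()
```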

Modern science is a precision machine that manufactures insight

For most R&D teams, data generation and analysis are the foundations of insight. The path from data to insight is rarely linear, however; it usually involves many cycles of iteration and refinement. The raw materials that feed this machine are data, and its finished products are actionable insights.

At Deep Origin, we are committed to transforming the R&D process by removing the friction along the conveyor belt of scientific discovery and by building tools that multiply the productivity of scientists. No company should have to re-invent the wheel to manage and process its data; biotech teams should be able to focus on their primary research. Let us help you do what you do best: your science. Leave the SciOps to us. Check out our platform or contact us to learn more!

