
Akash Guru, Jonathan Karr

-

July 23, 2024

AI in biology: Could data hygiene be holding you back?

The field of machine learning (ML) has evolved from the simple perceptron to large language models, driven by novel architectures, vast amounts of internet data, and millions of dollars of GPU training time. These large models, now interchangeably called artificial intelligence (AI), have made remarkable strides in image generation, language processing, strategy games, robotics, and much more. Applications enabled by AI have reached many industries, from software engineering [1] to graphic design [2].

Yet in the domain of biotech, extracting the benefits of AI is often a challenge, because the scattered, unstructured data generated by the research process stands in the way. Here we explore the common ways AI is used in biology and the processes and tools you might consider to get your data ready for AI models and scientific analysis.

How AI is used in biology

At the intersection of AI and biology, advanced models can now be applied to understand, engineer, and design the building blocks of life, such as DNA, RNA, and proteins. This has been made possible by the troves of data produced by high-throughput measurement technologies, such as genomic sequencing, single-cell RNA sequencing, proteomics, and metabolomics, and by the public databases that curate these data. Looking broadly across the life sciences, we see three main categories of modern AI use:

Figure 1: Major areas of AI application in modern biology

Models for biological design help design biological entities such as proteins, RNA, and viral vectors. These are large models trained on molecule-specific datasets that have been curated over decades. Prominent examples are AlphaFold and ESMFold, which predict the structures of proteins from their amino acid sequences, enabling rapid investigation of protein structures with less laborious wet-lab experimentation (see the sketch below). Additional examples include models for predicting RNA structures, virtually screening small molecules, and designing antibodies. At Deep Origin, we are building state-of-the-art models for virtually screening small molecules. Learn about our approach here.
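
To make this concrete, here is a minimal sketch of sequence-to-structure prediction using the public ESM Atlas folding endpoint that serves ESMFold. The URL and payload format reflect the service as publicly documented, and the sequence is an arbitrary example, so treat this as a sketch rather than production code:

```python
# Minimal sketch: fold a protein sequence via the public ESM Atlas API.
# Assumptions: the endpoint below is available and accepts a raw sequence
# as the POST body, returning a PDB-format prediction.
import requests

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"

resp = requests.post(
    "https://api.esmatlas.com/foldSequence/v1/pdb/",
    data=sequence,
    timeout=120,
)
resp.raise_for_status()

# The response body is a PDB-format structure prediction.
with open("predicted_structure.pdb", "w") as f:
    f.write(resp.text)
print(resp.text[:200])
```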

Knowledge tools are models trained on large bodies of knowledge, such as research articles, reports, images, and graphs, that make this knowledge more accessible to scientists, either through chat-like interfaces or by automatically performing actions based on it. Examples include generalized models such as ChatGPT, Claude, and Gemini, and life-science-specific models such as Med-Gemini, PubMedGPT, and Tx-LLM. These models can be used to write scientific articles, extract knowledge from articles [3,4], and generate code (a sketch of such extraction follows below). Tools such as ChemCrow combine general LLMs with external domain-specific tools to autonomously plan and execute chemical syntheses. Deep Origin serves these knowledge tools in our platform so scientists can perform research faster. Learn about our approach here.
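
As an illustration, here is a hedged sketch of knowledge extraction with a general-purpose LLM API. The OpenAI Python client is used purely as an example; the model name, prompt, and sample abstract are our own placeholders:

```python
# Sketch: extract named entities from a paper abstract with a chat LLM.
# Assumes OPENAI_API_KEY is set in the environment; any chat-style LLM
# API works similarly.
from openai import OpenAI

client = OpenAI()

# Hypothetical one-sentence abstract, standing in for a real paper.
abstract = (
    "Inhibition of EGFR by erlotinib reduced HER2 phosphorylation "
    "in NSCLC cell lines."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model
    messages=[
        {
            "role": "system",
            "content": (
                "Extract every gene, protein, and small molecule "
                "mentioned in the text as a JSON list of "
                "{name, type} objects."
            ),
        },
        {"role": "user", "content": abstract},
    ],
)
print(response.choices[0].message.content)
```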

Experiment-guided adaptive design tools help design biological systems or experiments by iteratively learning from experimental data. For example, an antibody screening company might want to improve its designs over time. It can do this by first building an ML model on data from its initial designs, using the model to suggest the next set of designs to explore, and then using the resulting data to refine the model, repeating this loop iteratively (a toy version of the loop appears below). Similarly, a company could discover transcription factors that reduce aging by iteratively using an ML model to design sets of transcription factors, performing large-scale screening experiments, and using the measurements to refine the model. At Deep Origin, we're helping companies use ML to run their discovery platforms. Contact us to explore how we can help your company.
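
Here is a toy sketch of that loop using scikit-learn. The feature encodings, batch size, and simulated assay values are all placeholders for what would, in practice, come from real designs and wet-lab measurements:

```python
# Toy sketch of an experiment-guided design loop: fit a model on measured
# designs, score a candidate pool, pick the next batch, and (after wet-lab
# measurement) retrain. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical: 20 measured designs and a pool of 1,000 candidates,
# each encoded as a 32-dimensional feature vector.
X_measured = rng.normal(size=(20, 32))
y_measured = rng.normal(size=20)          # e.g., binding affinity
X_pool = rng.normal(size=(1000, 32))

for round_idx in range(3):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_measured, y_measured)

    # Use the spread across trees as a simple uncertainty estimate and
    # pick candidates with the highest predicted value plus uncertainty.
    per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
    score = per_tree.mean(axis=0) + per_tree.std(axis=0)
    batch = np.argsort(score)[-8:]

    # In practice these designs go to the lab; here we fake the assay.
    y_new = rng.normal(size=len(batch))
    X_measured = np.vstack([X_measured, X_pool[batch]])
    y_measured = np.concatenate([y_measured, y_new])
    X_pool = np.delete(X_pool, batch, axis=0)
    print(f"round {round_idx}: {len(y_measured)} measured designs")
```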

| Category | Example model | Training data |
| --- | --- | --- |
| Foundational models for biological design | AlphaFold | Protein Data Bank (PDB) |
| | RoseTTAFold | Protein Data Bank (PDB) |
| | ProtGPT2 | UniProt |
| | ESMFold | UniProt |
| Knowledge tools | Med-Gemini | PubMed, medical literature databases |
| | PubMedGPT | PubMed |
| | ChatGPT, Claude, Gemini, Llama | Diverse text datasets including scientific literature |
| | Tx-LLM (Therapeutic LLM) | Therapeutics Data Commons |
| Experiment-guided adaptive design tools | Recursion | RxRx |
| | Your Company | ? |

Table 1: Example AI models and tools in biology with their corresponding training data sources

Models are dependent on data

All of these categories of models share a common need: clean data. In the first category, the prominent breakthrough, AlphaFold, owes much of its success to the Protein Data Bank (PDB), a database of 218,196 experimentally determined protein structures that the scientific community has painstakingly curated for the last 53 years. This curation has created a massive dataset that is ready for training models such as AlphaFold. Each structure in the PDB has been validated for consistency and thoroughly documented. Furthermore, the PDB provides an API that makes it easy to retrieve each structure, as the sketch below illustrates.
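
For example, a single structure can be pulled from the RCSB file service in a few lines (a minimal sketch; error handling and alternative formats such as mmCIF are omitted):

```python
# Minimal sketch: download one experimentally determined structure
# from the RCSB PDB file service.
import requests

pdb_id = "1CRN"  # crambin, a small, well-known test structure
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"

resp = requests.get(url, timeout=60)
resp.raise_for_status()

with open(f"{pdb_id}.pdb", "w") as f:
    f.write(resp.text)
print(f"Downloaded {pdb_id}: {len(resp.text.splitlines())} PDB records")
```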

In the second category, prominent life-science-specific knowledge tools such as PubMedGPT and Tx-LLM have been trained on PubMed Central, a database of articles carefully curated by the NCBI, and on the Therapeutics Data Commons, a set of 66 datasets meticulously curated by researchers at Harvard, MIT, and Stanford.
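
For instance, here is a hedged sketch of loading one Therapeutics Data Commons benchmark with the community-maintained PyTDC package (pip install PyTDC); the dataset name below is one of TDC's published ADME benchmarks:

```python
# Sketch: load a curated Therapeutics Data Commons dataset with PyTDC.
from tdc.single_pred import ADME

data = ADME(name="Caco2_Wang")   # intestinal permeability benchmark
split = data.get_split()         # dict of train/valid/test DataFrames

# Columns: Drug_ID, Drug (SMILES string), Y (measured label)
print(split["train"].head())
```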

Most biotech companies, however, fall under the third category: they use their own data to refine foundational models and knowledge models, and then use these models to guide experimentation and discovery. Yet many companies are limited by the poor annotation, inconsistent formatting, and scattered organization of their data. To harness the potential of AI, companies need to tackle this critical data engineering problem.

The data problem: Organization of data

As the first two categories of models have shown, organized data is an essential ingredient for AI. However, in research organizations, data is often scattered across various devices, formats, and teams. Furthermore, critical metadata is often missing, annotated inconsistently, and disconnected from the data it describes.

For example, consider a company that aims to screen proteins for their effects on cell behavior. Figure 2 shows a typical workflow the company might use. It may have metadata spread across Excel spreadsheets on laptops and in Google Sheets, data files distributed among Google Drive and the computers attached to its instruments, analysis code dispersed across laptops and GitHub, and processed results and reports scattered across electronic notebooks and PowerPoint slides. In particular, different teams within companies frequently silo their data in different places. For example, sequence data may be stored separately from cellular imaging data and plate assay data. This siloing makes it challenging to study biological problems holistically, such as using multiple assays to screen drugs for efficacy, localization, and toxicity. Because data is typically scattered, even tracing the provenance of a decision back to its supporting data can take multiple days. For most companies, this lack of standardization and interoperability is a major hindrance to leveraging AI. (The toy sketch below shows how much simpler provenance becomes once metadata and results share consistent identifiers.)

Figure 2: The complex research workflow in biotech: A scattered collection of tools, data, and processes.
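
To see what the alternative looks like, here is a toy sketch of the join that becomes trivial once metadata and assay results share consistent sample identifiers; all names and values below are hypothetical:

```python
# Toy sketch: with shared sample IDs, linking a readout back to its
# experimental context is a one-line join, not a multi-day hunt.
import pandas as pd

# Hypothetical metadata, e.g., from a sample registration sheet.
metadata = pd.DataFrame({
    "sample_id": ["S001", "S002", "S003"],
    "protein": ["VEGF-A", "EGFR", "HER2"],
    "cell_line": ["HEK293", "HeLa", "HEK293"],
})

# Hypothetical plate-reader output, keyed by the same sample_id.
results = pd.DataFrame({
    "sample_id": ["S001", "S002", "S003"],
    "viability_pct": [92.4, 61.0, 78.3],
})

annotated = metadata.merge(results, on="sample_id")
print(annotated)
```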

Current approaches and their limitations

Numerous efforts have addressed parts of this data organization problem. However, as we will discuss below, the current solutions fall short.

Electronic Laboratory Notebooks (ELNs)

Electronic Laboratory Notebooks have emerged as a popular tool for experimental scientists to capture their day-to-day research activities and observations. In particular, ELNs help scientists document their experimental procedures, metadata, and small data files. Notably, the flexibility and freedom of unstructured text offered by ELNs enable researchers to capture experimental notes in a manner that suits their individual preferences and workflows.

However, this flexibility comes at a cost. The lack of standardization in data and metadata across different scientists and experiments typically makes it challenging to perform comparative analyses and extract insights. Without a consistent and structured approach to data, ELNs make it difficult for scientists to realize the full potential of their data. Further, most ELNs can’t store large data files, and most computational scientists find it hard to programmatically access data from ELNs. These limitations further hinder organizations from integrating their data, identifying patterns, and making data-driven decisions.

Scientific Data Management Systems (SDMS)

Scientific Data Management Systems have been developed to centrally store and organize research data. In particular, these systems focus on long-term data preservation, such as for regulatory audits.

While SDMS are a valuable solution for data storage, they often fall short in facilitating data analysis and visualization. The lack of seamless integration with other data analysis tools and platforms means that SDMS are often a data dead end, where data is isolated and underutilized.

Laboratory Information Management Systems (LIMS)

Laboratory Information Management Systems have been traditionally used to log, track, and manage samples throughout their life cycles, from reception to processing and eventual disposal. LIMS are often bespoke solutions tailored to the specific needs of individual laboratories and assays. Furthermore, because most LIMS prioritize sample management over data management, LIMS often fall short of facilitating data analysis and decision-making.

The need for a holistic solution: Data management + AI

Figure 3: Deep Origin platform unifies all processes in a research workflow

AI has the power to transform and accelerate biology; however, data disorganization remains a significant challenge that often prevents building or training good models.

There is a need for a solution that prioritizes data standardization, interoperability, and accessibility. This solution should not only store data and metadata, but also enable analysis, including the ability to run ML pipelines. Ultimately, to unlock the true potential of AI, scientists need a unified platform that can capture, integrate, and analyze diverse data.

At Deep Origin, we are solving these challenges by developing the first end-to-end platform that will enable scientists to organize and annotate their data, perform interactive and automated analyses with advanced AI, create charts, and make decisions (Figure 3). Ultimately, our goal is to help scientists discover therapeutics faster and more reliably.

Sign up for Deep Origin OS today

Free access for a limited time during beta

References

  1. Will Heaven (2023). How AI assistants are already changing the way code gets made. MIT Technology Review.
  2. Bernard Marr (2023). The Rise Of Generative AI In Design: Innovations And Challenges. Forbes.
  3. Microsoft Research AI4Science (2023). The Impact of Large Language Models on Scientific Discovery: A Preliminary Study using GPT-4. arXiv:2311.07361.
  4. Qiang Zhang, Keyang Ding, Tianwen Lyv, Xinda Wang, Qingyu Yin et al. (2024). Scientific Large Language Models: A Survey on Biological & Chemical Domains. arXiv:2401.14656.
