How Microbiome Research Is Becoming a Huge Data Science

Not everyone thought the Human Genome Project was a good idea. Back in the late 1980s and early 1990s — when the project was still in its planning stages — some prominent scientists argued that sequencing the entire human genome was a boondoggle. “I find it less than obvious that this information will have utility beyond measure,” wrote Massachusetts Institute of Technology biologist Robert Weinberg. The project, critics worried, would siphon away precious grant funding from individual labs and pump it into a large, government project that was unlikely to yield many genuinely important results.

Three decades later, the Human Genome Project has more than repaid its investment, transformed nearly every corner of biomedical research, and laid the foundation of today’s biotech industry. One of the central ways that the project influenced the development of science and technology was by providing a freely available reference data set, which researchers used to develop new computational tools and sequencing technologies. As a result, biomedical research has become one of the biggest fields of data science.

A similar development is happening with a vast amount of non-human DNA that plays a vital role in our health: the DNA of the trillions of symbiotic bacteria that make up our “virtual organ,” known as the gut microbiome. Our microbiomes have a subtle but pervasive impact on the functioning of our metabolism, our susceptibility to disease, and our response to drugs. However, most bacterial species in our microbiomes have been, until recently, largely invisible, since most are unable to grow in petri dishes. To track these microbes, scientists rely on sequencing DNA extracted from fecal samples. As with the Human Genome Project, researchers are trying to facilitate microbiome research by building large reference data sets, which serve as a foundation for new technologies and data analysis tools.

One of the key data analysis challenges of microbiome research is to assemble complete genomes from small fragments of DNA sequences extracted from fecal samples. The job is like reassembling thousands of books from a dumpster filled with their ripped pages. If you don’t have the original texts as a reference, this is hard to do. But, with a decent computer and a master copy of the text, the job becomes almost trivial. This is why teams of researchers have recently focused on assembling hundreds of thousands of reference genome sequences of the human gut microbiome, including a new data set published in July. This latest work compiled genomes of thousands of species and more than 170 million of their non-human gene sequences. In our bodies, bacterial genes outnumber human genes nearly 10,000 to one.
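The “master copy” idea can be sketched in a few lines of Python: once a reference exists, placing a fragment is little more than a lookup. This is a toy illustration under strong assumptions (exact matches only, made-up sequences); real pipelines use aligners that tolerate sequencing errors. The names build_kmer_index and place_read are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(references, k):
    """Map every k-letter substring to the (genome, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def place_read(read, references, index, k):
    """Return (genome, position) pairs where the whole read matches exactly."""
    hits = []
    # Use the first k letters of the read as a seed, then verify the full read.
    for name, pos in index.get(read[:k], []):
        if references[name][pos:pos + len(read)] == read:
            hits.append((name, pos))
    return hits

# Invented reference "genomes" (the master copies) and fragments (the ripped pages).
references = {
    "species_A": "ATGGCGTACGTTAGCCTAGGA",
    "species_B": "TTGACCGGTATCGATCGGCTA",
}
K = 5
index = build_kmer_index(references, K)
reads = ["CGTACGTTAG", "GGTATCGATC", "AAAAAAAAAA"]
placements = {read: place_read(read, references, index, K) for read in reads}
```

Here the first two fragments land in species_A and species_B respectively, while the third matches nothing in the reference and would have to be assembled the hard way.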

The Microbiome as a Data Science Problem

These enormous data sets present new challenges and opportunities for computational biologists who seek to understand, and even manipulate, the human microbiome for human health. One of the key challenges is that, by themselves, bacterial genomes aren’t that useful. They need to be analyzed in conjunction with other data types. The microbiome matters to us because it changes over time in response to age, diet, medication, and even diseases like cancer. Our gut microbes co-metabolize our food with us, manipulate our immune systems, and form extensive metabolic networks with each other. To do all of this, gut bacteria express a vast repertoire of genes.

To make sense of the microbiome, researchers need to track how the expression of these bacterial genes changes over time and differs among patients. This often involves linking microbiome sequences with patient blood tests, epigenome data, clinical findings, and even histological images. Research teams are beginning to build resources that integrate these different data types. Resources like ColPortal, which focuses on colorectal cancer samples, bring together different data sets in a form that makes it easier for data analysts to focus on answering questions, rather than simply bringing together the data.
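At its simplest, integrating data types means joining records that share a patient identifier. The Python sketch below assumes each data type arrives as a dictionary keyed by patient ID; the field names and values are invented for illustration and do not reflect ColPortal's actual schema.

```python
def integrate(*tables):
    """Inner-join tables (dicts keyed by patient ID) into one record per patient."""
    # Keep only patients present in every table.
    shared_ids = set(tables[0]).intersection(*tables[1:])
    merged = {}
    for pid in sorted(shared_ids):
        record = {}
        for table in tables:
            record.update(table[pid])
        merged[pid] = record
    return merged

# Hypothetical data: three data types, each covering a different set of patients.
microbiome = {
    "P1": {"bacteroides_fraction": 0.42},
    "P2": {"bacteroides_fraction": 0.17},
    "P3": {"bacteroides_fraction": 0.35},
}
blood_tests = {
    "P1": {"crp_mg_per_l": 2.1},
    "P2": {"crp_mg_per_l": 11.4},
}
clinical = {
    "P1": {"diagnosis": "control"},
    "P2": {"diagnosis": "colorectal cancer"},
    "P4": {"diagnosis": "control"},
}

combined = integrate(microbiome, blood_tests, clinical)
```

Only P1 and P2 survive the join, because they are the only patients with all three kinds of data. Deciding what to do with the incomplete records is exactly the kind of bookkeeping that integrated resources take off an analyst's plate.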

Another challenge is to apply state-of-the-art analysis methods, like machine learning, to big, heterogeneous microbiome data sets. Machine learning algorithms can be great at classifying samples based on subtle patterns in a complex data set. For example, one goal of microbiome research is to predict early-stage cancer based on changing signatures in a patient’s microbiome composition. If this worked, the regular colonoscopies that we’re all supposed to get after age 50 could be replaced with a much less invasive screening method, one needing only a stool sample.
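To make “classifying samples” concrete, here is a deliberately simple Python sketch. It uses a nearest-centroid classifier as a stand-in for the far more sophisticated models used in practice, and the abundance profiles and labels are invented for illustration, not real measurements.

```python
import math

def centroid(profiles):
    """Average a list of abundance profiles, position by position."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def train(labeled_profiles):
    """Compute one average profile per label."""
    by_label = {}
    for profile, label in labeled_profiles:
        by_label.setdefault(label, []).append(profile)
    return {label: centroid(profiles) for label, profiles in by_label.items()}

def predict(model, profile):
    """Return the label whose average profile is closest to the sample."""
    return min(model, key=lambda label: math.dist(model[label], profile))

# Hypothetical relative abundances of three bacterial groups per stool sample.
training = [
    ([0.60, 0.30, 0.10], "healthy"),
    ([0.55, 0.35, 0.10], "healthy"),
    ([0.30, 0.30, 0.40], "cancer"),
    ([0.25, 0.35, 0.40], "cancer"),
]
model = train(training)
prediction = predict(model, [0.28, 0.32, 0.40])
```

A real screening model would have to cope with thousands of species, noisy measurements, and far subtler differences between groups, which is where specialist machine learning expertise comes in.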

Machine learning, however, generally isn’t for amateurs. Most microbiome scientists aren’t machine learning experts, and there is no reason for them to be — their job is in the lab. To make sure that high-quality machine learning is brought to bear on the problem, several programs are focused on building machine learning tools for microbiome data. The European Union-funded ML4 Microbiome is collecting data sets, establishing data standards, and building software that could be widely used in the research community. The Microbiome Learning Repo, run by Dan Knights of the University of Minnesota, is a publicly available repository of machine learning tools. Not too long ago, microbiome data scientists needed to build tools like this from scratch. Today, they can focus instead on data analysis.

Microbiome Data for Biotech

How will these new microbiome resources make a difference outside of the lab? Microbiome research isn’t only a subject for academic teams; over a dozen biotech startups are now working in the field, many less than five years old. Microbiome technologies can be classified into several common approaches, each of which relies on microbiome DNA sequencing and computer modeling:

Microbiome Transplants: Fecal microbiomes taken from healthy donors have shown some success as a treatment for chronic gastrointestinal infections. Companies like Rebiotix and MaaT Pharma are running clinical trials of microbiome treatments for diseases like bacterial infections and ulcerative colitis. One key to success will be understanding exactly what a “good” microbiome looks like — something that can only be figured out by analyzing microbiome sequencing data.

“Bugs as Drugs”: Rather than recreate the profile of a healthy microbiome, another approach is to focus on the metabolic function of certain species of gut bacteria. Seres Therapeutics hopes to improve the treatment of patients who are undergoing immunotherapy for metastatic melanoma, a highly lethal cancer. Since the microbiome interacts with the immune system, Seres has developed a mixture of bacteria that targets the immune system, with the goal of helping these patients respond better to therapy. To understand how bacteria turn the knobs on the human immune system, it’s critical to know what genes they express and model how those genes work in concert.

Microbiome Engineering: One of the more ambitious ways to control the microbiome is to genetically engineer it. The French company Eligo Biosciences aims to use bacterial viruses, called phages, to carry CRISPR gene editing technology into the microbiome. The idea is to genetically edit the bacteria in your gut to express helpful genes or to kill infectious bacteria. This technology is unlikely to show up in the clinic any time soon, but it can benefit from the new large databases of gut bacterial genes. Eligo’s technology sometimes targets antibiotic resistance genes within infectious bacteria, an approach that depends on identifying those genes from among the hundreds of millions of bacterial genes out there.

Microbiome Diagnosis: Perhaps one of the most promising applications of microbiome data is in diagnostics, especially for cancer. Tumors generate a lot of unusual metabolic byproducts, which then alter the microbiome. Companies like Metabiomics are based on the idea that such microbiome changes can be used to catch cancers early, long before symptoms begin to show up. To be successful, microbiome diagnostics will need to rely on good models that tease out any signs of danger from the day-to-day and week-to-week fluctuations of a healthy microbiome.
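One simple way to think about separating a warning sign from ordinary fluctuation is to compare each new measurement against a person's own history. The Python sketch below assumes a single bacterial group tracked weekly; the numbers are invented, the threshold of three standard deviations is an arbitrary choice, and commercial diagnostics would rely on much richer models.

```python
from statistics import mean, stdev

def deviation_score(history, new_value):
    """How many standard deviations the new value sits from the personal baseline."""
    return (new_value - mean(history)) / stdev(history)

def is_unusual(history, new_value, threshold=3.0):
    """Flag values that fall outside the person's normal week-to-week range."""
    return abs(deviation_score(history, new_value)) > threshold

# Hypothetical weekly relative abundance of one bacterial group in one person.
weekly_abundance = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.10]

routine = is_unusual(weekly_abundance, 0.11)   # within normal fluctuation
flagged = is_unusual(weekly_abundance, 0.25)   # far outside the baseline
```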

* * *

Microbiome data is dizzyingly complex, even by the standards of today’s data-intensive biomedical sciences. But as with most data science fields, the pace of research is accelerating as microbiome researchers build new tools and databases that others can use to answer new questions. As that happens, the balance of work in the field tips away from the scientists at the lab bench and more toward data scientists at the keyboard.
