These days, when we talk about data, we usually mean a lot of data — not just a few hundred entries in a spreadsheet, but rather hundreds of thousands or more data points stored in massive databases. Netflix knows which of its roughly 6,000 shows are watched by its 170 million subscribers. Amazon tracks the more than one billion purchases made through its site each year. And geneticists have collected DNA from millions of people to discover hundreds of millions of genetic variants. Like Amazon and Netflix, geneticists are turning to machine learning to find patterns in their data.
Machine learning, a class of self-teaching artificial intelligence techniques, are increasingly popular because of their ability to tease out complex patterns from data that are often nearly impossible to find with methods of traditional statistics. One of the tasks at which machine learning excels is to label things. Apple wants to put names to the faces in your photo album. Netflix wants to know whether you’re likely to give a show a thumbs up or a thumbs down. And geneticists want to know whether a genetic variant in a patient’s gene is causing disease.
This problem of labeling genetic variants is one of the big challenges of medical genetics. It’s called the VUS problem — meaning “variants of uncertain significance.” When a patient gets a genetic test as part of a diagnostic workup — for cancer, a heart condition, or perhaps an unusual childhood disease — doctors want to know whether that patient carries a disease-causing genetic variant. Identifying pathogenic variants helps doctors make a diagnosis and pick the right treatment. But the challenge is that every one of us carries many genetic variants, most of them probably harmless. This means that, in many — if not most — cases, genetic test results come back with mutations that are labeled VUS. Such results are essentially useless — a cancer patient may have a mutation in a cancer-related gene, but if that mutation is a VUS, the doctor has no idea whether or not it plays a role in the patient’s disease.
To solve the VUS problem, geneticists and computational biologists are turning to machine learning. Ultimately, researchers aim to label all of the hundreds of millions of known (and many more as yet unknown) human genetic variants, classifying them as either benign or potentially pathogenic. Tackling all variants in one giant machine learning model isn’t realistic however, since these variants can fall in any of tens of thousands of genes and hundreds of thousands of regulatory sequences that control those genes, each of which functions in its own specific way. A model like that would be like building a program that not only helps Major League Baseball scouts pick out promising players, but also predicts the performance of any potential draft pick in any sport, from basketball to cricket.
To solve the VUS problem, geneticists and computational biologists break the problem down into small tasks. Here are some of the most important strategies they use.
Machine Learning Should Be Deep Learning
Researchers who study genetic variants are putting much of their machine learning efforts into a set of methods called “deep learning.” These methods are called “deep” because they consist of multiple layers of computational units, called neurons, that are connected together into so-called neural networks. Like biological neurons, computational neurons work by receiving an input, processing it, and then passing the output along to the next layer of neurons.
One big advantage of deep neural network models is that they automate the process of picking out the features of the data that should go into the model, something which not all machine learning algorithms do. For example, Netflix hires teams of so-called taggers to watch movies and tag them with descriptive terms like “visually striking,” “true bromance,” and “cerebral TV drama.” Those descriptive terms are put into the models that make personalized recommendations to subscribers. But tagging segments of DNA is hard — what’s the genetic equivalent of “true bromance”?
This is where the power of deep learning comes in. Thanks to their layered architecture, deep learning models mine the DNA sequence of A’s, T’s, C’s, and G’s, and find the relevant features on their own. This is the equivalent of having a computer watch a movie and come up with its own tags. That’s hard to do with a movie, but much easier with the long lines of text that represent a DNA sequence.
Solve Hard Problems by Learning Easier Tasks
Determining whether a genetic variant drives a patient’s cancer is a hard problem. There is a long chain of events between a cancer-causing mutation and full-blown disease. An easier problem is predicting whether a genetic variant has a more immediate impact on the molecular function of a segment of DNA. If a variant has no impact on a molecular function, then that variant probably isn’t causing disease.
Scientists have built several deep learning models that predict whether a variant affects the so-called epigenetic state of segments of DNA. The epigenetic state of a segment of DNA indicates whether that segment is active or inactive in a particular cell type. An epigenetic-state-altering mutation might switch on a gene that is supposed to be off, which could, for example, lead to the uncontrolled cell division of a growing tumor.
One of the models that make such predictions is called DeepSEA, developed by Princeton University scientists Jian Zhou and Olga Troyanskaya. DeepSEA was trained on nearly 60,000 genetic variants, together with publicly available databases of epigenetic data that were obtained from measurements made in cultured human cells. Given a DNA sequence, the model predicts which variants alter epigenetic states with fair accuracy. While these predictions don’t tell you whether a variant causes disease, models like DeepSEA are a way for geneticists to prioritize which variants to study in more depth.
Make Your Own Data
One of the biggest challenges of the VUS problem is that all humans are special — that is, each of us carries unique genetic variants that have never been seen before. A 2016 study looked for genetic variants in the genomes of over 60,000 people and found that more than half of all variants are unique to the person who carries them. This poses a huge problem for genetic testing — most patients will come in with genetic variants about which we know nothing. A major task for machine learning is to correctly classify these previously unseen variants.
One way that researchers are solving this problem is to just make all of the as-yet unseen variants in the lab. Because DNA synthesis is now relatively inexpensive, scientists can synthesize all possible mutations of a gene and test the effect of each one on that gene’s function. This method, called “deep mutational scanning,” promises to deliver useful data sets to computational biologists. In 2017, one group of scientists predicted that deep mutational scanning data will eventually be available for all genes and regulatory elements in the human genome. While this field is still new, the data produced over the next few years will be a valuable resource for computational biologists building deep learning models.
While most deep learning models of genetic variants are still built by academic labs, a few companies are applying deep learning models to develop new therapies. Deep Genomics, founded by University of Toronto professor Brendan Frey, is seeking to use deep learning models to develop drugs that target specific genetic variants. Because of genetic variants, different patients with the same disease often don’t benefit from the same drug, but it is not always clear which variants cause patients to respond differently. Deep Genomics is using machine learning to pick out the relevant variants and develop drugs to target them. The ultimate goal is to have therapies that are more effective, because they are tailored to a patient’s genetic make-up.
Like nearly every type of business and research field today, geneticists have produced data sets that are too complex and too large to be analyzed by human intuition or even traditional statistics. Much of the future of medicine depends on understanding the human genome — and that understanding increasingly depends on machine learning.