Deep Learning Picks Apart DNA Data-Copying Puzzles

Genetic transcription is a data problem, and AI is on the case

DNA copying malfunctions can cause disease and debilities, and now deep learning has helped crack the puzzle of how the copying process starts, stops, and runs astray.

Laguna Design/Science Source

DNA, as a data-storage medium, is useful only when read, copied, and sent out elsewhere. The medium for conveying genetic information out of a cell’s nucleus is RNA—transcribed from DNA, which itself never leaves the nucleus. Now, using deep learning, researchers at Northwestern University, in Evanston, Ill., have untangled a complex part of the RNA transcription process: how cells know when to stop copying.

In RNA transcription, knowing when to stop is crucial. The information coded into RNA is used throughout a cell to synthesize proteins and regulate a wide range of metabolic processes. Getting the right message to its intended target requires those RNA strands to say just as much as they need to—and nothing more. If they say too much or too little, as happens in diseases such as epilepsy and muscular dystrophy, then any of those metabolic processes can break down or malfunction to debilitating effect.

“This is a very useful prescreening tool for investigating genetic variants in a high-throughput manner.”
—Emily Kunce Stroup, Northwestern University

Halting the RNA copying process—called polyadenylation (polyA) for the string of adenine molecules it ties onto the end of a cut-off RNA strand—involves a range of proteins whose interactions have never been fully understood.

So to help unravel polyA, researchers Zhe Ji and Emily Kunce Stroup at Northwestern University developed a machine-learning model that can locate and identify polyA sites. It works by pairing convolutional neural networks (CNNs), trained to recognize important sequence patterns in the genetic code, with recurrent neural networks (RNNs), trained to interpret the CNNs’ outputs.
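To make that pairing concrete, here is a minimal sketch in PyTorch, assuming a one-hot-encoded DNA sequence as input; the class name, layer sizes, and kernel width are illustrative assumptions, not details of the published model.

import torch
import torch.nn as nn

class PolyATrunk(nn.Module):
    # Hypothetical trunk: a convolutional layer scans one-hot DNA for motifs,
    # then a bidirectional GRU reads those motif features along the sequence.
    def __init__(self, n_filters=64, kernel_size=8, hidden_size=32):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size, padding="same")
        self.act = nn.ReLU()
        self.rnn = nn.GRU(n_filters, hidden_size, batch_first=True,
                          bidirectional=True)

    def forward(self, x):                        # x: (batch, 4, length)
        features = self.act(self.conv(x))        # (batch, filters, length)
        features = features.transpose(1, 2)      # (batch, length, filters)
        outputs, _ = self.rnn(features)          # (batch, length, 2 * hidden)
        return outputs

# Quick check on one random 1,000-nucleotide one-hot sequence.
seq = torch.zeros(1, 4, 1000)
seq[0, torch.randint(0, 4, (1000,)), torch.arange(1000)] = 1.0
print(PolyATrunk()(seq).shape)                   # torch.Size([1, 1000, 64])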

Previous models had taken a similar approach, using both CNNs and RNNs. The Northwestern researchers went a step further, feeding the CNN/RNN model’s outputs into two additional deep-learning models trained to locate and identify polyA sites in the genome.

The two additional models seem to have helped. “Having those tandem outputs is the really unique thing from our work,” says Stroup. “Having the model go outwards to two separate output branches that we then combine to identify sites at high resolution is what distinguishes us from existing work.”
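In the spirit of that two-branch description, the trunk sketched above could feed two output heads, one scoring each nucleotide as a possible cut site and one scoring whether a polyA signal lies nearby, with the two scores then combined. The head names, sizes, and combination rule here are assumptions for illustration, not the published architecture; the code builds on the PolyATrunk class from the previous sketch.

import torch
import torch.nn as nn

class PolyATwoHead(nn.Module):
    # Hypothetical two-branch head; uses the PolyATrunk class defined above.
    def __init__(self, trunk, trunk_dim=64):
        super().__init__()
        self.trunk = trunk
        self.site_head = nn.Linear(trunk_dim, 1)    # "is this the cut site?"
        self.region_head = nn.Linear(trunk_dim, 1)  # "is a polyA signal near?"

    def forward(self, x):
        features = self.trunk(x)                         # (batch, length, dim)
        site = torch.sigmoid(self.site_head(features))   # per-nucleotide score
        region = torch.sigmoid(self.region_head(features))
        # Combine the branches: call a site only where both scores are high.
        return (site * region).squeeze(-1)               # (batch, length)

model = PolyATwoHead(PolyATrunk())
seq = torch.zeros(1, 4, 1000)
seq[0, torch.randint(0, 4, (1000,)), torch.arange(1000)] = 1.0
print(model(seq).shape)                                  # torch.Size([1, 1000])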

From their model, the researchers learned a few important things about what makes polyA go well or poorly. The CNN part of the model learned genetic patterns in DNA known to attract the proteins controlling polyA, while the RNN part revealed that reliably cutting off transcription requires careful spacing between those patterns. The researchers could draw such precise conclusions because the model works at per-nucleotide resolution. “It’s striking that our model can precisely capture this,” says Ji.

Moving forward, the team says it plans to apply the model and similar techniques to identify key genetic mutations that may cause disease, and then to use those findings to develop a pipeline of more targeted therapeutic drugs. “This is a very useful prescreening tool for investigating genetic variants in a high-throughput manner,” says Stroup. “This will hopefully help whittle down the number of candidate mutations to make the process more efficient.”
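A prescreening workflow of that kind could, in rough outline, score a reference sequence and a mutated copy with the same model and rank variants by how much the predicted polyA profile shifts. The sketch below reuses the (untrained) PolyATwoHead model from above on a toy 100-nucleotide sequence; the helper names and the scoring rule are illustrative assumptions, not the team’s pipeline.

import torch
# Reuses PolyATrunk and PolyATwoHead from the sketches above.

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def onehot(seq):
    # Encode a DNA string as a (1, 4, length) one-hot tensor.
    x = torch.zeros(1, 4, len(seq))
    for pos, base in enumerate(seq):
        x[0, BASES[base], pos] = 1.0
    return x

def variant_effect(model, reference, variant):
    # Largest per-nucleotide change in the predicted polyA score.
    with torch.no_grad():
        diff = model(onehot(reference)) - model(onehot(variant))
    return diff.abs().max().item()

# Toy example: a single-base change inside the canonical AATAAA polyA signal.
reference = "AATAAA" + "GC" * 47
variant = "AATGAA" + "GC" * 47
model = PolyATwoHead(PolyATrunk())       # untrained; scores are illustrative
print(variant_effect(model, reference, variant))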

Stroup says the team also plans to repeat the research in other organisms to see how RNA transcription differs between animals. They hope, she says, to use that knowledge to help rein in polyA when the process goes awry—as in epilepsy and muscular dystrophy—and causes real harm.

The researchers published their paper in the journal Nature.
