Can Computing Keep Up With the Neuroscience Data Deluge?

When an imaging run generates 1 terabyte of data, analysis becomes the problem

Image: Jeremy Freeman et al

Today's neuroscientists have some magnificent tools at their disposal. They can, for example, examine the entire brain of a live zebrafish larva and record the activation patterns of nearly all of its 100,000 neurons in a process that takes only 1.5 seconds. The only problem: One such imaging run yields about 1 terabyte of data, making analysis the real bottleneck as researchers seek to understand the brain.

To address this issue, scientists at Janelia Farm Research Campus have come up with a set of analytical tools designed for neuroscience and built on a distributed computing platform called Apache Spark. In their paper in Nature Methods, they demonstrate their system's capabilities by making sense of several enormous data sets. (The image above shows the whole-brain neural activity of a zebrafish larva when it was exposed to a moving visual stimulus; the different colors indicate which neurons activated in response to a movement to the left or right.)

The researchers argue that the Apache Spark platform offers an improvement over a more popular distributed computing model known as Hadoop MapReduce, which was originally based on Google's search engine technology. Here's how Spectrum described these conventional systems in an article on "DNA and the Data Deluge":

While Hadoop and MapReduce are simple by design, their ability to coordinate the activity of many computers makes them powerful. Essentially, they divide a large computational task into small pieces that are distributed to many computers across the network. Those computers perform their jobs (the “map” step), and then communicate with each other to aggregate the results (the “reduce” step). This process can be repeated many times over, and the repetition of computation and aggregation steps quickly produces results.
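To make those two steps concrete, here is a toy, single-machine sketch of the pattern in plain Python, using the classic word-count example. Hadoop's contribution is running these same two steps across many machines in parallel; nothing here is Hadoop itself.

```python
# Toy illustration of the map/reduce pattern on one machine;
# Hadoop distributes the same two steps across many computers.
from functools import reduce

# Pretend each chunk lives on a different machine in the cluster.
chunks = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "the dog barks",
]

# "Map" step: each chunk is processed independently,
# emitting (word, 1) pairs.
mapped = [[(word, 1) for word in chunk.split()] for chunk in chunks]

# "Reduce" step: results from all workers are aggregated.
def merge(counts, pairs):
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

totals = reduce(merge, mapped, {})
print(totals["the"])  # 3
```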

But the Janelia Farm researchers note that with MapReduce, data has to be loaded from disk for each operation. The Apache Spark advantage lies in its ability to cache data sets and intermediate results in the memory of many computers across the network, allowing for much faster iterative computations. This caching is particularly useful for neural data, which can be analyzed in many different ways, each offering a new view into the brain's structure and function.
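A minimal PySpark sketch of that idea follows, assuming a local Spark installation with the pyspark package. The records are synthetic stand-ins for per-neuron time series, not the actual zebrafish data:

```python
# Minimal PySpark sketch of in-memory caching for iterative analysis.
from pyspark import SparkContext

sc = SparkContext("local[*]", "caching-demo")

# Pretend each record is (neuron_id, [fluorescence time series]).
records = sc.parallelize(
    [(i, [float(i + t) for t in range(100)]) for i in range(10000)]
)

# Cache once in cluster memory; every analysis below reuses the
# cached copy instead of re-reading from disk, which is Spark's
# key advantage over MapReduce for iterative workloads.
records.cache()

# Analysis 1: per-neuron mean activity.
means = records.mapValues(lambda ts: sum(ts) / len(ts)).collect()

# Analysis 2: per-neuron peak activity, over the same cached data.
peaks = records.mapValues(max).collect()
```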

The researchers have made their library of analytic tools, which they call Thunder, available to the neuroscience community at large. With U.S. government money pouring into neuroscience research for the new BRAIN Initiative, which emphasizes recording from the brain in unprecedented detail, this computing advance comes just in the nick of time.
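For flavor, an analysis with Thunder might look roughly like the sketch below. The calls are an assumption based on Thunder's documented series abstraction, and the library's API has changed across versions, so treat this as illustrative rather than a verified recipe:

```python
# Hypothetical sketch: assumes a Thunder-style "series" API
# (names and signatures may differ across Thunder versions).
import numpy as np
import thunder as td

# Fake data standing in for a recording: 1,000 neurons x 200 timepoints.
activity = np.random.randn(1000, 200)

# Wrap the array as a distributed series of per-neuron traces.
series = td.series.fromarray(activity)

# Standardize each trace, then summarize; on a Spark backend these
# operations run over the cluster's cached copies of the data.
zscored = series.zscore()
mean_trace = zscored.mean().toarray()
```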
