In the age of Big Data, it turns out that the largest, fastest-growing data source lies within your cells.
Quantitative biologists at the University of Illinois Urbana-Champaign and Cold Spring Harbor Laboratory, in New York, found that genomics reigns as champion over three of the biggest data domains around: astronomy, Twitter, and YouTube.
The scientists determined which domain would expand the fastest by comparing the acquisition, storage, distribution, and analysis of each kind of data. Genome size is measured in base pairs, the chemical units that make up DNA. Genomics trumps the other data generators because the genome sequencing rate doubles every seven months. If that pace holds, more than 1 exabase, or one billion billion bases, will be sequenced and stored per year by 2020. By 2025, the researchers estimate, the rate will approach one zettabase, or one trillion billion bases, of sequence per year.
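The doubling arithmetic behind those projections can be sketched in a few lines of Python. This is only an illustration of steady exponential growth, not the researchers' own model: the function names are hypothetical, and the 1-exabase starting rate is simply the article's 2020 figure taken as an assumed baseline.

```python
# Illustrative sketch only: extrapolates a rate that doubles at a fixed interval.
# Function names are hypothetical; this is not the study's actual model.

DOUBLING_MONTHS = 7      # doubling time reported in the article
EXABASE = 10**18         # one billion billion bases
ZETTABASE = 10**21       # one trillion billion bases


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Factor by which the rate grows after `months` of steady doubling."""
    return 2 ** (months / doubling_months)


def projected_rate(start_rate, months, doubling_months=DOUBLING_MONTHS):
    """Rate after `months`, assuming the doubling time never changes."""
    return start_rate * growth_factor(months, doubling_months)


if __name__ == "__main__":
    # Assumed baseline: 1 exabase per year in 2020, projected 60 months ahead.
    rate_2025 = projected_rate(1 * EXABASE, 60)
    print(f"Growth over five years: about {growth_factor(60):.0f}-fold")
    print(f"Projected 2025 rate: {rate_2025 / ZETTABASE:.2f} zettabases per year")
```

Five years holds between eight and nine seven-month doublings, which amounts to a several-hundred-fold increase. That is why the estimate climbs from the exabase range toward the zettabase range, and any slowdown in the doubling time would change the result considerably.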
Ninety percent of the genome data analyzed in the study was human. The scientists estimate that 100 million to 2 billion human genomes will be sequenced by 2025. That’s growth of four to five orders of magnitude in ten years, far more than the growth projected for the other three data generators they studied.
“For human genomics, which is the biggest driver of the whole field, the hope is that by sequencing many, many individuals, that knowledge will be obtained to help predict and cure a variety of diseases,” says co-author Gene Robinson of the University of Illinois Urbana-Champaign. Before they can be useful for medicine, genomes must be coupled with other genomic data sets, including tissue information.
One reason the rate is doubling so quickly is that scientists have begun sequencing individual cells. In cancer research, single-cell genome sequencing can reveal mutated sequences and aid in diagnosis. Because a patient may have many individual cells sequenced, the total could end up exceeding 7 billion genomes.
That “is more than the population of the Earth,” says Michael Schatz, associate professor at Cold Spring Harbor Laboratory, in New York. “What does it mean to have more genomes than people on the planet?”
What it means is that a mountain of information must be collected, filed, and analyzed.
“Other disciplines have been really successful at these scales, like YouTube,” says Schatz. Today, YouTube users upload 300 hours of video every minute, and the researchers expect that rate to grow to as much as 1,700 hours per minute, or 2 exabytes of video data per year, by 2025. Google built a seamless data infrastructure for YouTube, providing fast Internet connections, vast hard drive space, algorithms that optimize results, and a team of experienced researchers.
“We need that investment in genomics in order to understand your diseases, what kinds of treatments to apply, or answer questions about ancestry,” Schatz says. “By sequencing hundreds of millions of people, we can look through the pattern. We can get a sense of global community, and how incredibly connected we really are.”