In 2005, next-generation sequencing began to change the field of genetics research. Obtaining a person’s entire genome became fast and relatively cheap. Databases of genetic information were growing by the terabyte, and doctors and researchers were in desperate need of a way to efficiently sift through the information for the cause of a particular disorder or for clues to how patients might respond to treatment.
Companies have sprung up over the past five years that are vying to produce the first DNA search engine. All of them have different tactics—some even have their own proprietary databases of genetic information—but most are working to link enough genetic databases so that users can quickly identify a huge variety of mutations. Most companies also craft search algorithms to supplement the genetic information with relevant biomedical literature. But as in the days of the early Web, before Google reigned supreme, a single company has yet to emerge as the clear winner.
Making a functional search engine is a classic big-data problem, says Michael Gonzalez, the vice president of bioinformatics at one such company, ViaGenetics, which was expected to relaunch its platform in March. Before doctors or researchers can use genomic data, it must be organized so that humans can read and search it. The first step is to put it in a standard form called the variant call format, or VCF. As raw data, a person’s complete sequenced genome would take up about 100 gigabytes, so a database that adds the genomes of even 10 patients per day would grow by about a terabyte daily. VCF files are more compact, requiring only a few hundred megabytes per genome, which helps researchers find the specific variants they want in a fraction of the time. Unlike a fully sequenced genome, a VCF file records only where a person’s genetic data deviates from the standard: the reference genome originally compiled by the Human Genome Project in 2001.
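To make the idea concrete, here is a minimal sketch in Python of how variant records in a VCF-style file might be read and indexed by position for fast lookup. The sample records are invented for illustration, the function names `parse_vcf` and `build_index` are hypothetical, and nothing here reflects how ViaGenetics or any other company actually stores its data.

```python
import io
from collections import defaultdict

# A tiny, made-up VCF excerpt (illustrative only, not real patient data).
# Header lines start with "#"; data columns are tab-separated.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t12345\t.\tA\tG\t50\tPASS\t.
1\t67890\t.\tC\tT\t99\tPASS\t.
2\t555\t.\tG\tGA\t30\tPASS\t.
"""


def parse_vcf(handle):
    """Yield (chrom, pos, ref, alt) for each data line, skipping headers."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # The ALT column may list several comma-separated alleles.
        for allele in alt.split(","):
            yield chrom, int(pos), ref, allele


def build_index(records):
    """Map (chrom, pos) to the list of (ref, alt) changes seen there."""
    index = defaultdict(list)
    for chrom, pos, ref, alt in records:
        index[(chrom, pos)].append((ref, alt))
    return index


index = build_index(parse_vcf(io.StringIO(SAMPLE_VCF)))
print(index.get(("1", 67890)))  # variants recorded at chromosome 1, position 67890
print(index.get(("1", 1)))      # no deviation from the reference recorded here
```

Because only the differences from the reference are stored, a lookup like this touches a small table of variants rather than the full sequence.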
With VCF, sifting the genomes themselves for pinpoint mutations isn’t the challenge for search engine companies. Most of these companies are allocating their resources toward efforts to seamlessly compile supplementary information about a specific mutation from other databases across the Web, such as the biomedical research archive PubMed or various troves of electronic medical records. Many of these tools have finely tuned algorithms that prioritize the results by credibility or relevance. “You want to be able to pull together the information known about a mutation in that position [of the genome] and quickly make an assessment,” says David Mittelman, the chief scientific officer for Tute Genomics, based in Provo, Utah, another company designing a genetic-search engine.
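The article does not describe any company’s actual ranking method, but the general idea of pulling together what is known about one genomic position and ordering it by credibility can be sketched roughly as follows. Everything in this Python example is an assumption made for illustration: the `Annotation` fields, the source labels, the credibility scores, and the `rank_annotations` function.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    source: str         # hypothetical label for where the record came from
    summary: str        # short description of the finding
    credibility: float  # made-up score between 0 and 1


def rank_annotations(sources, position):
    """Collect every annotation for one position and sort best-first."""
    hits = []
    for records in sources:
        hits.extend(records.get(position, []))
    return sorted(hits, key=lambda a: a.credibility, reverse=True)


# Two invented sources, each keyed by (chromosome, position).
literature = {("1", 67890): [Annotation("literature", "single case report", 0.4)]}
clinical = {("1", 67890): [Annotation("clinical", "expert-reviewed assertion", 0.9)]}

ranked = rank_annotations([literature, clinical], ("1", 67890))
print([a.source for a in ranked])
```

A real system would need far more than a single score, but the sketch shows why a shared position key is what lets separate databases be combined at all.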
In an effort to expand the information that can be attached to a genome under examination, ViaGenetics, based in Miami Beach, Fla., is making its newly updated platform useful for researchers who want to collaborate across institutions. With ViaGenetics’ tools, researchers “can make their data available to other users, so other people can come across these projects, request access, and form a collaboration,” Gonzalez says. “It helps people connect the dots between different researchers and institutions.” This is especially helpful for smaller labs that may not have very extensive genome databases or for researchers from different universities working to decode the same mutation.
Although the genomic-search industry is now focused on serving scientists, that might not always be the case. Mittelman envisions that Tute Genomics could eventually serve consumers directly. People are already demanding information about their genomes just to understand themselves better, Mittelman says, but most companies don’t yet consider the average person to be their primary customer. In order to make that shift, the tool will have to be even more intuitive and user-friendly. “Fire-hosing someone with data that’s not easy to interpret, or using terminology that’s not standardized, has the potential to confuse people,” he says. Privacy is also a major concern for the average user; the information that Tute users upload isn’t stored permanently, Mittelman says, but users will need extra reassurance if the platform becomes available to the lay public.
And a further evolution of the industry is in the offing. Both ViaGenetics and Tute are hoping to be able to run the entire process in-house—from the initial DNA sequencing to the presentation of final searchable results to users. “The market for analyzing and interpreting genomic data is very fragmented, like the computer industry in the 1990s, where you had to go to separate providers to buy a video card or a motherboard and then try to put it together,” Mittelman says. “Soon this field will consolidate, as the computer industry did.”
This article originally appeared in print as “A Google for DNA.”
About the Author
Alex Ossola is a New York City-based journalist who writes frequently for Popular Science and Motherboard. Ossola first heard about the highly competitive fight to devise the best DNA search engine in passing while at a New Year’s Eve party. It wasn’t the usual way she gets story tips, but she knew immediately that she wanted to pursue it. “When something like that falls into your lap, you pull at the thread and find something really cool at the end,” Ossola says.