Rethinking Databases for an Avalanche of Genetic Data

In a lab on the ground floor of deCODE Genetics’ building in Reykjavik, Iceland, robots quietly go about chip-typing, or “SNP-typing,” the latest of 155,000 or so Icelanders. SNPs are single-nucleotide polymorphisms: small variations in the genetic code that can underlie disease or health, or the presence or absence of a particular condition. A floor up, Illumina HiSeq X machines, worth millions of dollars each, take a few days to come up with a complete human genome. They’ve done so for about 3,600 Icelanders at latest count.

All that sequencing means an enormous amount of data. The As, Ts, Cs, and Gs add up, and of course, the idea here is to actually make use of those letters. Across the building from the sequencing labs, I spoke with Hakon Gudbjartsson, deCODE’s VP for informatics, on the challenges and methods for dealing with mountains of data. Each individual person sequenced accounts for around 100 gigabytes of data, and it’s data, he says, “that requires a lot of organization.”

The primary means to organize the genetic information is a database the company developed that is known as GOR, or genomically ordered relations. Traditional databases, like Oracle or MySQL, organize data in tables that don’t quite make sense for genetic information; Gudbjartsson says using such methods creates bottlenecks when trying to retrieve the data. The GOR database organizes genetic code according to a “reference build,” essentially placing the data in sequential order.

“It’s a database that organizes the downstream data according to the position in the genome,” Gudbjartsson says. All the specific variations observed also fit in based on their physical place. “Whether it’s a SNP or… a copy number variation, anything. All the tables are basically ordered according to the genome.”

That ordering allows the design of algorithms to query the information in a much more efficient fashion. Researchers can even “stream” the genomic information, instead of calling up one specific spot.
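To make the idea concrete, here is a minimal Python sketch of why position-ordered tables are cheap to query and to stream. It is not deCODE's GOR implementation, whose internals the article does not describe; the `Row` type, its fields, the sample identifiers, and the functions `range_query` and `stream_merged` are all hypothetical names chosen for illustration. The sketch assumes each table is already sorted by chromosome and position.

```python
import bisect
from heapq import merge
from typing import Iterator, NamedTuple


class Row(NamedTuple):
    """One observation in a position-ordered table (hypothetical schema)."""
    chrom: int   # chromosome number (simplified to integers for this sketch)
    pos: int     # position on the reference build
    kind: str    # e.g. "SNP" or "CNV"
    sample: str  # made-up sample identifier


def range_query(rows: list[Row], chrom: int, start: int, end: int) -> list[Row]:
    """Return rows on `chrom` with start <= pos <= end.

    Because `rows` is sorted by (chrom, pos), two binary searches locate
    the slice; no full-table scan is needed.
    """
    lo = bisect.bisect_left(rows, (chrom, start))
    hi = bisect.bisect_left(rows, (chrom, end + 1))
    return rows[lo:hi]


def stream_merged(*tables: list[Row]) -> Iterator[Row]:
    """Lazily merge several position-ordered tables into one ordered stream."""
    return merge(*tables, key=lambda r: (r.chrom, r.pos))


# Two small, already-ordered tables with invented values.
snps = sorted([Row(1, 100, "SNP", "A"), Row(1, 250, "SNP", "B"), Row(2, 50, "SNP", "A")])
cnvs = sorted([Row(1, 200, "CNV", "A"), Row(2, 10, "CNV", "B")])

if __name__ == "__main__":
    print(range_query(snps, 1, 90, 250))   # the two SNP rows on chromosome 1
    for row in stream_merged(snps, cnvs):  # rows arrive in genome order
        print(row)
```

The merge step is the "streaming" part: because every table shares the same ordering, rows from different tables can be combined one at a time without loading or re-sorting either table.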

The goal at deCODE (now owned by Amgen) is to take this impressive collection of genetic data and match it up with rich clinical and genealogical data as well. Iceland is home to only 320,000 people or so, and all Icelanders can trace their lineages to 1650, and often further back. Along with good recent medical records, that means the ability to take a genotype and match it up with a phenotype. The company has produced some impressive results, perhaps most famously the 2012 discovery of a gene variant that confers almost complete protection against development of Alzheimer’s disease.

This rich data set has created some controversy. Some critics have expressed discomfort with aggressive DNA collection methods (all participants do sign consent forms) and the apparent ability to make “data inferences” based on available data and those rich genealogical and clinical records. (Essentially, even if an individual doesn’t give consent for sequencing, but enough others do, the close genetic connections in Iceland could allow the researchers to fill in the gaps.) However, Gudbjartsson points out that everyone at deCODE signs an agreement to never actually use such inferences.

Gudbjartsson says that GOR can already run on several hundred computers simultaneously. “GOR should be elastic,” he adds, noting that “we foresee growth.” It is already a challenge to efficiently transpose the full sequencing data from the 3,600 or so completed genomes onto the chip data for the 155,000 Icelanders in the database (a way to find common variations, and match with phenotypes). But sequencing will inevitably get even faster and cheaper. The full genome for all Icelanders, or other populations around the world, will present an even greater challenge for data manipulation.
