Computing the Neanderthal Genome

Reconstructing cavemen from bits of fossil DNA is still the stuff of science fiction. But thanks to high-powered computing wizardry, we now have the blueprints you’d need to do it. An international team of scientists published the first draft of the Neanderthal genome in the journal Science on 7 May. The study showed, somewhat surprisingly, that early humans and Neanderthals interbred and that 1 to 4 percent of the DNA in modern Asians and Europeans comes from Neanderthals.

The bulk of the credit for decoding the Neanderthal genome goes to high-throughput sequencing technologies developed in the past five years, which turned bits of ancient DNA into millions of short strings of letters. But sequencing the Neanderthal genome would have been impossible without the sophisticated software that put all those millions of strings together in the right order.

Both the sequencing and computing of whole genomes have come a long way in the nine years since scientists first delivered the DNA blueprints for humans. Then, sequencing machines read 330 000 base molecules per day. (DNA contains a series of four different base molecules: adenine, cytosine, guanine, and thymine, or A, C, G, and T.) The latest machines made by Illumina, in San Diego, can sequence about 2 billion bases a day, amounting to hundreds of terabytes of data to store and analyze.

Assembling the human genome was like putting together a 100-piece jigsaw puzzle without any idea of what the picture should look like. The computers that were used to sequence the 3-billion-base human genome had to sort through about a hundred DNA fragments at a time, each made up of 2000 to 3000 bases. An assembly algorithm then compared these fragments, found overlaps, and strung together longer and longer stretches of the sequence.
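The compare, overlap, and merge loop described above can be illustrated with a toy greedy assembler. This is only a minimal sketch: the function names, the `min_len` threshold, and the example fragments are invented for illustration, and the software actually used on the human genome was far more elaborate (it had to cope with sequencing errors and repeated sequences, which this sketch ignores).

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical fragments that overlap by four bases each.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```

Comparing every fragment with every other one is affordable for a hundred pieces at a time, but the cost grows with the square of the number of fragments, which hints at why millions of short reads call for a different strategy.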

The difference between now and then, Neanderthal and human, is both scale and complexity. DNA fragments from fossil Neanderthal bones are very short, having degraded to an average length of only 50 bases, and there are millions of them, not hundreds. Some of these are chemically damaged in such a way that one base has mutated into another. And even worse, more than 96 percent of the DNA sequences scientists obtained came not from Neanderthals but from microbes that have contaminated the bones. To continue with the puzzle analogy, you now have millions of very small pieces, some with faded color, of which only a few thousand belong to your puzzle.

“The challenge is to find a needle in a haystack,” says Janet Kelso, a bioinformatics researcher at the Max Planck Institute for Evolutionary Anthropology, in Leipzig, Germany, where much of the sequencing and data analysis was done.

Luckily, the completed human and chimpanzee genomes can serve as reference pictures for the Neanderthal puzzle: modern humans and Neanderthals share about 99 percent of their genome. “The computational analysis of this is not terribly difficult in concept,” says Richard Green, a biomolecular engineering professor at the University of California, Santa Cruz, who led the computing efforts at Max Planck. “Take all of these sequences and align them against the human and chimp genome.” That is, try to locate millions of 50-odd-base sequences on the 3-billion-base human and chimp genomes.

This process, called mapping, might be conceptually simple, but in practice it’s difficult enough to take days of computing time, even on a machine with hundreds of processor cores at work. Mapping is a common feature of modern genetics research, but it can fail with fossil DNA because of base mutations. So the researchers have developed their own mapping program, called Anfo Short Read Aligner/Mapper.

Like conventional mappers, Anfo cuts up the reference genomes (those of humans and chimps) into small “words” and arranges them in an index. The 50-odd-base sequences from the fossils are then broken into 16-base words. The algorithm looks for these words in the index, assigns scores based on similarity, and finds the best matches.
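The index-and-lookup idea can be sketched in a few lines. This is a toy illustration of word-based mapping in general, not Anfo's actual code: the function names, the 4-base word length, the mismatch-count score, and the made-up reference string are all assumptions chosen to keep the example small.

```python
from collections import Counter, defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-base word in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: dict[str, list[int]], k: int):
    """Return (best_start, mismatches) for the read, or None if no word matches."""
    # Every word found in the index votes for the place the read would start.
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                votes[start] += 1
    if not votes:
        return None

    def mismatches(start: int) -> int:
        segment = reference[start:start + len(read)]
        return sum(r != g for r, g in zip(read, segment))

    # Prefer the fewest mismatches, then the most word hits, then the leftmost.
    best = min(votes, key=lambda s: (mismatches(s), -votes[s], s))
    return best, mismatches(best)


# Made-up reference; the read is copied from position 10 with one base changed.
reference = "ACGTTGCATGCCGATAGCTAGGCTTAACG"
index = build_index(reference, k=4)
read = reference[10:15] + "T" + reference[16:22]
print(map_read(read, reference, index, k=4))  # (10, 1)
```

Because only words that appear in the index are examined, the read never has to be compared against every position in the reference, and a read with a changed base can still be placed as long as some of its words survive intact.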

The key difference between Anfo and regular mappers is that the new program takes into account knowledge scientists have gained over years of sequencing fossil DNA. For example, scientists have found that degraded DNA fragments have a high rate of C to T mutations at the beginning of a fragment and G to A mutations at the end. So Anfo considers these factors in determining whether a base is correct.
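One simple way to picture this is a scoring function that charges less for the mismatches that fossil damage tends to produce. The sketch below is an assumption-laden illustration, not Anfo's actual scoring model: the function name, the three-base window at each end, and the penalty values are hypothetical.

```python
def damage_aware_score(read: str, ref_segment: str, end_window: int = 3,
                       mismatch_penalty: float = 1.0,
                       damage_penalty: float = 0.2) -> float:
    """Sum of mismatch penalties between a read and a reference segment.

    Lower is better. A reference C read as T near the start of the read, or a
    reference G read as A near its end, is treated as likely damage and
    charged `damage_penalty` instead of the full `mismatch_penalty`.
    """
    if len(read) != len(ref_segment):
        raise ValueError("read and reference segment must be the same length")
    score = 0.0
    n = len(read)
    for i, (r, g) in enumerate(zip(read, ref_segment)):
        if r == g:
            continue
        near_start = i < end_window
        near_end = i >= n - end_window
        likely_damage = ((near_start and g == "C" and r == "T")
                         or (near_end and g == "G" and r == "A"))
        score += damage_penalty if likely_damage else mismatch_penalty
    return score
```

With these hypothetical settings, a read that differs from the reference only by a C-to-T change at its first base and a G-to-A change at its last base scores 0.4, while the same two changes in the middle of the read score 2.0, so the damaged but genuine read is less likely to be thrown out.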

With two dedicated Linux clusters totaling 500 processor cores, the program takes about a week to sort through the millions of fragments generated by the sequencing machine, decide which are Neanderthal and which are not, and then put them in order.

Anfo is a work in progress, with programmers adding upgrades regularly. But improvements to sequencing machines are doubling those devices’ outputs about every year. So coders will have to race to keep up.

This article originally appeared in print as "Computing the Caveman".
