Reconstructing cavemen from bits of fossil DNA is still the stuff of science fiction. But thanks to high-powered computing wizardry, we now have the blueprints you’d need to do it. An international team of scientists published the first draft of the Neanderthal genome in the journal Science on 7 May. The study showed, somewhat surprisingly, that early humans and Neanderthals interbred and that 1 to 4 percent of the DNA in modern Asians and Europeans comes from Neanderthals.
The bulk of the credit for decoding the Neanderthal goes to high-throughput sequencing technologies developed in the past five years, which turned bits of ancient DNA into millions of short strings of letters. But sequencing the Neanderthal genome would have been impossible without the sophisticated software that put all those millions of strings together in the right order.
Both the sequencing and computing of whole genomes have come a long way in the nine years since scientists first delivered the DNA blueprints for humans. Then, sequencing machines read 330 000 base molecules per day. (DNA contains a series of four different base molecules: adenine, cytosine, guanine, and thymine, or A, C, G, and T.) The latest machines made by Illumina, in San Diego, can sequence about 2 billion bases a day, amounting to hundreds of terabytes of data to store and analyze.
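To put those throughput figures in perspective, here is a back-of-the-envelope calculation. It uses only the per-day rates quoted above and a genome of roughly 3 billion bases, and it assumes a single machine reading every base exactly once. Real projects read each base many times over, so treat the results as lower bounds. The function name is illustrative.

```python
# Figures quoted in the article; everything else is simple arithmetic.
OLD_BASES_PER_DAY = 330_000          # sequencing machines nine years earlier
NEW_BASES_PER_DAY = 2_000_000_000    # latest machines
GENOME_BASES = 3_000_000_000         # approximate size of the human genome


def days_for_one_pass(genome_bases: int, bases_per_day: int) -> float:
    """Days one machine needs to read every base once (no repeat coverage)."""
    return genome_bases / bases_per_day


speedup = NEW_BASES_PER_DAY / OLD_BASES_PER_DAY
old_days = days_for_one_pass(GENOME_BASES, OLD_BASES_PER_DAY)
new_days = days_for_one_pass(GENOME_BASES, NEW_BASES_PER_DAY)

print(f"speedup: about {speedup:,.0f}x")
print(f"one pass, old machine: about {old_days / 365:.0f} years")
print(f"one pass, new machine: {new_days:.1f} days")
```

On these assumptions, a job that would have tied up one older machine for decades fits into a day and a half.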
Assembling the human genome was like putting together a 100-piece jigsaw puzzle without any idea of what the picture should look like. The computers that were used to assemble the 3-billion-base human genome had to sort through about a hundred DNA fragments at a time, each made up of 2000 to 3000 bases. An assembly algorithm then compared these fragments, found where the end of one overlapped the start of another, and strung together longer and longer stretches of the sequence.
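The compare-overlap-merge idea can be sketched in a few lines. What follows is a toy greedy assembler, not the software used on the human genome: it assumes error-free fragments from a single strand, and the function names and the `min_len` threshold are invented for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps enough
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


fragments = ["ATGGCGT", "GCGTACA", "ACATTAG"]
print(greedy_assemble(fragments))
```

The all-against-all comparison is what makes this approach expensive: the work grows roughly with the square of the number of fragments, which is manageable for a hundred pieces but not for millions.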
The difference between now and then, Neanderthal and human, is both scale and complexity. DNA fragments from fossil Neanderthal bones are very short, having degraded to an average length of only 50 bases, and there are millions of them, not hundreds. Some of these are chemically damaged in such a way that one base has mutated into another. And even worse, more than 96 percent of the DNA sequences scientists obtained came not from Neanderthals but from microbes that have contaminated the bones. To continue with the puzzle analogy, you now have millions of very small pieces, some with faded color, of which only a few thousand belong to your puzzle.
“The challenge is to find a needle in a haystack,” says Janet Kelso, a bioinformatics researcher at the Max Planck Institute for Evolutionary Anthropology, in Leipzig, Germany, where much of the sequencing and data analysis was done.
Luckily, the completed human and chimpanzee genomes can serve as reference pictures for the Neanderthal puzzle, because modern humans and Neanderthals share about 99 percent of their genome. “The computational analysis of this is not terribly difficult in concept,” says Richard Green, a biomolecular engineering professor at the University of California, Santa Cruz, who led the computing efforts at Max Planck. “Take all of these sequences and align them against the human and chimp genome.” That is, try to locate millions of 50-odd-base sequences on the 3-billion-base human and chimp genomes.
This process, called mapping, might be conceptually simple, but in practice it’s difficult enough to take days of computing time, even on a machine with hundreds of processor cores at work. Mapping is a common feature of modern genetics research, but it can fail with fossil DNA because of base mutations. So the researchers have developed their own mapping program, called Anfo Short Read Aligner/Mapper.
Like conventional mappers, Anfo cuts up the reference genome (that of the humans and chimps) into small “words” and arranges them in an index that records where each word occurs. The roughly 50-base sequences from the fossils are then broken into 16-base words. The algorithm looks for these words in the index, assigns scores based on similarity, and finds the best matches.
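The index-and-lookup step can be illustrated with a minimal sketch. This is not Anfo's code: it assumes a single short reference string, exact word matches as seeds, and a plain mismatch count as the score, and it uses 4-base words so the example stays readable. All function names are hypothetical.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Record every position at which each k-base word occurs in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: dict[str, list[int]], k: int):
    """Return (start, mismatches) for the best placement of the read, or None."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        # index.get avoids adding empty entries for words that are absent
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


reference = "GATTACAGGCTTAACCGGTTACGATCG"
index = build_index(reference, k=4)
read = "CTTAACAGGT"  # differs from the reference at one base
print(map_read(read, reference, index, k=4))
```

Because the read only has to share one exact word with the reference to be considered, a single altered base does not stop it from being placed. Longer words make the lookup more selective but less tolerant of damage.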
The key difference between Anfo and regular mappers is that the new program takes into account knowledge scientists have gained over years of sequencing fossil DNA. For example, scientists have found that degraded DNA fragments have a high rate of C to T mutations at the beginning of a fragment and G to A mutations at the end. So Anfo considers these factors in determining whether a base is correct.
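One way to picture that adjustment is a scoring function that charges less for the expected kinds of damage. The sketch below is an assumption-laden illustration rather than Anfo's actual model: the penalty values, the three-base `edge` window, and the function name are all made up for the example.

```python
def damage_aware_penalty(read: str, ref_window: str, edge: int = 3,
                         mismatch: float = 1.0, damage: float = 0.2) -> float:
    """Sum mismatch penalties, discounting typical ancient-DNA damage.

    A reference C read as T near the start of the fragment, or a reference G
    read as A near the end, costs `damage` instead of the full `mismatch`.
    The numeric values are hypothetical, chosen only for illustration.
    """
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must be the same length")
    n = len(read)
    total = 0.0
    for i, (ref_base, read_base) in enumerate(zip(ref_window, read)):
        if ref_base == read_base:
            continue
        start_damage = i < edge and ref_base == "C" and read_base == "T"
        end_damage = i >= n - edge and ref_base == "G" and read_base == "A"
        total += damage if (start_damage or end_damage) else mismatch
    return total


ref_window = "CATTGCAGGAAG"
print(damage_aware_penalty("TATTGCAGGAAA", ref_window))  # damage at both ends
print(damage_aware_penalty("CATTGTAGGAAG", ref_window))  # same change mid-read
```

Under such a scheme, a fragment whose only differences from the reference look like typical damage still scores as a good match, while the same substitution in the middle of the fragment is treated as a genuine mismatch.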
With two dedicated Linux clusters totaling 500 processor cores, the program takes about a week to sort through the millions of fragments generated by the sequencing machine, decide which are Neanderthal and which are not, and then put them in order.
Anfo is a work in progress, with programmers adding upgrades regularly. But improvements to sequencing machines are doubling those devices’ outputs about every year. So coders will have to race to keep up.
This article originally appeared in print as "Computing the Caveman".