In a paper published yesterday in Nature Biotechnology, the researchers say their error-correcting software boosts the accuracy of single-molecule sequencing results to an impressive 99.9 percent. That figure is for "de novo" sequencing, a tough task because scientists taking their first look at a species' genome have no prior results to compare against.
In single-molecule sequencing, the method championed by the company Pacific Biosciences, a single molecule of DNA is diced up into pieces and sequenced only once. The more standard "second generation" sequencing methods require the DNA molecule to be cloned thousands or millions of times over.
Single-molecule sequencing is an elegant procedure, and it has the advantage of producing longer "reads"—in other words, longer sequences of the paired nucleotides that encode a species' biological instructions.
"It's like putting together a jigsaw puzzle," explains study coauthor Michael Schatz, who runs a quantitative biology lab at Cold Spring Harbor Laboratory. "If the puzzle's tiles are bigger then you have more information, and you can do a better job with assembly."
However, single-molecule sequencing also has a big disadvantage: Because a DNA molecule is sequenced only once, the accuracy is low. Pacific Biosciences' top machine produces reads that are about 83 percent accurate. In second-generation sequencing, the millions of identical DNA strands are diced up, sequenced, and then checked against each other. Such methods produce very accurate results, with few errors in the sequence of nucleotides. But those accurate results are contained in many small reads that are difficult to assemble correctly into a continuous genome.
Schatz and his colleagues came up with a hybrid process that uses data from both a single-molecule sequencer and a second-generation machine; he says it "combines the virtues of the two technologies." Essentially, their software compares the long, error-riddled reads from the single-molecule sequencer to many copies of the short fragments from a second-generation machine. It uses a very careful string-matching algorithm to line up the short reads to the long reads, tolerating a significant number of errors in the long reads, including nucleotide insertions, deletions, and substitutions.
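To make the idea concrete, here is a minimal Python sketch of hybrid correction. It is not the researchers' software: the function names, the 25 percent error tolerance, and the toy sequences are all assumptions made for illustration. It lines up each short read against the long read with an edit-distance alignment that tolerates insertions, deletions, and substitutions, then overwrites the matched stretch of the long read with the accurate short read. A real pipeline would instead build a consensus from many overlapping short reads.

```python
def align_short_to_long(short, long_read):
    """Align all of `short` to the best-matching substring of `long_read`.

    Returns (edit_distance, start, end) so that long_read[start:end] is
    the matched substring. Hypothetical helper, for illustration only.
    """
    m, n = len(short), len(long_read)
    # dist[i][j]: fewest edits aligning short[:i] to a substring ending at j.
    # Row 0 is all zeros, so the match may begin anywhere in the long read.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    # start[i][j]: where that best substring begins.
    start = [list(range(n + 1)) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
        start[i][0] = 0
        for j in range(1, n + 1):
            sub = dist[i - 1][j - 1] + (short[i - 1] != long_read[j - 1])
            missing = dist[i - 1][j] + 1  # base absent from the long read
            extra = dist[i][j - 1] + 1    # spurious base in the long read
            best = min(sub, missing, extra)
            dist[i][j] = best
            if best == sub:
                start[i][j] = start[i - 1][j - 1]
            elif best == missing:
                start[i][j] = start[i - 1][j]
            else:
                start[i][j] = start[i][j - 1]
    end = min(range(n + 1), key=lambda j: dist[m][j])
    return dist[m][end], start[m][end], end


def correct_long_read(long_read, short_reads, max_error_rate=0.25):
    """Overwrite stretches of `long_read` with confidently placed short reads."""
    placements = []
    for s in short_reads:
        d, a, b = align_short_to_long(s, long_read)
        if d <= max_error_rate * len(s):  # assumed tolerance, not from the paper
            placements.append((a, b, s))
    corrected = long_read
    limit = len(long_read)
    # Work right to left so earlier coordinates stay valid; skip overlaps.
    for a, b, s in sorted(placements, reverse=True):
        if b <= limit:
            corrected = corrected[:a] + s + corrected[b:]
            limit = a
    return corrected


# Toy data: the long read has one substitution and one missing base.
truth = "ACGTTGCAGGATCCTA"
noisy_long = "ACGATGCAGGACCTA"
shorts = ["ACGTTGCA", "GGATCCTA"]
print(correct_long_read(noisy_long, shorts))
```

In this toy case the two short reads tile the true sequence, so the corrected long read can be compared directly with `truth`. The replace-the-window step is the simplest possible choice; voting base by base across many overlapping short reads would be more robust to errors in the short reads themselves.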
The result: Long reads that can be more easily assembled into a complete genome, and that have an accuracy of 99.9 percent. The team made the software open-source, and it has already been downloaded by many researchers working with sequencer machines from Pacific Biosciences.
The researchers tested their system by sequencing the genomes of several species, including, most impressively, the genome of the parrot Melopsittacus undulatus (aka the budgie).
"Our collaborators were interested in parrots because they use them as a model for language development," says Schatz. The geneticists want to study the parrot's version of those genes associated with language in humans, genes like FoxP2. Schatz says that the error-correcting software produced clear reads of these genes, and, more importantly, produced a clearer picture of where the genes fit in the long DNA strand, and what pieces of code surround them. "If you want a really high-quality genome, it's worth spending more time in the upfront sequencing, because you reduce the time in the follow up when you're trying to see how things are connected," says Schatz. "Our process is expensive, but you get much more information out."
Pacific Biosciences' single-molecule sequencing is cool stuff, but we're also keeping an eye on other methodologies, like the nanopore sequencing being pursued by several academic groups and hot companies. There's also the new chip-based sequencing device from Ion Torrent, a subsidiary of Life Technologies, which the company says will finally allow an entire genome to be sequenced for $1000 before the end of this year.