DNA data storage just got bigger and better. Scientists have reported the first random-access storage system from which they can recover individual data files, error free, from over 200 megabytes of digital information encoded into DNA.
Random access is key to a practical DNA-based memory, but until now researchers had achieved it with at most 0.15 megabytes of data.
Since submitting their research, published in Nature Biotechnology, the team from Microsoft Research and the University of Washington has already improved on what they reported. Their storage system now offers random access across 400 megabytes of data encoded in DNA with no bit errors, says Microsoft Research’s Karin Strauss, who led the new work with Luis Ceze from the University of Washington.
Microsoft and other tech companies are seriously considering the possibility of archiving data in DNA. Current data storage technologies are not keeping up with the breakneck pace at which we generate digital content, Strauss says. Synthetic DNA is an attractive storage medium because it can, in theory, store 10 million times as much data as magnetic tape in the same volume, and it survives for thousands of years. Technology Review reports that Microsoft Research aims to have an operational DNA-based storage system working inside a data center toward the end of this decade.
DNA data storage involves translating the binary 0s and 1s of digital data into sequences of the four bases A, C, G, and T that make up DNA. The encoded sequences are synthesized and stored in vials. A DNA sequencing machine then decodes the data by recovering the sequences from DNA molecules. But it has been hard to access specific data files. Most research efforts until now have sequenced and decoded the entire bulk of the information stored in a vial. “It is not economical to sequence all the data you have stored every time you want to read a portion of it,” Strauss says.
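The basic translation step can be illustrated with a toy mapping of two bits per base. This is only a minimal sketch of the general idea; the team's actual coding scheme is more sophisticated and avoids problematic sequences.

```python
# Toy 2-bits-per-base mapping (illustrative only; real DNA storage
# codes avoid long base repeats and other troublesome patterns).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    # Turn each byte into 8 bits, then each bit pair into one base.
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(seq: str) -> bytes:
    # Reverse the mapping: bases back to bit pairs, bit pairs to bytes.
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

At this density, one base carries two bits, so a 150-base snippet would hold less than 38 bytes before any addressing or error-correction overhead is added.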
To make a random access system, Strauss, Ceze, and their colleagues devised clever coding algorithms and turned to the polymerase chain reaction (PCR), a well-known lab technique for making thousands of copies of selected DNA strands, a process called amplification.
The researchers worked with 35 files ranging in size from 29 kilobytes to over 44 megabytes, which they had stored in DNA before. They encoded each file into a large number of 150-base-long DNA snippets. They used the error-correcting Reed-Solomon code but, unlike in their previous work, paired it with a coding scheme that converts longer strings of data bits into DNA sequences.
In the end, they had a DNA library of over 13 million unique 150-base-long sequences. Each snippet starts with a coded address that gives its location in the file. And snippets that belong to the same file are flanked by the same “primer target,” a short DNA strand that serves as a kickoff point for the polymerase chain reaction.
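The snippet layout described above can be sketched as follows. The field widths, the simple 2-bits-per-base address encoding, and the function names here are all hypothetical illustrations, not the team's actual design.

```python
# Hypothetical snippet layout: [file primer][address][payload].
# Real designs also add a reverse primer and constraint-aware coding.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}

def to_bases(bits: str) -> str:
    # Naive 2-bits-per-base conversion, for illustration only.
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def make_snippet(primer: str, index: int, payload: str) -> str:
    # A 16-bit index becomes an 8-base address marking the snippet's
    # position in the file; the shared primer target lets PCR later
    # amplify only the snippets belonging to this file.
    address = to_bases(format(index, "016b"))
    return primer + address + payload
```

Because every snippet of a file shares the same primer target, a single PCR reaction can pull out that file's snippets from the mixed pool, and the per-snippet addresses then restore their order.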
“We needed to be very careful in how we design the sequences,” Strauss says. Much thinking went into inventive algorithms that craft primer targets that do not coincide with the encoded data or address sequences.
When it’s time to read the data by sequencing the DNA, the researchers use PCR primers that amplify only the DNA snippets belonging to the chosen file. All the replicated DNA is sequenced. Finally, a new decoding algorithm the team developed clusters similar-looking sequences and uses statistical techniques and error correction to reconstruct the original sequences, which are then decoded to recover the digital data.
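A simple stand-in for the reconstruction step is a per-position majority vote across the noisy reads in a cluster. This toy consensus function assumes equal-length reads and is far cruder than the team's statistical decoder, but it shows how redundant reads can cancel out sequencing errors.

```python
from collections import Counter

def majority_consensus(reads: list[str]) -> str:
    # For each position, pick the base seen most often across the
    # clustered reads; isolated sequencing errors are outvoted.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )
```

For example, three reads of the same snippet where one read has a single-base error still yield the correct sequence, because the two error-free reads outvote it at that position.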
There is a lot more to be done, Strauss says. “We are looking at automating the process because a few parts of the process are still done by people or by expensive machines,” she says. “We want to make the system robust, automated, and cheaper.”