DNA Data Storage Gets Random Access

Researchers have devised a system to recover targeted files from 200 megabytes of data encoded in DNA

Illustration: iStockphoto

DNA data storage just got bigger and better. Scientists have reported the first random-access storage system from which they can recover individual data files, error free, from over 200 megabytes of digital information encoded into DNA.

Random access is key for a practical DNA-based memory, but until now, researchers have been able to achieve it with only up to 0.15 megabytes of data.

Since submitting their research, published in Nature Biotechnology, the team from Microsoft Research and the University of Washington has already improved on what they reported. Their storage system now offers random access across 400 megabytes of data encoded in DNA with no bit errors, says Microsoft Research’s Karin Strauss, who led the new work with Luis Ceze from the University of Washington.

Microsoft and other tech companies are seriously considering the possibility of archiving data in DNA. Current data storage technologies are not keeping up with the breakneck pace at which we generate digital content, Strauss says. Synthetic DNA is an attractive storage medium because it can, in theory, store 10 million times as much data as magnetic tape in the same volume, and it survives for thousands of years. Technology Review reports that Microsoft Research aims to have an operational DNA-based storage system working inside a data center toward the end of this decade.

DNA data storage involves translating the binary 0s and 1s of digital data into sequences of the four bases A, C, G, and T that make up DNA. The encoded sequences are synthesized and stored in vials. A DNA sequencing machine then decodes the data by recovering the sequences from DNA molecules. But it has been hard to access specific data files. Most research efforts until now have sequenced and decoded the entire bulk of the information stored in a vial. “It is not economical to sequence all the data you have stored every time you want to read a portion of it,” Strauss says. 
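To make the translation step concrete, here is a minimal sketch in Python. It assumes the simplest possible mapping, two bits per base (00 to A, 01 to C, 10 to G, 11 to T). That mapping and the function names are illustrative assumptions, not the coding scheme the researchers used.

```python
# Illustrative two-bits-per-base mapping. This is an assumption for the sketch,
# not the encoding used by the Microsoft/University of Washington team.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Translate bytes into a string of DNA bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Translate a string of DNA bases back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)          # CAGACGGC
    print(decode(strand))  # b'Hi'
```

A real system would avoid sequences that are hard to synthesize or sequence and would add error correction. This sketch does neither.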

To make a random-access system, Strauss, Ceze, and their colleagues devised clever coding algorithms and turned to the polymerase chain reaction, a well-known lab technique for making thousands of copies of DNA strands, a process called amplification.

The researchers worked with 35 files ranging in size from 29 kilobytes to over 44 MB, which they have stored in DNA before. They encoded each file into a large number of 150-base-long DNA snippets. They used the error-correcting Reed-Solomon code, but unlike in their previous work, they used a coding scheme that converts longer strings of data bits into DNA sequences.

In the end, they had a DNA library with over 13 million unique 150-base-long DNA sequences. Each snippet starts with a coded address that shows its location in the file. And snippets that belong to the same file are flanked with the same “primer target,” a short DNA strand that is a kickoff point for the polymerase chain reaction.
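The snippet layout described above can be sketched roughly as follows. Only the 150-base snippet length comes from the article. The primer-target length, the address length, the ordering of the fields, the padding rule, and every name in the code are assumptions made for illustration.

```python
STRAND_LEN = 150   # snippet length reported in the article
PRIMER_LEN = 20    # assumption: length of each flanking primer target
ADDRESS_LEN = 10   # assumption: number of bases used for the address
PAYLOAD_LEN = STRAND_LEN - 2 * PRIMER_LEN - ADDRESS_LEN  # 100 bases of data


def address_to_bases(index: int, length: int = ADDRESS_LEN) -> str:
    """Write a snippet's position in the file as a fixed-width base-4 number."""
    if not 0 <= index < 4 ** length:
        raise ValueError("index does not fit in the address field")
    digits = []
    for _ in range(length):
        index, remainder = divmod(index, 4)
        digits.append("ACGT"[remainder])
    return "".join(reversed(digits))


def build_strands(payload: str, front_primer: str, back_primer: str) -> list[str]:
    """Split one file's bases into addressed snippets flanked by primer targets."""
    strands = []
    for index, start in enumerate(range(0, len(payload), PAYLOAD_LEN)):
        # Assumption: pad the final chunk with 'A' so every snippet has equal length.
        chunk = payload[start:start + PAYLOAD_LEN].ljust(PAYLOAD_LEN, "A")
        strands.append(front_primer + address_to_bases(index) + chunk + back_primer)
    return strands


if __name__ == "__main__":
    snippets = build_strands("ACGT" * 60, "C" * PRIMER_LEN, "G" * PRIMER_LEN)
    print(len(snippets), len(snippets[0]))
```

Because every snippet in a file shares the same primer targets, the file can later be picked out of the mixture as a group, while the address restores the order of its snippets.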

“We needed to be very careful in how we design the sequences,” Strauss says. Much thinking went into inventive algorithms that craft primer targets that do not coincide with the encoded data or address sequences.

When it’s time to read the data by sequencing the DNA, the researchers use primers for the polymerase chain reaction that amplify only the DNA snippets that belong to a chosen file. All the replicated DNA is sequenced. Finally, a new decoding algorithm that the team has developed clusters together similar-looking sequences and uses statistical techniques and error correction to reconstruct the original sequences, which are then decoded to get the digital data.
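The read path can be mimicked in software with a heavily simplified sketch. It assumes that primer selection can be modeled as a prefix match, that reads are grouped by an exact address rather than by similarity, and that sequencing errors are substitutions only, so a per-position majority vote is enough. The team's actual decoder is more sophisticated, and all names below are hypothetical.

```python
from collections import Counter, defaultdict


def select_file(reads: list[str], primer: str) -> list[str]:
    """Stand-in for PCR selection: keep only reads that start with the primer."""
    return [read for read in reads if read.startswith(primer)]


def consensus(reads: list[str]) -> str:
    """Majority vote at each position; assumes all reads have equal length."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def reconstruct(reads: list[str], primer: str, address_len: int) -> dict[str, str]:
    """Group the selected reads by address and vote on each group's payload."""
    groups = defaultdict(list)
    for read in select_file(reads, primer):
        body = read[len(primer):]
        groups[body[:address_len]].append(body[address_len:])
    return {address: consensus(group) for address, group in groups.items()}


if __name__ == "__main__":
    noisy_reads = [
        "CCAAACGT",  # primer CC, address AA, payload ACGT
        "CCAAACGT",
        "CCAAACTT",  # same snippet with one substitution error
        "GGAAACGT",  # belongs to a different file (primer GG)
        "CCACTTTT",  # primer CC, address AC, payload TTTT
    ]
    print(reconstruct(noisy_reads, "CC", address_len=2))
    # {'AA': 'ACGT', 'AC': 'TTTT'}
```

An error inside the address itself would send a read to the wrong group in this sketch, which is one reason a real decoder clusters by overall similarity and leans on error-correcting codes.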

There is a lot more to be done, Strauss says. “We are looking at automating the process because a few parts of the process are still done by people or by expensive machines,” she says. “We want to make the system robust, automated, and cheaper.”


Foundries such as the Edinburgh Genome Foundry assemble fragments of synthetic DNA and send them to labs for testing in cells.

Edinburgh Genome Foundry, University of Edinburgh

In the next decade, medical science may finally advance cures for some of the most complex diseases that plague humanity. Many diseases are caused by mutations in the human genome, which can either be inherited from our parents (as in cystic fibrosis) or acquired during life (as in most types of cancer). For some of these conditions, medical researchers have identified the exact mutations that lead to disease; but in many more, they're still seeking answers. And without understanding the cause of a problem, it's pretty tough to find a cure.

We believe that a key enabling technology in this quest is a computer-aided design (CAD) program for genome editing, which our organization is launching this week at the Genome Project-write (GP-write) conference.

With this CAD program, medical researchers will be able to quickly design hundreds of different genomes with any combination of mutations and send the genetic code to a company that manufactures strings of DNA. Those fragments of synthesized DNA can then be sent to a foundry for assembly, and finally to a lab where the designed genomes can be tested in cells. Based on how the cells grow, researchers can use the CAD program to iterate with a new batch of redesigned genomes, sharing data for collaborative efforts. Enabling fast redesign of thousands of variants can only be achieved through automation; at that scale, researchers just might identify the combinations of mutations that are causing genetic diseases. This is the first critical R&D step toward finding cures.
