Sudoku Hints at New Encoding Strategy for DNA Data Storage

New encoding method makes it possible to come close to the theoretical maximum for DNA data storage


DNA double helix
Photo: iStockphoto

Researchers affiliated with Columbia University and the New York Genome Center have reported a new encoding method that makes it possible to come close to the theoretical maximum for DNA data storage.

In research published in the journal Science, the team says its encoding method achieved a 60% increase in storage capacity over previously reported efforts, resulting in a jaw-dropping storage density of 215 petabytes per gram of DNA. For perspective, one petabyte is equivalent to 13.3 years' worth of HD video.

Last year, Microsoft announced that its researchers had set the DNA data-storage record of 200 megabytes. While that was a good indication of how far DNA data storage had come, the announcement was short on detail, appearing only in a blog post on the Microsoft website. To those in the field, a peer-reviewed paper seemed sorely lacking.

“Our work is the first one in the literature to show that you can get very close to the theoretical capacity of DNA storage architecture,” said Yaniv Erlich, Assistant Professor of Computer Science at Columbia and Core Member of the NY Genome Center, in an interview with IEEE Spectrum. In fact, Erlich and his co-author Dina Zielinski of the New York Genome Center report coming within 14% of the theoretical limit.

“We reached a DNA density of information of 215 petabytes per gram of DNA so we can get to 100 times more density of information than previous results in this domain,” said Erlich.

To store digital data on a DNA molecule, you need a way to translate the binary code of 0s and 1s into the genetic code of As, Cs, Gs, and Ts that make up a DNA molecule. A piece in IEEE Spectrum from last year goes through this process thoroughly and makes for a good primer on the subject.

The gist is that you encode the digital data into the DNA. Then, when you want to retrieve the information, you use a standard DNA sequencing machine to read the material. Once the machine gives you the DNA sequence, you can translate it back into binary code to recover your original file.
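The round trip described above can be sketched in a few lines of Python. This is only an illustration: the article does not say which bit-to-base table is used, so the two-bits-per-base mapping and the function names below are assumptions.

```python
# Hypothetical mapping of bit pairs to bases; the article does not specify one.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data: bytes) -> str:
    """Translate binary data into a string of As, Cs, Gs, and Ts."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(seq: str) -> bytes:
    """Translate a sequenced DNA string back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(bytes_to_dna(b"Hi"))          # CAGACGGC
print(dna_to_bytes("CAGACGGC"))     # b'Hi'
```

With two bits per base, every byte becomes four nucleotides, which is the upper bound a four-letter alphabet allows before any constraints on the sequence are applied.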

This sounds pretty straightforward, but there are two main challenges when information is encoded on DNA, according to Erlich.

“The first one is that not all DNA molecules are created equally,” he explained. “If you have DNA molecules that have a long stretch of the same nucleotide, such as AAA, it is not very favorable for the informatics machinery. It’s very hard to read this molecule without an error. So you want to avoid stretches like that.”

The second main issue is that not all DNA molecules are going to make it. Because of the stochastic noise that propagates through these molecules, some of the sequences will never show up at the end of the process.

“You need to overcome these two challenges,” added Erlich. “This is why we developed DNA fountain, which is a method to overcome these challenges and find one solution for these two problems.”
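The first challenge, avoiding long stretches of the same nucleotide, amounts to a simple screen over each candidate sequence. The sketch below is an assumption-laden illustration, not the authors' code: the function names are hypothetical, and the `max_run` threshold of 2 is chosen only so that the "AAA" example from the quote is rejected.

```python
from itertools import groupby


def longest_run(seq: str) -> int:
    """Length of the longest stretch of one repeated nucleotide."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def passes_screen(seq: str, max_run: int = 2) -> bool:
    """Accept a sequence only if no nucleotide repeats more than max_run times.

    max_run is a hypothetical threshold; the article gives no exact limit.
    """
    return longest_run(seq) <= max_run


print(passes_screen("ACGTACGT"))   # True
print(passes_screen("CGAAATCG"))   # False
```

A real screen would likely check other chemical features as well; the article mentions only that some sequences have "the right chemical features" without listing them.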

The DNA fountain works a lot like a Sudoku puzzle, which has become an algorithm design approach for computer scientists. A Sudoku puzzle consists of a 9-by-9 grid made up of nine 3-by-3 subgrids. Digits appear in some squares and, based on these hints, a player tries to complete the grid so that each row, column, and subgrid contains the digits 1 through 9 exactly once.

Erlich and Zielinski liked the idea of hints in Sudoku puzzles. So for their DNA storage scheme, instead of encoding the information directly, they send hints about the data file.

“The thing about the fountain program is that it can generate as many hints as you want, so what we do on the computer is we let the fountain code run, generate the hints and we map these hints into the DNA molecules on the computer into DNA sequences,” explained Erlich.

“Then our program evaluates the DNA sequences and sees whether these sequences seem like a good sequence that we can synthesize. We only take the hints that have the right chemical features that we want. So we address the first problem about DNA not having sequences that are created equal, we just take DNA sequences that look good. The second thing is that we have many more hints than what you need to solve the content of the file; it is overdetermined, therefore you can lose some of these molecules and still decode the information back.”
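The process Erlich describes (generate hints, keep only those that map to acceptable DNA, then decode from whatever survives) can be illustrated with a toy program. Everything in it is an assumption made for illustration rather than the authors' method: each hint is the XOR of a pseudo-random subset of file segments, the two-bits-per-base mapping and the `MAX_RUN` limit are hypothetical, and decoding uses plain Gaussian elimination over GF(2), which is swapped in here for brevity.

```python
import random

# Hypothetical constants; none of these values come from the article.
BASES = "ACGT"
MAX_RUN = 2       # reject any base repeated more than this many times in a row
SEED_BYTES = 2    # each molecule carries the seed that identifies its hint


def to_dna(data):
    """Map each byte to four bases, two bits per base."""
    return "".join(BASES[(byte >> shift) & 3] for byte in data for shift in (6, 4, 2, 0))


def from_dna(seq):
    """Inverse of to_dna."""
    v = [BASES.index(base) for base in seq]
    return bytes(v[i] << 6 | v[i + 1] << 4 | v[i + 2] << 2 | v[i + 3]
                 for i in range(0, len(v), 4))


def has_long_run(seq):
    """True if some base repeats more than MAX_RUN times in a row."""
    return any(base * (MAX_RUN + 1) in seq for base in BASES)


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def segment_mask(seed, k):
    """Pseudo-randomly pick a non-empty subset of the k segments, as a bitmask."""
    return random.Random(seed).randrange(1, 2 ** k)


def make_hint(seed, segments):
    """One hint: the XOR of the segments selected by the seed."""
    mask = segment_mask(seed, len(segments))
    payload = bytes(len(segments[0]))
    for i, segment in enumerate(segments):
        if mask >> i & 1:
            payload = xor_bytes(payload, segment)
    return payload


def encode(segments, n_hints):
    """Keep generating hints, retaining only those whose DNA form passes the screen."""
    oligos = []
    for seed in range(256 ** SEED_BYTES):
        oligo = to_dna(seed.to_bytes(SEED_BYTES, "big") + make_hint(seed, segments))
        if not has_long_run(oligo):
            oligos.append(oligo)
            if len(oligos) == n_hints:
                break
    return oligos


def decode(oligos, k):
    """Solve the XOR equations by Gaussian elimination; None if underdetermined."""
    basis = {}  # highest set bit of a mask -> (mask, payload)
    for oligo in oligos:
        raw = from_dna(oligo)
        seed = int.from_bytes(raw[:SEED_BYTES], "big")
        mask, payload = segment_mask(seed, k), raw[SEED_BYTES:]
        while mask:
            top = mask.bit_length() - 1
            if top not in basis:
                basis[top] = (mask, payload)
                break
            mask ^= basis[top][0]
            payload = xor_bytes(payload, basis[top][1])
    if len(basis) < k:
        return None
    segments = []
    for top in range(k):  # back-substitute from the lowest segment upward
        mask, payload = basis[top]
        for lower in range(top):
            if mask >> lower & 1:
                payload = xor_bytes(payload, segments[lower])
        segments.append(payload)
    return segments


message = b"Hello, DNA world"
segments = [message[i:i + 2] for i in range(0, len(message), 2)]
oligos = encode(segments, n_hints=40)
survivors = [o for i, o in enumerate(oligos) if i % 4 != 0]  # lose a quarter of them
recovered = decode(survivors, k=len(segments))
print(recovered is not None and b"".join(recovered) == message)
```

The toy generates far more hints than the eight segments strictly require, which is what lets it tolerate the simulated loss. A practical scheme would aim for much less redundancy and would choose subset sizes and a decoder more carefully than this sketch does.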

While some observers took Microsoft’s announcement of storing 200 megabytes of digital information as a sign that DNA digital storage was just around the corner, Erlich remains more cautious about the timeline despite his recent breakthrough.

“My thoughts are that basically that it will take more time—5 to 7 years is too short,” he said. “There are still more challenges to solve here.”

Even with an extended timeline, Erlich cautions that DNA data storage will not take the place of the hard disk drives in our computers, but will more likely be a service that users access from the cloud to store a large amount of data for a long time.

Just as encouragement for people to check their work, Erlich encoded the DNA with a $50 Amazon gift card. If you figure out the code, you’ll get the gift card. One of Erlich’s Twitter followers has already tested the code and bought a book, so it should be repeatable.
