Is DNA the future of digital data storage?

A portrait of Rosalind Franklin with a secret hangs on the wall of the Bill & Melinda Gates Center for Computer Science and Engineering at the University of Washington in Seattle, USA.

The portrait is five years old and is a black acrylic ink painting of Franklin on top of a collage of nearly 2,000 photographs. All of the images are snapshots of precious memories sent by the public to Luis Ceze, a professor of computer science and engineering. But the real surprise is in the medium used to create Franklin's painting. The acrylic ink contains synthetic DNA, which encodes all the digital information needed to reproduce each photograph in the collage. “The photographs were encoded in DNA, which was then applied in the ink to create the portrait of Rosalind Franklin, a pioneer in DNA research,” says Karin Strauss, a senior research manager at Microsoft Research in Redmond, Washington.

Each photograph in this composite portrait of Rosalind Franklin is also encoded in ink using DNA.

The idea of storing digital information in the form of adenines (As), thymines (Ts), cytosines (Cs), and guanines (Gs) in synthetic DNA has been around for decades. It’s a more compact and durable alternative to the binary code (strings of 0s and 1s) used in traditional computing. Over the past decade and a half, there’s been a flurry of examples of storing data in DNA. Other demonstration projects include storing 154 of Shakespeare’s sonnets, a portion of Martin Luther King Jr.’s 1963 “I Have a Dream” speech, and the first episode of the Netflix series Biohackers.

“The idea of storing digital data in DNA is not a completely new concept, but it is becoming more and more viable,” says Ceze. A major step forward was the creation of the DNA Storage Alliance in 2020. This major industry and academic collaboration is creating an interoperable storage ecosystem in which the technologies for every step of data storage and retrieval are compatible. This will help avoid a repeat of the videotape format wars that pitted incompatible Betamax and VHS systems against each other in the late 1970s and 1980s.

Solving the data problem

The logistics of storing and retrieving DNA data vary among the demonstration projects, but the basic steps are the same. First, the data is encoded as a pattern of nucleotide bases, just as it is currently encoded as 0s and 1s. Then, multiple copies of DNA strands with that pattern of bases are synthesized in the lab. The DNA is then stored for a period of time. To extract the information, the base pattern in the DNA is read using sequencing technologies (which were originally developed for genomic and medical research). The DNA can be recovered from, for example, Ceze’s portrait of Franklin by scraping off a little paint.
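The encode and decode steps can be illustrated with a minimal sketch. The two-bits-per-base mapping below is an assumption chosen for simplicity, not the scheme used in any of the projects mentioned, and the function names are hypothetical. Real encodings also add constraints (for example, avoiding long runs of the same base) that are left out here.

```python
# Hypothetical mapping: each pair of bits becomes one base.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of A/C/G/T, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Invert encode(): read the bases back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"hello")   # 5 bytes become 20 bases
print(strand)
print(decode(strand))
```

In this toy scheme every byte costs four bases, so the synthesis and sequencing steps in between are where the real cost and time go.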

Storing information in DNA has many advantages over existing methods, the most important of which is durability. The information stored by our ancestors allows us to look into their world. You can still see cave paintings from prehistoric times, examine hieroglyphs carved into rocks, and read books from the 11th century. Modern storage media, on the other hand, are not designed to last. The materials degrade fairly quickly, and the playback technology quickly becomes obsolete, making it difficult to retrieve data that is more than a decade old. How many users can still access data stored on vinyl records, cassettes, videotapes, floppy disks, or Zip disks? Most new laptops no longer even have a CD or DVD drive. Museums and companies that store large amounts of data understand that we still don't know how to store data for long periods of time, says Robert Grass, professor of functional materials at ETH Zurich in Switzerland.

The basic steps of encoding and decoding data using DNA are generally the same for different methods.

DNA is what nature uses to store the information that all living things need to grow, reproduce, and function. When stored properly, it lasts for many thousands of years. In recent years, scientists have read DNA from the teeth of mammoths that are one million years old, and have found evidence of horseshoe crabs and mastodons in environmental DNA that is two million years old. “DNA lasts a very, very long time, especially if it is stored without oxygen or water and in the dark,” says Emily Leproust, chief executive of Twist Bioscience in San Francisco.

It is also very likely that future generations will retain the ability to read DNA. “DNA is so important to human health that you will always be able to read DNA. Maybe in 100 years we will not be using Illumina or PacBio, it will be different sequencing technologies, but we will always be able to read it,” says Leproust.

Then there’s the issue of density and power consumption. Data stored “in the cloud” is actually stored in huge data centres scattered around the world. The Cardiff Data Centre Campus, for example, is about 140,000 square metres and uses 270 megawatts of power, enough to power a small city. By contrast, “DNA is very dense,” says Leproust. “You can fit dozens of data centres in an area the size of a sugar cube.” It requires no energy to store, and in a sealed container it will last for thousands of years, provided it’s dried and kept cool enough.

Time Capsule

Before DNA data storage can become mainstream, two major hurdles need to be overcome: reducing the cost of DNA synthesis and sequencing, and increasing speed. Significant efforts are currently underway to achieve these goals. However, it is unlikely that DNA data storage will be cheap and fast enough to completely replace electronic data storage. Instead, it is expected to fill niche gaps in the market for purposes such as archiving data that needs to be stored for long periods of time without the need for frequent retrieval. This category includes culturally significant data, legal documents, and important government information. “I’ve spoken to the UK National Archives and the British Library,” says Thomas Heinis, reader in computing at Imperial College London, UK.

Phosphoramidite chemistry is the main approach used today to synthesize DNA in the laboratory. Synthetic DNA is assembled one nucleotide at a time by forming covalent bonds between the 3′-phosphite ester groups and the 5′-hydroxyl groups on adjacent deoxyribose sugar units. Because nucleotides are added one at a time and each addition requires protection and deprotection steps, creating synthetic DNA is a labor-intensive and expensive process.

Miniaturizing DNA synthesis is one way to reduce costs. DNA is typically produced in 96-well plates, with each well holding a single DNA fragment. Twist Bioscience has developed an inkjet-based platform that can create 1 million DNA fragments at a time. It uses silicon chips (similar to those used in the semiconductor industry) that are micropatterned with tiny wells where the chemistry takes place. The platform “uses 99.8 percent less chemicals” for each DNA fragment created, Leproust says. “Because we use fewer reagents, it’s cheaper,” she adds. Twist Bioscience’s technology is already being used to create custom synthetic DNA strands for vaccines, drugs, diagnostics, and other biotech applications. Early access to its DNA storage service is planned for 2025, Leproust says.

A future full of mistakes

Another approach being used to reduce the cost of synthesis (and speed up sequencing) is the use of error-correcting codes. These extra stretches of DNA carry redundant information that allows errors to be corrected so that the data can still be read. Data strings in electronic storage also contain redundant code that can be used to correct errors if something goes wrong. The ability to correct errors in data read at the end of the process opens the possibility of using less precise, but cheaper and faster, synthesis and sequencing tools. “We can do something at the level of coding, in the DNA information itself, to deal with errors,” explains Jeff Nivala, an assistant professor of computer science and engineering at the University of Washington. “Then I can have a very high error rate in my [synthesis or] sequencing device, because it will be easy for me to correct them.” Error-correcting codes can also cope with errors that occur during storage. For example, compact discs with light scratches on the surface can still be played back thanks to error-correcting codes.
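To see how redundancy makes errors recoverable, consider a toy repetition code: each base is written three times and the reader takes a majority vote. This is only an illustration with hypothetical function names. The codes used in real DNA storage systems are far more efficient, and this sketch handles only substituted bases, not the inserted or deleted bases that synthesis and sequencing can also produce.

```python
from collections import Counter


def add_redundancy(strand: str, copies: int = 3) -> str:
    """Repeat each base `copies` times (a toy repetition code)."""
    return "".join(base * copies for base in strand)


def correct(noisy: str, copies: int = 3) -> str:
    """Recover the original strand by majority vote within each group."""
    groups = (noisy[i:i + copies] for i in range(0, len(noisy), copies))
    return "".join(Counter(group).most_common(1)[0][0] for group in groups)


protected = add_redundancy("ACGT")   # "AAACCCGGGTTT"
damaged = "AAACTCGGGTAT"             # two substitution errors
print(correct(damaged))
```

The price of this protection is a threefold increase in the amount of DNA to synthesize, which is why the choice of code matters so much when synthesis is the expensive step.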

Various methods can be used to correct possible errors.

Among the lower-precision DNA synthesis methods currently being explored for data storage is massively parallel light-directed synthesis with the addition of error-correcting codes. Grass and his collaborators, including Mark Somoza, a professor of chemistry at the University of Vienna in Austria, are pioneering this approach. “We can synthesize about 2 million sequences in parallel,” Somoza explains. The process removes the protecting groups on the 5′ hydroxyl using ultraviolet light in a flow-through system to which the necessary reagents for each step are cyclically added. “Usually, there is an acid-labile protecting group at the 5′ site, and we replaced it with a photolabile group,” Somoza says. During deprotection, an array of micromirrors is used to precisely direct ultraviolet light onto the synthesis surface. The rest of the chemistry is very similar to traditional DNA synthesis methods. Light-directed synthesis is significantly cheaper and faster than conventional DNA synthesis. Using this approach, the team demonstrated flawless data recovery from a file containing Mozart’s sheet music.

AI to the rescue

Olgica Milenkovic, a professor of data science at the University of Illinois at Urbana-Champaign, is exploring a different approach to dealing with errors: artificial intelligence (AI). “Synthetic DNA is so expensive that using coding to correct errors would be a huge overhead,” she explains. “We use a set of [already developed] machine learning and artificial intelligence techniques to make DNA-encoded images look better when there are errors, rather than trying to fix them.” This approach is not suitable for data that requires high accuracy, but it works well for images, where AI tools already exist to “fix” damage in old photos so that it is no longer visible to the naked eye.

Milenkovic has also developed an approach that records information in DNA in two layers. The image information is placed into the nucleotide patterns of the synthetic DNA using traditional DNA synthesis methods. Copyright information and patterned watermarks are then added to the DNA backbone. These are written in binary code, created by enzymes that make nicks (small cuts) in the backbone. “If you have a nick, it’s a one; if you don’t have a nick, it’s a zero,” Milenkovic explains. Using two layers of encoding allows more information to be stored in the same space. Milenkovic and her team used their approach to store and retrieve eight stills of Marlon Brando.
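The second layer is a plain binary signal: a cut (nick) at a given backbone position reads as 1 and an intact position reads as 0. The sketch below models only that bookkeeping. The idea of a fixed list of candidate positions and the function names are assumptions for illustration and do not describe the team's actual protocol.

```python
def write_nick_layer(bits: str) -> set[int]:
    """Return the backbone positions that would be cut to store `bits`."""
    return {position for position, bit in enumerate(bits) if bit == "1"}


def read_nick_layer(nicked: set[int], n_positions: int) -> str:
    """Read the layer back: '1' where a position is cut, '0' otherwise."""
    return "".join("1" if p in nicked else "0" for p in range(n_positions))


watermark = "10110010"
nicked_positions = write_nick_layer(watermark)
print(sorted(nicked_positions))
print(read_nick_layer(nicked_positions, len(watermark)))
```

Because the cuts sit on the backbone and leave the base sequence unchanged, this layer can be added on top of strands that already carry image data in their bases.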

Enzymatic synthesis is also being explored to create synthetic DNA. The technology is less mature than phosphoramidite chemistry, but it has the potential to be a faster, cheaper alternative that doesn’t require toxic chemicals. “Phosphoramidite chemistry is very messy, very toxic, very complex, and very expensive,” says Ceze. “We need to really focus on enzymatic synthesis to make it controllable and high-throughput.” Kern Systems, which spun out of George Church’s lab at Harvard, and the French company DNA Script are among those pushing enzymatic DNA synthesis for data storage.

Speed reading

Sequencing-by-synthesis methods, such as Illumina's sequencing platforms, are currently the gold standard for reading the data stored in DNA. Nanopore sequencing is gaining momentum due to its ability to sequence single DNA molecules without the need for amplification. These devices use molecular motors to push strands of DNA through pores in a polymer membrane containing a detector. Ions in the surrounding solution pass through the pores, creating an electrical current, and as each base passes through the pores, a different (measurable) distortion of that current is created. “This technology is a really great way to [achieve] high-precision sequencing,” explains Nivala.

Neither sequencing by synthesis nor current nanopore sequencing machines are fast enough to be used for DNA storage. Commercial Oxford Nanopore machines, for example, top out at 400 bases per second. Faster, cheaper nanopore sequencers are being developed in academic and commercial labs around the world. “If we can get rid of these molecular motors and use electrophoretic energy or voltage energy in the machine itself, you can push DNA strands through nanopores orders of magnitude faster,” adds Nivala. Accepting lower sequencing precision will also lower the price per gigabyte of data. For comparison, Oxford Nanopore’s single-use Flongle flow cells cost $90 each and can sequence up to 2.6 gigabytes of data in 16 hours, about the amount of data needed to store a movie like Star Wars: The Last Jedi in standard definition.
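A rough back-of-the-envelope calculation shows why these speeds fall short. The sketch below takes the quoted figures at face value and adds two assumptions that are not from the article: that the 400 bases per second applies to a single read channel, and that each base carries two bits with no redundancy.

```python
BASES_PER_SECOND = 400       # quoted speed; assumed to be one read channel
BITS_PER_BASE = 2            # assumption: ideal encoding, no error correction
FLOW_CELL_COST_USD = 90      # quoted price of one single-use flow cell
FLOW_CELL_GIGABYTES = 2.6    # quoted maximum output of one flow cell

# Raw reading speed of a single channel, in bytes per second.
bytes_per_second = BASES_PER_SECOND * BITS_PER_BASE / 8

# Time for that one channel to read a gigabyte (10**9 bytes), in days.
days_per_gigabyte = 1e9 / bytes_per_second / 86_400

# Consumable cost of reading one gigabyte with the quoted flow cell.
usd_per_gigabyte = FLOW_CELL_COST_USD / FLOW_CELL_GIGABYTES

print(f"{bytes_per_second:.0f} bytes/s per channel")
print(f"{days_per_gigabyte:.1f} days per gigabyte on one channel")
print(f"${usd_per_gigabyte:.2f} per gigabyte read")
```

Under these assumptions a single channel would need months to read one gigabyte, which is why practical devices read many strands in parallel and why faster pores are an active research goal.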

Automating the entire data storage process is another area of active work. The synthesis and sequencing technologies are already largely automated, but the intermediate steps are still done manually. “When we do a DNA data storage experiment, there are a lot of PhD students moving around the lab pipetting the material,” explains Grass.

For DNA data storage to go mainstream and be used for more than just archival data that is rarely accessed, the write-store-read cycle needs to be fully automated, Strauss and Ceze wrote in a 2019 paper describing an automated end-to-end example of DNA data storage. Their benchtop setup first converts the data (with an additional error-correcting code) from zeros and ones to As, Ts, Cs, and Gs. These bases are then fed, one by one, onto a column, where they are linked together using phosphoramidite chemistry. Once the strands are ready, they are washed off the column’s solid supports and placed in a storage bottle. To retrieve the data, the liquid is pumped into Oxford Nanopore’s MinION, where the DNA is sequenced. Finally, this code of As, Ts, Cs, and Gs is decoded back into zeros and ones. In their first demonstration, the scientists stored and retrieved the word “hello” in 21 hours. “While we demonstrated that it was possible to fully automate an end-to-end DNA data storage system at low cost, it was not a high-throughput system,” says Strauss. Work is currently underway to expand and accelerate this automated approach.

Of course, there’s still a long way to go before DNA data storage becomes mainstream. Those working in the field believe it’s only a matter of time before massive, energy-hungry data centers are replaced by tiny capsules of DNA that our descendants will be able to access thousands of years from now. Which raises the question: What message would you like to leave for those who follow in your footsteps?
