Opportunities and prospects

Greetings to all IT specialists and techies. Having released the latest part of the saga about NMR, I experienced catharsis and felt I had earned the moral right to indulge in graphomania on abstract topics again. Today we will dive into DNA/RNA data storage. The topic is interesting, and much closer to IT than all my previous opuses, so let's go!

A primer (if you already know this, feel free to skip to the next section)

Everyone knows in general terms what DNA and RNA (we'll just call them NAs) are, but ask, for example, what exactly the letters D and R stand for in these abbreviations, and few will answer; and as for why they are called acids at all, people usually mumble something about nitrogenous bases, and really only specialists can explain it… So I will start from the basics: NAs are macromolecular nucleic acids, built from monomer units of the following type:

Double-stranded DNA fragment.

The backbone of the macromolecule consists of alternating phosphoric acid residues (yellow circle) and sugar residues in cyclic form (orange pentagon), while the side chain carries a nitrogenous base (red, green, blue or purple). That is, the monomer unit of an NA is formally amphoteric: it contains both an acidic and a basic center. But the phosphoric acid residue is a fairly strong acid, while the nitrogenous bases are bases mostly in name, so overall the macromolecule exhibits mainly acidic properties, which is why it is called an acid. The macromolecule ends with a monosaccharide residue at one end and with a phosphoric acid residue at the other, so its direction is well defined: we know from which end we should start reading it.

DNA contains four nitrogenous bases: adenine, cytosine, guanine and thymine; RNA has uracil instead of thymine. There is a certain binary relationship between these bases – complementarity. We will not discuss its nature here; let's simply accept that complementary bases bind to each other and non-complementary ones do not. Now about D and R: DNA is deoxyribonucleic acid and RNA is ribonucleic acid. Ribonucleic means that it contains a fragment of the monosaccharide ribose, and deoxyribonucleic means that its ribose is deoxygenated, that is, deprived of one oxygen atom, which is clearly visible in the picture:

Difference between DNA and RNA: deoxyribose has no oxygen at the bottom right.

The principle of storing information in NAs is quite obvious: the macromolecule is a string, each element of which has 4 structural variants, so each element can store 2 bits. Whether an NA has one chain or two does not matter, because complementarity makes the two sequences identical up to one logical operation. So DNA is clearly redundant for the purpose of storing information; RNA will suffice, and without loss of generality we will discuss RNA from here on. On paper everything is simple: synthesize an RNA chain with the required sequence and you have written a line of data; sequence it and you have read the line back.
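
Just to make the "2 bits per base" bookkeeping concrete, here is a minimal Python sketch; the specific bit-to-base mapping is my own arbitrary choice, not any standard encoding used in the field:

```python
# Illustrative sketch: packing bytes into an RNA-style base string at 2 bits per base.
BASE_FOR_BITS = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "U"}
BITS_FOR_BASE = {b: k for k, b in BASE_FOR_BITS.items()}

def encode(data: bytes) -> str:
    """Turn each byte into 4 bases, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE_FOR_BITS[(byte >> shift) & 0b11])
    return "".join(bases)

def decode(seq: str) -> bytes:
    """Inverse of encode(): every 4 bases give back one byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BITS_FOR_BASE[base]
        out.append(byte)
    return bytes(out)

payload = b"Hi"
seq = encode(payload)          # 'CAGACGGC' – 8 bases for 2 bytes
assert decode(seq) == payload
```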

Chapter zero. Basic characteristics.

Now that the primer is over, let's get to the point. First, it would be good to understand whether there is any profit in all this. If we are talking about cold data storage, the main parameter of interest is information storage density. A monomer unit of RNA, that is, 2 bits, occupies about 2 nm³, which is very cool. The volume of one bit on modern storage media is hard to Google, but you can estimate it roughly (correct me if I'm wrong). Modern HDDs cram up to 2 TB of data onto one platter, and the physical volume of that platter is about 2·10²¹ nm³, which works out to roughly 10⁹ nm³ per byte; tape drives may have a slightly higher recording density, but not dramatically so. Solid-state drives are in the same ballpark. These are all figures for cold storage, that is, roughly speaking, for HDD platters lying around somewhere without cases, heads or electronics. So in terms of recording density we could, in deep theory, gain more than 7 orders of magnitude. Cool, of course. The question is whether even some of those 7 orders can actually be realized in practice.
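
A quick back-of-the-envelope check of these figures (all inputs are the rough estimates from the paragraph above):

```python
# Back-of-the-envelope check of the densities quoted above (all figures are rough).
platter_bytes = 2e12         # ~2 TB on one HDD platter
platter_nm3   = 2e21         # ~2 * 10^21 nm^3 physical volume of the platter
hdd_nm3_per_byte = platter_nm3 / platter_bytes       # ~1e9 nm^3 per byte

rna_nm3_per_unit = 2.0       # one monomer unit ~ 2 nm^3, carries 2 bits
rna_nm3_per_byte = rna_nm3_per_unit * 4              # 4 units per byte = ~8 nm^3

print(f"HDD: ~{hdd_nm3_per_byte:.0e} nm^3/byte")
print(f"RNA: ~{rna_nm3_per_byte:.0f} nm^3/byte")
print(f"theoretical gap: ~{hdd_nm3_per_byte / rna_nm3_per_byte:.0e}x")   # ~1e8, i.e. 7-8 orders
```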

To use RNA as a practically useful information carrier, it is necessary to ensure:

1) Directed synthesis of RNA with any predetermined sequence.

2) Storage of RNA for long periods of time (since we are talking about cold data storage).

3) Non-destructive reading of the RNA sequence.

Accordingly, we will go through these points in order: how each can be implemented, what process parameters are achievable, and what the prospects for improvement are. For definiteness, let's assume we are synthesizing chains 200 nucleotides long. Why 200, when 300 is possible and someone has apparently done even more? Because beyond that the yield drops significantly, the price rises and the probability of error grows. And 200 already sounds optimistic when you consider such synthesis as a routine write operation.

Chapter one. Synthesis of NAs.

There are two ways to synthesize an NA of a given structure: biosynthesis, modification and, hmm, non-biosynthesis, that is, chemical synthesis. NA biosynthesis is a template procedure: to biosynthesize an NA we must already have some information-carrying biopolymer, for example RNA, DNA or, in a pinch, a polypeptide. That looks like a dead end, although some life hacks can be squeezed out of it. Modification means taking a natural NA and changing certain regions in it – this is how retroviruses and transposons act, and humans can do the same. But obtaining a full predetermined sequence in an NA this way is unimaginably difficult and expensive, if not impossible, so this option does not suit us; for our hypothetical data bank, modification is more a potential attack vector than anything useful – we don't need that here. In theory, one could store data in separate, pre-defined regions of an NA, but that is excessive complexity for no clear reason.

One way or another, we eventually run into the question of where to get that information-carrying biopolymer, and here we have nothing left but chemical synthesis: using the methods of organic chemistry, build the NA macromolecule link by link. Today (for the synthesis of both NAs and polypeptides) we use a methodology developed several decades ago and practically unchanged since – solid-phase synthesis. Its principle is very simple: the NA chain is built up step by step, one monomer unit at a time, on a "tail" attached to the solid phase – a gel or a porous polymer support. That is, one cycle adds exactly one monomer unit. This cycle, if you look more closely, consists of four steps: coupling, capping, oxidation and detritylation (removal of the protecting group). I won't go into the essence and meaning of these procedures – that's material for a separate post – but here is a diagram, which is reasonably self-explanatory:

The cycle of NA synthesis, increasing the chain length by one monomer unit.

In its time, solid-phase synthesis was a breakthrough in the chemistry of biopolymers and opened the way to automation, because it completely eliminated the purification of intermediate compounds, which used to consume at least 90% of the total synthesis time. However, it has its technical limitations: for each step (remember, there are 4 per cycle) you need to load a reagent solution into the reactor, wait for the reaction to complete (possibly with heating), and then wash out the spent reagent in several washes. The solid phase for NA synthesis is a bed of beads, similar in size to millet grains or slightly larger. The beads are porous, with micron-sized pores, so any mass transfer inside them is not quick: replacing one solution with another takes tens of seconds, and this limit on the rate of diffusion of liquid in a porous body is fundamental. Manufacturers of modern automatic NA synthesis systems claim 2.5 minutes per cycle plus up to 50 minutes of post-processing (deprotection, cleavage from the support, purification), which is infinitely fast compared to traditional solution synthesis (and even more so compared to the devices I dealt with, whose cycle times were 5-10 times longer), but infinitely slow compared to any storage medium in use now or ever, including punch cards. Synthesizing our 200-unit NA on the most advanced commercial equipment would therefore take somewhere around 7-9 hours. Remember this: it is the first key parameter of our recording device.
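
To see how the write time scales, here is a trivial estimate using the vendor-claimed figures quoted above (the exact total obviously depends on which cycle time you believe):

```python
# Rough write-time estimate for solid-phase synthesis, using the figures quoted above.
# Both the per-cycle time and the post-processing time are vendor claims, not measurements.
def write_time_hours(n_bases: int, cycle_minutes: float, post_minutes: float = 50.0) -> float:
    """One cycle per base added, plus a one-off post-processing step at the end."""
    return (n_bases * cycle_minutes + post_minutes) / 60.0

for cycle in (2.0, 2.5):
    t = write_time_hours(200, cycle)
    print(f"{cycle} min/cycle -> {t:.1f} h per 200-base chain (~50 bytes)")
# 2.0 min/cycle -> 7.5 h ; 2.5 min/cycle -> 9.2 h
```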

Now about the scale and volumetric efficiency of all this. Theoretically, thanks to PCR, we only need one NA molecule to sequence (read) a sequence. In practice, reliably manipulating a single NA macromolecule with chemical synthesis (and especially purification) methods is simply impossible. My subjective opinion is that the absolute minimum you can start working with is about a million macromolecules. Let's even take, for the sake of round numbers, 600,000 macromolecules – that is 10⁻¹⁸ mol, or 1 attomole of substance. This is very little, about 60 femtograms. This is the absolute limit, and at the current level of technology even it is unattainable. The synthesis system mentioned above, the size of a home laser printer, promises fabulous efficiency figures of 99.5-99.7% per cycle, which is very cool; I would be more cautious and assume 99.0%. The promised NA yield is on the order of ~100 picomoles. Let's even suppose that with microfluidics and lab-on-a-chip we managed to downscale all this by a factor of 1000 and got a synthesizer the size of an apple. Plus piping, various pumps, reagent containers, product lines – all of this is external and can be shared across many synthesizers. Out of it will come ~1 femtomole, or about 60 picograms, of our NA (because scaling is nonlinear: as the device shrinks, the periphery takes up an ever larger share of the volume and the reactor an ever smaller one). This is good – it is 1000 times above the limit, so you can work with it reliably (and reliability matters for a recording device) using available means. In total, once every 5 hours or so, 50 bytes will come out of a volume of ~100 ml, which corresponds to a volumetric efficiency of about 1 byte·l⁻¹·h⁻¹. Is the speed taking your breath away yet? This is only the beginning…
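
For those who like to check the arithmetic, here is the same scale estimate in code; the ~320 g/mol average mass of an RNA residue is a textbook value I am assuming, not a number from this article:

```python
# Sanity-check of the scale estimates above (approximate; ~320 g/mol per RNA residue
# is an assumed textbook average).
AVOGADRO = 6.022e23
residue_g_per_mol = 320.0
chain_len = 200                      # bases per oligo
chain_bytes = chain_len * 2 / 8      # 2 bits per base -> 50 bytes

attomole = 1e-18                     # ~600,000 molecules
print(f"molecules in 1 amol: {attomole * AVOGADRO:.2e}")                                       # ~6.0e5
print(f"mass of 1 amol of 200-mer: {attomole * chain_len * residue_g_per_mol * 1e15:.0f} fg")   # ~64 fg

femtomole = 1e-15                    # assumed output of the downscaled synthesizer
print(f"mass of 1 fmol of 200-mer: {femtomole * chain_len * residue_g_per_mol * 1e12:.0f} pg")  # ~64 pg
print(f"payload per synthesis run: {chain_bytes:.0f} bytes")
```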

Chapter two. Storage of NAs.

Next we need to decide in what form to store the resulting oligoNAs. The problem is that we won't be able to organize notional labeled jars with a femtomole of NA in each so as to have random access to each one – too bad, but then again, this is cold storage. This means we break the data into fairly large (by the standards of NA capacity as an information carrier, of course) blocks that can only be read in their entirety. There are at least a few options for how to implement this:

1) Store individually in numbered micro-containers.

The containers must be very small – in fact, capsules ten or twenty microns each – with a quickly readable identifier. Doesn't that remind you of anything? Yes, it's a cell! Storage can be implemented both in natural cells and, say, encapsulated in a gel, numbered with a set of luminescent labels or, again, with grafted oligoNAs, and if necessary sorted by a flow cytometer with a microfluidic sorter, or by affinity methods. But, again, one cell per oligoNA won't work purely technically, which means the synthesis scale would have to be increased. For now this is all extremely difficult and expensive to implement, so I am not seriously considering this option yet and won't try to estimate its recording density.

2) Store in macroscopic containers, addressing the chains by numbers recorded in the chains themselves.

Everything here is simpler in terms of post-processing: just pour all the product into one container, an ordinary centrifuge tube, and that's it. The advantage over the first option is that the femtomole we got from our miniature synthesizer is sufficient. Overall the option is quite realistic and technically simple. However, chain numbering eats into the useful capacity of the oligoNAs. How much? If one oligoNA holds 50 bytes, we can easily sacrifice, say, 4 bytes for an address, giving more than 4 billion addresses. 4 billion times 60 picograms is about 250 mg of oligoNAs. Given that they need to be stored in a medium rather than in pure form, they will just about fit into a centrifuge tube. In terms of data recording density this is about 200 MB per 2 ml, that is (rounding very, very roughly) about 10¹⁴ nm³ per byte. On top of that, there is the problem that far from all sequencing methods can sequence individual oligoNAs in a mixture, especially in such a hellish one (more on that in the next chapter), and separating 4 billion NAs by macroscopic methods is unlikely to ever be possible.
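
To make option 2 concrete, here is a toy sketch of how a 4-byte address plus payload could be framed into one 200-base chain before base encoding; the field layout is just the example from the paragraph above, and the function itself is purely illustrative:

```python
# Sketch of option 2's layout: a 4-byte address prepended to the payload inside one
# 200-base oligo. Field sizes follow the article's example; the framing itself is made up.
import struct

CHAIN_BASES   = 200
CHAIN_BYTES   = CHAIN_BASES * 2 // 8           # 50 bytes per chain
ADDRESS_BYTES = 4                              # 2**32 ~ 4.3 billion addressable chains
PAYLOAD_BYTES = CHAIN_BYTES - ADDRESS_BYTES    # 46 bytes of useful data per chain

def frame(address: int, payload: bytes) -> bytes:
    """Pack one addressed block; this byte string is what would then be encoded into bases."""
    assert len(payload) <= PAYLOAD_BYTES
    return struct.pack(">I", address) + payload.ljust(PAYLOAD_BYTES, b"\x00")

block = frame(0x00C0FFEE, b"hello, cold storage")
print(len(block), "bytes ->", len(block) * 4, "bases")    # 50 bytes -> 200 bases
print(f"addressable chains: {2**(8*ADDRESS_BYTES):,}")     # 4,294,967,296
```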

3) Stitch into long NAs using biochemical methods.

On the one hand, this is the most correct storage method from the standpoint of supreme justice. On the other, technically it is a huge pain for both writing and reading. To stitch two oligoNAs into one, ligases are used. Essentially, the ends of the fragments to be joined must carry known sequences, and you need a fragment complementary to that pair (let's call it a patch) – a kind of welding magnet that temporarily holds the fragments together before the actual welding (by an NA ligase). Nothing fundamentally complicated, but it eats into the useful length of the oligoNA, is quite expensive and has a relatively high error rate. In theory, for a fairly large set of oligoNAs one could synthesize a set of patches pairwise complementary to the oligoNAs' terminal sequences and stitch everything together in the correct order in one pass, but firstly the error rate would rise to utterly indecent values, and secondly, how would you then separate that compote? The second option is to stitch sequentially. Then you can get by with one patch and greatly reduce the separation problem, but at the cost of time. The upside is that during storage you really can get by with a very small number of "long" NAs. I would say the recording density here will be 2-3 orders of magnitude higher than in case 2 (by my calculations below, ~10¹² nm³ per byte). In general, if we don't mind days or even weeks of writing, the option is more or less realistic if we stitch up to a length of 5-10 thousand bases. Though I'm not that well versed in NA ligases; there are probably plenty of pitfalls in this process.
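
A conceptual sketch of the patch (splint) idea, with made-up sequences and an arbitrary overlap length, just to show what "complementary to the terminal sequences" means in code:

```python
# Conceptual sketch of the "patch" for ligation (option 3): a short splint oligo
# complementary to the end of one fragment and the start of the next holds them
# together for the ligase. Sequences and overlap length are invented for illustration.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def complement(seq: str) -> str:
    return "".join(COMPLEMENT[b] for b in seq)

def make_patch(left_fragment: str, right_fragment: str, overlap: int = 10) -> str:
    """Splint spanning the junction: reverse complement of the last `overlap` bases
    of the left fragment plus the first `overlap` bases of the right fragment
    (reverse complement because the splint binds antiparallel)."""
    junction = left_fragment[-overlap:] + right_fragment[:overlap]
    return complement(junction)[::-1]

left  = "AUGCUAGCUAGGAUCCGAUA"
right = "GGCAUUCGAACCGGUUAACG"
print(make_patch(left, right))
```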

Ex vivo, oligoNAs are recommended to be stored at low temperatures, from -20 to -78 °C. Why this is so is not very clear to me, especially for short oligoNAs without higher-order structure. In my experience, small oligoNAs keep perfectly well in an ordinary refrigerator (+2 to +5 °C) for many years in a row. In vivo, as you understand, NAs are stored for decades at 36.6 °C, and if you don't interfere with them via external factors such as carcinogens or radiation, nothing happens to them. So for now let's agree that we don't need a cryobank – at most a large refrigerator. Something like numbered 96-well plates or tube racks is quite sufficient for storage, perhaps with increased well/hole density to make better use of the space. The recording densities we calculated would then need to be divided by 2-4, depending on how densely the NA-containing media can be packed into these plates/racks and how much space the transport systems and other supporting infrastructure consume.

Chapter three. Reading NAs.

Here we come to the part of the process most people have heard of – sequencing, that is, determining the sequence of an NA. At a minimum, for the read to be repeatable, before reading we need to amplify our "data block" – a mixture of short oligoNAs or long NAs. Fortunately, PCR – the polymerase chain reaction – will help us here.

For classical PCR you need a primer – a small oligoNA of 20-30 bases from which replication begins (don't nitpick the terminology). In this respect storage option 3, with long NAs, looks most advantageous: losing 20-30 primer-binding bases on a chain tens of thousands long is essentially painless, and one conventional test tube will hold a manageable number of different long NAs (1-10 million, versus 4 billion in a soup of short NAs; if stored in 10-20 µl microcontainers, that drops by another factor of 100, to 10-100 thousand). This makes it realistic to prepare a complete set of primers and run them one at a time when a read is needed. That is almost normal addressed access – not exactly cold storage, more like cool. We selectively amplify a specific long NA and sequence it by any available method, preferably of course NGS (more on sequencing methods separately). You can also amplify from an arbitrary position in the NA if you have the corresponding primer, but it's not necessary – this is still cold storage. So if we have a chain of 10 thousand bases (roughly the limit of reliable amplification of one fragment), then to within the margin of error we get a memory cell of 20 kbit, i.e. about 2.5 KB. One microcontainer with, say, 100 thousand different NAs then becomes a cluster of about 2 Gbit (250 MB). Taking walls and so on into account, the volume of such a container will be within 0.5 ml, which gives a density of ~10¹² nm³ per byte. Closer to HDDs already, but 3 orders of magnitude still separate us – or rather, them.
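
The cell/cluster arithmetic above, spelled out (all inputs are the article's own rough estimates):

```python
# The memory-cell and cluster sizes of option 3, spelled out.
bases_per_chain = 10_000                      # ~limit of reliable amplification
bits_per_chain  = bases_per_chain * 2         # 20 kbit
bytes_per_chain = bits_per_chain // 8         # 2,500 bytes ~ 2.5 KB
chains_per_tube = 100_000
cluster_bytes   = chains_per_tube * bytes_per_chain        # ~2.5e8, i.e. ~250 MB
tube_volume_nm3 = 0.5e-3 * 1e24               # 0.5 ml in nm^3 (1 l = 1e24 nm^3)

print(f"cell: {bytes_per_chain/1e3:.1f} KB, cluster: {cluster_bytes/1e6:.0f} MB")
print(f"density: ~{tube_volume_nm3 / cluster_bytes:.0e} nm^3 per byte")   # ~2e12
```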

If we are talking about the soup of oligoNAs from option 2, then giving up 20-30 bases out of a 200-base chain feels rather painful. Though it is only 10-15% of the data volume, one can live with that. If we use primers, we need ~4 billion of them (see above). Moreover, they are needed in fairly large quantities (they are consumed in every read operation), and they must be stored separately so that one can be physically picked out and added to the soup for selective amplification of the desired oligoNA. Possible, of course, but very cumbersome. Say, if you store the primers even in 15 ml centrifuge tubes, that is ~60,000 m³ of net volume; with racks and unused space it is several times more – a separate large building just for storing primers, plus complex, expensive and slow logistics. But there are alternatives. First, there is primer-free amplification. It copies the entire oligoNA indiscriminately, but that is exactly what we need. Right now this is a research method rather than a commercial technology, but there is nothing fundamentally complicated about it; the healthcare industry simply doesn't need it, so it hasn't been commercialized. The second option is outright cheating: with banal, cheap, crude chemistry you can easily obtain an exhaustive mixture of oligoNAs of roughly a given length (say, in the range of 20-30 bases). Exhaustive means that all possible combinations will be present, in proportions roughly matching the combinatorial ones. This is done in solution rather than on a solid phase, so you can make a lot of this primer soup. We then use this primer soup to amplify our oligoNA soup and get… a lot more of the same soup, enough for sequencing.
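
The primer-logistics numbers, for the skeptical (the 4^20 count for the exhaustive primer soup is my own illustrative addition):

```python
# Quick check of the primer-library logistics mentioned above.
n_primers     = 4e9        # one primer per addressable oligo (see the 4-byte address above)
tube_volume_l = 0.015      # a 15 ml centrifuge tube per primer
net_volume_m3 = n_primers * tube_volume_l / 1000
print(f"net storage volume: {net_volume_m3:,.0f} m^3")    # 60,000 m^3 before racks and aisles

# For comparison, the "primer soup" route: an exhaustive mixture of all 20-base primers
# contains 4**20 distinct sequences.
print(f"distinct 20-base primers: {4**20:.2e}")           # ~1.1e12
```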

MinION sequencer. Even now it's quite a compact thing.

But here, of course, there are problems. Firstly, I'm talking very breezily about amplifying a soup of billions of oligoNAs, but how well this works in practice, and with what error rate, I don't know. In theory, the longer the primer, the fewer errors, but the more data we eat up; it's solvable, the question is finding the balance. Secondly, far from every sequencing method will help here – I would say literally one and a half. Classical methods like Sanger sequencing are ruled out immediately. Of the NGS methods (at least as I see it), only MinION-style nanopore sequencing is suitable. The essence of the method is simple: there is a complex supramolecular structure (a scanner or reader) through which the NA chain is pulled. The passage of each base changes the electrical potential of the structure, which can be detected directly (yes, they really can measure potential changes on such small objects with electrodes) or indirectly. The result is an essentially digital signal with five logical levels – zero and the four base variants. Chain by chain, this device can read all the oligoNA variants present in the mixture, and with more than one reader the reading will of course go faster. The downside of the method is the high error rate: firstly, the error of the method itself, and secondly, data loss – purely statistically, there is a non-negligible probability that some of the 4 billion oligoNA variants present in the solution will never fall into the tenacious clutches of the scanner. Either that, or you have to ensure a high read fraction, which increases sequencing time. In short, the situation is roughly the same as with packet loss during data transmission. People cope with that quite well, and I am sure that, if necessary, they will also cope with losses when sequencing soup – hashing, repeated sequencing, or something fancier; that's not my strong suit. There is also the half-method: multidimensional gel electrophoresis followed by sequencing by other methods, for example Illumina/SOLEXA. Half because (a) I don't know whether multidimensional electrophoresis can in principle separate that many not-so-short oligoNAs, and (b) whether the separated quantity would even be enough for sequencing "as is". The soup approach, it seems to me, is good for maximally cold storage, where data access is needed as rarely as possible and all at once.
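
How bad can the "packet loss" get? A toy model, assuming every read grabs one of the distinct oligos uniformly at random (a strong simplification of real nanopore sampling):

```python
# Toy model of "packet loss" during soup sequencing: if each read picks one of the
# 4 billion distinct oligos uniformly at random, the expected fraction of oligos
# never read is roughly exp(-reads / variants).
import math

variants = 4e9
for coverage in (1, 3, 5, 10):            # average reads per distinct oligo
    missed = math.exp(-coverage)
    print(f"{coverage}x coverage -> ~{missed:.2%} of oligos never seen "
          f"(~{missed * variants:.1e} lost chains)")
```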

The reading speed essentially depends on the capabilities of MinION, so in the limit it is roughly the same for both methods: about 0.1 s per base, that is, 0.4 s per byte (we ignore amplification, since one amplification serves very, very many oligoNAs). The MinION sequencing unit, after some modification, could easily fit into a volume of 10 ml, so a (multi-threaded) reader of 1 liter would read 250 bytes per second, that is, about 900 KB·h⁻¹·l⁻¹. For method 2 we must factor in the fight against packet loss, which will cut the reading speed by, say, an order of magnitude, to 90 KB·h⁻¹·l⁻¹ – which is still 5 orders of magnitude higher than the write speed. With method 3 you can both selectively amplify a single NA and analyze it by the same scheme as the soup of method 2, so in the limit the read speed is the same, but there is the additional possibility of slower and more expensive addressed access.
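
And the read-speed arithmetic, using the same rough figures (the 100 parallel 10-ml units per litre is the implicit assumption of the paragraph above):

```python
# Reading-speed arithmetic from the paragraph above (all figures are rough estimates).
s_per_base  = 0.1
s_per_byte  = s_per_base * 4                    # 4 bases per byte -> 0.4 s
units_per_l = 100                               # 100 parallel 10-ml sequencing units per litre
bytes_per_s = units_per_l / s_per_byte          # 250 bytes/s
kb_per_h_l  = bytes_per_s * 3600 / 1000         # ~900 KB per hour per litre
print(f"{bytes_per_s:.0f} B/s -> {kb_per_h_l:.0f} KB/(h*l); "
      f"with ~10x redundancy against lost reads: {kb_per_h_l/10:.0f} KB/(h*l)")
```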

Bottom line

To sum up: so far I personally don't see how to bring the recording density anywhere near currently used media. The write speed is downright sad (the read speed is better). Reliability is also questionable, because the infrastructure of such a storage facility is in any case very large, complex and expensive. In numbers, the best practically achievable parameters look like this:

1) Recording density from ~10¹² nm³·byte⁻¹

2) Volumetric write speed of 1 byte·l⁻¹·h⁻¹

3) Volumetric read speed of 90 KB·l⁻¹·h⁻¹

4) The price is damn high: I would say such a storage facility, regardless of capacity, will cost several tens of millions of dollars.

The main problem is that we cannot operate with single NA macromolecules, or even hundreds or thousands of them – we need millions. The bottleneck is precisely the write process (oligoNA synthesis), because reading (amplification plus sequencing) can get by with smaller quantities. Although amplifying even 60 picograms of oligoNA is not an entirely routine task by the standards of today's commercial technologies, there are no fundamental obstacles to increasing sensitivity.

Immobilization of NAs on solid substrates sounds promising. Using STM or TEM you can read the sequence of even a single NA, but how to place those single NAs there – not randomly but in some ordered fashion, and, most importantly, not in a laboratory experiment taking many days but as routine technology – is not at all clear. One can also think about expanding the value range of the chain's logical element: simply put, using not 4 natural bases but more, supplemented with synthetic ones. The problem is how to amplify that, and whether the same MinIONs will be able to sequence it.

On the other hand, there is proteomics, where there are as many as 20 natural amino acids. Would that give much? Well, the gain in recording density is a bit more than a factor of 2 from the higher logical density (log₂20 ≈ 4.3 bits per residue versus 2), plus the monomer unit of a peptide is about 3 times smaller in physical volume than that of an NA. On the other hand, both amplification and sequencing are more complicated there. Still, there is hope that an analogue of MinION will be made for proteomics. And storage is much simpler: no refrigeration is needed at all, especially if cysteine is excluded from the amino acid list.
