Protein structure prediction

Pavel Buzin is with you again. As promised in the first part of the Nobel Prize saga, today we’ll talk about chemistry. I hope public interest in the news has not yet faded, because, I admit, even as a technically savvy person I needed a lot of time to understand the chemical side of this year’s laureates’ research.

October 9, 2024 can now be considered one of the most important dates in the history of artificial intelligence, because the Nobel Prize in Chemistry was essentially awarded for the application of AI methods. The American David Baker, together with Demis Hassabis and John M. Jumper of Google’s British subsidiary DeepMind, took the prize for predicting the structure of proteins. Stop. Since when did Google become a chemical company? In this article, we will take a slightly closer look at the researchers’ backgrounds and the methods they developed.

Demis Hassabis and John Jumper are, respectively, the CEO and a director of DeepMind, a Google subsidiary specializing in the development and application of artificial intelligence methods. It is worth mentioning that Nobel Prizes have been awarded more than once for discoveries made by employees of corporate research departments. Recall Bell Labs (now a division of Nokia), whose researchers created the first transistor and discovered the cosmic microwave background radiation, or the IBM researchers who received prizes for creating the scanning tunneling microscope and discovering high-temperature superconductivity.

Prizes have repeatedly been awarded for the development of new tools and methods that expand our capabilities and yield radically new scientific results. A striking example from physics is the Wilson cloud chamber (1927); in chemistry, “click chemistry” (2022), a method for synthesizing molecules with specified properties; and much, much more.

Computational methods, now united under the name of computer science, have also repeatedly received high honors. Most of these are Nobel Memorial Prizes in economics (owing to the absence of a prize in mathematics); as an example, take Leonid Kantorovich (1975), the father of linear programming, who received the prize with the wording “for his contribution to the theory of optimal resource allocation.”

What did they do in the field of computer science this time that was recognized as revolutionary in chemistry?

Geometric chemistry

The Nobel Committee’s press release reads: “The Royal Swedish Academy of Sciences has decided to award the Nobel Prize in Chemistry 2024 … ‘for protein structure prediction.’”

Already interesting. The study of proteins is one of the most important areas of modern science, covering chemistry, biology, medicine, pharmaceuticals, and computer science. Researchers in this area face three big challenges:

  • the complexity and high cost of conducting experiments;

  • protein molecules consist of long chains of amino acids, and the number of their combinations is enormous;

  • like any long molecules, protein molecules have a large number of internal degrees of freedom and, when twisted, can take on different shapes, including ones with an internally ordered or disordered structure.

Moreover, different parts of a molecule can become geometrically interlocked with one another mechanically, without the formation of new chemical bonds. And this greatly influences the properties of proteins when they interact with other agents: chemical reagents, other proteins, drugs, viruses, various components of cells.

In chemistry, there are many cases where the formula of a substance does not change, yet the arrangement of its molecules affects how the substance interacts with others. Everyone knows that water and ice consist of the same H2O molecules; however, ice occupies a larger volume and reacts less readily. Understanding the conditions under which proteins spontaneously change the geometry of their molecules, or knowing what factors initiate this, is extremely important. It affects things like:

  • shelf life of drugs and reagents;

  • the formation of inactive or even toxic protein isomers;

  • spontaneous change or restoration of molecular geometry over time.

The process of transforming the chain of amino acids that make up a protein molecule into an ordered structure is called protein folding.

https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/ProteinogenicAminoAcids.svg/600px-ProteinogenicAminoAcids.svg.png

The proteinogenic amino acids. Source: Wikimedia Commons

A typical protein molecule is a chain of hundreds or thousands of carbon atoms bonded to nitrogen, oxygen and sulfur atoms, cyclic and acyclic fragments, hydroxyl groups and much more. Previously, it was impossible to know in advance what spatial origami a chain of amino acids would eventually be “packed” into. Just imagine: a protein with a chain of 100 amino acids can fold into roughly 10^47 different configurations (see the estimate below). There are only about two dozen proteinogenic amino acids, but matters are further complicated by the fact that enantiomeric amino acids can take part in protein synthesis: molecules that are mirror images of each other in space (like the right and left hand). Because of their geometric properties, enantiomers interact with other molecules differently. Thus, the total number of amino acid variants, their combinations, enantiomers and ways to pack all of this into a protein molecule exceeds the number of particles in the Universe. Fortunately for researchers, nature is not that diverse: if you observe the folding and unfolding of amino acid chains “in vitro,” it turns out that they settle into a limited number of configurations. The number of options is limited because different spatial configurations of a molecule have different internal energies, and molecules tend toward the configurations with the lowest internal energy as the most stable.
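Where does a number like 10^47 come from? Here is a common back-of-the-envelope (Levinthal-style) count; the assumption of roughly three stable conformations per peptide bond is illustrative, not a measured value:

    # Levinthal-style estimate: ~3 stable conformations per peptide bond,
    # 99 bonds in a chain of 100 amino acids (illustrative assumptions)
    conformations_per_bond = 3
    n_bonds = 99
    print(f"{conformations_per_bond ** n_bonds:.2e}")  # -> 1.72e+47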

https://upload.wikimedia.org/wikipedia/commons/thumb/1/12/Milchs%C3%A4ure_Enantiomerenpaar.svg/298px-Milchs%C3%A4ure_Enantiomerenpaar.svg.png

The two enantiomers of lactic acid. Source: Wikimedia Commons

The topic of enantiomers is also important because, in the course of biological evolution on Earth, all proteins adopted only one mirror orientation, called left-handed, which manifests itself in the direction in which polarized light rotates when passing through a solution of the enantiomer. Separating enantiomers by physical or chemical means without destroying the molecules is extremely difficult. Meanwhile, mirror copies can be inactive or even poisonous, so to obtain a pure enantiomer, a “seed” in the form of a natural molecule is used during synthesis to set the required configuration.

Before we dive headfirst into the world of protein chemistry, let's brush up on a few terms we'll need later.

In biochemistry, the building blocks, amino acids and sugars, that remain unchanged during protein synthesis reactions are called residues. The concept of a residue is broader than that of a group of atoms (such as a hydroxyl, carboxyl or amino group), because a residue can have a more complex structure and consist of several groups.

We will also need to understand the peptide bond: the amino group (-NH2) of one amino acid reacts with the carboxyl group (-COOH) of another amino acid, forming a C–N amide bond and releasing a free water molecule.
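Schematically (R and R′ here stand for the side chains of the two amino acids):

    R–CH(NH2)–COOH + H2N–CH(R′)–COOH → R–CH(NH2)–CO–NH–CH(R′)–COOH + H2O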

And we need to know about the three main ways of conducting experiments in biology and chemistry: in vivo (in a living organism), in vitro (in a test tube) and in silico (on a computer). Modeling of chemical and biological processes is vital due to the complexity and high cost of full-scale experiments.

So, here we go: in silico.

A brief history of proteins “in numbers”

One of the largest protein databases, the Protein Data Bank (PDB), contains information about 225 thousand proteins and other structures, while the total number of known proteins that have not been described in detail is about 200 million.

The main methods previously used to predict the shapes of molecules were based either on solving equations describing the distribution of electrons in atoms (as a development of approaches to the Schrödinger equation) or on the geometric properties of atoms obtained as experimental data (X-ray diffraction, tunneling microscopy, statistical and other methods).

Solving the Schrödinger equation for a single atom provides information about orbitals: single-electron wave functions that give insight into the distribution of electrons in the atom. Solving the equation exactly for even a diatomic system remains out of reach, whatever the supercomputer; in practice, only numerical approximations are available.

The geometric approach allows us to understand how chemical reactions occur for relatively simple molecules. Due to quantum-mechanical effects and thermal fluctuations of the atoms in a molecule, the molecule continuously trembles and changes its shape, settling into certain stable states. The figure below shows the possible changes: changes in the angles between bonds to neighboring atoms (αi, βi, γi and others), rotation around an interatomic bond (ωi, φi−1, ψi−1), and changes in the distance between neighboring atoms (the interatomic bond acts as a spring/oscillator). Note that as they move, the atoms cannot take arbitrary positions, only those precisely determined by the energy levels of each molecule.

Linear molecule and its degrees of freedom. Source: https://arxiv.org/pdf/2202.01079
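To make torsion angles such as φ and ψ less abstract, here is a minimal numpy sketch (my own illustration, not code from the sources) that computes the dihedral angle defined by four consecutive atoms:

    import numpy as np

    def dihedral(p0, p1, p2, p3):
        """Dihedral (torsion) angle, in degrees, defined by four consecutive atoms."""
        b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
        b1 = b1 / np.linalg.norm(b1)
        # Project b0 and b2 onto the plane perpendicular to the central bond b1
        v = b0 - np.dot(b0, b1) * b1
        w = b2 - np.dot(b2, b1) * b1
        return np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w)))

    # Four atoms forming an exact 90-degree torsion
    atoms = [np.array(p, dtype=float) for p in
             [(1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1)]]
    print(dihedral(*atoms))  # -> 90.0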

A short lyrical digression: why are there no cyclic or branching protein structures in nature? As laypeople, we don’t know (if you do, tell us in the comments). But logically, protein structures in nature are formed in the process of replication, and spontaneous replication is impossible for branching structures; moreover, it is possible only for a limited share of linear and cyclic structures. In laboratory conditions, however, branching structures can be synthesized and studied, and this is where computer modeling helps us.

For proteins consisting of a large number of amino acids, the abstract description in the form of chemical formulas that we grew accustomed to in chemistry lessons had to be extended. Biologists and chemists operate with a four-level representation:

  • Primary Protein Structure – the one-dimensional sequence of amino acids.

  • Secondary Protein Structure – local folding into repeating structures, such as helices and extended strands.

  • Tertiary Protein Structure – three-dimensional folding through the interaction of side chains (atoms and groups of atoms protruding away from the main chain, which consists mainly of carbon atoms).

  • Quaternary Protein Structure – a structure formed by several chains of interacting amino acids. To visualize protein structure, representations in the form of lines, ribbons, rod-like bonds between atoms, or the surface of the electron cloud surrounding the protein molecule are used.

The figure below illustrates the structure of the human fetal deoxyhaemoglobin protein (PDB: 1FDH).

Different representations of protein structure. Source: https://arxiv.org/pdf/2409.17726
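If you want to explore this structure yourself, here is a minimal sketch using Biopython (assuming `pip install biopython`); it downloads 1FDH and walks the quaternary structure chain by chain:

    from Bio.PDB import PDBList, PDBParser

    # Download human fetal deoxyhaemoglobin (PDB: 1FDH) and parse it
    path = PDBList().retrieve_pdb_file("1FDH", pdir=".", file_format="pdb")
    structure = PDBParser(QUIET=True).get_structure("1FDH", path)

    # Quaternary structure: several chains, each a sequence of residues
    for chain in structure[0]:
        residues = [r for r in chain if r.id[0] == " "]  # skip water/hetero groups
        print(f"chain {chain.id}: {len(residues)} residues, "
              f"first three: {[r.get_resname() for r in residues[:3]]}")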

I think the above has convinced you that the study of protein folding is an incredibly complex field. Let’s take a short look at the history of how protein structure prediction problems were solved before 2024.

1994 – Critical Assessment of protein Structure Prediction

Today we are already accustomed to hackathons. But what about a protein folding championship? In 1994, the first CASP championship (Critical Assessment of protein Structure Prediction) was held, and it has been held every two years since. Teams of researchers receive amino acid sequences and compete in predicting the secondary and tertiary structures of previously unexplored proteins. Neither the organizers, nor the experts, nor the participants know the structures of the test proteins until the end of the prediction stage.

2005-2008 – Rosetta@home and Foldit

Back in 2003, the Human Genome Project had sequenced 85% of the human genome. Researchers had determined the amino acid sequences of almost all proteins in the human body and decided: since we have studied everything that exists, let’s create in a virtual test tube something that does not exist: for example, new, more active proteins, or ways to change the structures responsible for serious diseases.

This is how the volunteer computing project Rosetta@home was born: researchers used the pooled computing resources to predict the tertiary structure of proteins and the interactions of protein structures. But there was one problem: there were not that many scientists who could invent new molecules in their free time. And to speed up the victory over cancer and Alzheimer’s, enthusiasts decided to popularize protein folding and turn a game that is tough even for scientists into a Rubik’s cube that anyone can practice with. The main enthusiast of this initiative was this year’s future laureate, David Baker.

It was he and his colleagues who developed the online puzzle Foldit, where people without specialized knowledge of chemistry can “twist” an amino acid sequence to solve a specific problem. After all, 38,000 heads (the number of Rosetta@home users in 2011) is good, but 240,000 (the number of Foldit players in its release year) is better. The goal of the puzzle is to find the three-dimensional structure of a particular protein with the lowest free energy. Each task is published on the site for a set period of time, during which users compete with each other. Several scientific breakthroughs were made with Foldit’s help: for example, deciphering the structure of a protease of the virus that causes AIDS in monkeys and redesigning the protein that catalyzes the Diels-Alder reaction.

2017–2024 – AlphaFold

It’s time to talk about the most interesting part. Watch closely:

  • In 2010, the startup DeepMind Technologies, focused on artificial intelligence, was founded in London.

  • In 2014, the company was acquired by Google.

  • In 2016, the AlphaGo model, developed by the DeepMind team, won a Go match against world champion Lee Sedol.

  • In 2017, AlphaZero earned the highest chess rating by defeating the then-strongest chess engine Stockfish 8 in a 100-game match. DeepMind went on to train a whole family of Alpha models that achieved brilliant successes in different areas.

  • In 2018, AlphaFold, designed for protein structure prediction, was added to the Alpha family of models. Development of the model was led by Demis Hassabis and John Jumper. The research team entered the 13th CASP championship and took first place.

  • In 2020, AlphaFold2 tackled CASP’s main problem again, doing it so well that the world’s leading scientific journal Nature called it a “breakthrough.”

To its credit, Google has made the AlphaFold and AlphaFold2 models publicly available for use by other researchers, and has also created the AlphaFold Protein Structure Database and populated it with the 200 million protein structures computed by DeepMind. For those interested, links to the primary sources are given in the next section.

And now about what's inside.

Inside AlphaFold

According to the description, AlphaFold “directly predicts the 3D coordinates of all heavy atoms for a given protein using the primary amino acid sequence and aligned homolog sequences as input.”

I recommend reading the original sources; you won’t regret it.

The AlphaFold2 architecture was disclosed by DeepMind in an article in the journal Nature published on July 15, 2021 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8371605/). The most interesting material is in the Supplementary Information of that article. Machine learning experts can follow the link and study it right away: https://pmc.ncbi.nlm.nih.gov/articles/instance/8387230/bin/41586_2021_3819_MOESM1_ESM.pdf.

I will share my opinion on what was important, interesting, really very complex and breakthrough in AlphaFold2.

The first thing that attracts attention is the training dataset. The DeepMind team used data on 250+ thousand proteins and their properties (such as sizes, configurations, bond angles, interatomic distances and others) from the Protein Data Bank. The next step was to create a separate model that generates synthetic examples from real data (data augmentation) for the AlphaFold2 training dataset: 25% of the examples were original proteins from the Protein Data Bank, and 75% were synthetic. When checking the quality of the synthetic examples, the Kullback-Leibler divergence for real-synthetic example pairs was used as the metric.
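As a reminder of how such a comparison works, here is a minimal sketch (my illustration; the actual filtering procedure in the paper is more involved) that compares two discretized distance distributions via the KL divergence in scipy:

    import numpy as np
    from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

    rng = np.random.default_rng(0)
    real = rng.normal(6.0, 1.0, 10_000)       # e.g. inter-residue distances, Å
    synthetic = rng.normal(6.2, 1.1, 10_000)  # a synthetic counterpart

    bins = np.linspace(0.0, 12.0, 61)
    p, _ = np.histogram(real, bins=bins, density=True)
    q, _ = np.histogram(synthetic, bins=bins, density=True)
    p, q = p + 1e-9, q + 1e-9                 # avoid zero bins

    print(f"KL(real || synthetic) = {entropy(p, q):.4f}")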

The AlphaFold2 model works with multiple sequence alignments (MSAs): sets of evolutionarily related amino acid sequences aligned against one another. The input to AlphaFold2 consists of embeddings of these MSAs together with features of residue pairs from known proteins (from the training dataset). The network consists of two main modules: the Evoformer module and the structure module.

High-level diagram of the AlphaFold2 neural network architecture

Evoformer operates on the MSA embeddings, as well as on data about the geometry of real molecules (angles, distances, configuration). The module includes 48 sequential blocks and uses the attention mechanism, including for calculating angles between atoms. To quote DeepMind: “The key innovations in the Evoformer block are new mechanisms for sharing information within MSA and pairwise representations that allow direct reasoning about spatial and evolutionary relationships.” In this way, Evoformer implements a geometric approach to calculating the shape of protein molecules.

As a result, Evoformer produces:

  • the Nseq × Nres array, which represents the processed MSA (Nseq is the number of amino acid sequences, Nres is the number of residues – the very groups of atoms that we talked about in the “Geometric Chemistry” section);

  • an Nres × Nres array that represents pairs of residues.

The output of Evoformer is fed to the input of the structure module for reconstruction.
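To fix these dimensions in mind, here is a toy numpy sketch of the two arrays (all sizes, including the feature-channel counts, are illustrative values of mine, not the exact training configuration):

    import numpy as np

    n_seq, n_res = 512, 100  # aligned sequences and residues (toy values)
    c_m, c_z = 256, 128      # per-element feature channels (illustrative)

    msa_repr = np.zeros((n_seq, n_res, c_m))   # processed MSA: Nseq x Nres
    pair_repr = np.zeros((n_res, n_res, c_z))  # residue pairs: Nres x Nres

    print(msa_repr.shape, pair_repr.shape)     # (512, 100, 256) (100, 100, 128)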

One of the 48 blocks of the Evoformer module of the AlphaFold2 neural network

The structure module reconstructs the predicted shape of the molecule by predicting a rotation and translation for each protein residue. The module iteratively models the evolution of the initial state (the positions of atoms and bonds), repeatedly feeding its output back to its input. This iterative refinement (the authors call it “recycling”) significantly improves accuracy at a small cost in training time. The Invariant Point Attention (IPA) module ensures that the N-Cα-C atom sequence of the protein backbone is preserved while the shape of the molecule is being reconstructed.
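Schematically, the recycling loop looks roughly like this (a sketch with stub functions; `evoformer` and `structure_module` here are hypothetical stand-ins, not DeepMind’s actual code):

    # Hypothetical stand-ins for the real modules (illustration only)
    def evoformer(sequence, msa, prev_structure):
        return "msa_repr", "pair_repr"

    def structure_module(msa_repr, pair_repr):
        return "predicted structure"

    def predict(sequence, msa, n_recycle=3):
        """Recycling: the previous prediction is fed back in as an extra input."""
        structure = None
        for _ in range(n_recycle + 1):
            msa_repr, pair_repr = evoformer(sequence, msa, structure)
            structure = structure_module(msa_repr, pair_repr)
        return structure

    print(predict("MKTAYIAK", msa=[]))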

Invariant Point Attention module

During the 3D structure reconstruction process, many constraints must be satisfied, including the triangle inequality for distances. For this, a combination of sequential triangle update operations and a triangular self-attention module is used; this combination is more accurate and efficient than the attention mechanism alone or triangle updates alone.
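What the triangle inequality demands of a matrix of inter-residue distances is easy to check directly; a small numpy sketch (illustration only):

    import numpy as np

    def triangle_violations(d):
        """Count pairs (i, j) whose direct distance exceeds some path i -> k -> j."""
        sums = d[:, None, :] + d[None, :, :]  # sums[i, j, k] = d[i, k] + d[k, j]
        return int(np.sum(d > sums.min(axis=2) + 1e-9))

    d = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.0],
                  [2.0, 1.0, 0.0]])
    print(triangle_violations(d))  # 0: a consistent set of distances
    d[0, 2] = d[2, 0] = 5.0        # now d[0, 2] > d[0, 1] + d[1, 2]
    print(triangle_violations(d))  # 2: the pair (0, 2) and its mirror (2, 0)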

Triangular self-attention module

When reconstructing the geometry of a predicted protein, AlphaFold2 uses quaternions, an extension of the complex numbers used in mechanics to describe the motion of a rigid body. A quaternion represents a number as q = a + bi + cj + dk, where a, b, c, d are real numbers and i, j, k are imaginary units with the properties i² = j² = k² = ijk = −1.

Quaternions are a gem of higher mathematics. They make it convenient to represent rotations of objects in space, simplify calculations and minimize the errors that can accumulate in them. The use of quaternions confirms how thoroughly the problem was worked through in terms of the geometric properties of molecules.
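For example, a unit quaternion compactly encodes a rotation; a minimal scipy illustration (not tied to AlphaFold2’s actual code):

    import numpy as np
    from scipy.spatial.transform import Rotation

    # Quaternion for a 90-degree rotation about the z axis, in scipy's
    # scalar-last (x, y, z, w) order: q = cos(45°) + k*sin(45°)
    q = [0.0, 0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)]
    rotation = Rotation.from_quat(q)

    print(np.round(rotation.apply([1.0, 0.0, 0.0]), 6))  # -> [0. 1. 0.]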

The structure module also implements a mechanism that takes into account the positions of neighboring groups of atoms and the peptide bonds that arise between them.

As a result, the output of AlphaFold2 is the positions of the atoms that make up the protein molecule whose shape the network models.

With all this, AlphaFold2 achieves very high accuracy and has proven much more accurate than competing methods. Its median accuracy is 0.96 Å (angstroms; 1 Å = 10^-10 m), which is comparable to the width of a carbon atom, 1.4 Å. At the same time, AlphaFold2 can be used to analyze proteins with long chains and packed domains without significant loss of accuracy.

And as a cherry on top, the model is publicly available in a repository with a description of how to deploy the image to Google Cloud. AlphaFold2 is quite economical with computing resources: it needs only 12 vCPUs, 85 GB of RAM and one A100 GPU to run. According to DeepMind, AlphaFold2 has already been used more than 2 million times.

Instead of a conclusion

We often hear that the Nobel Prize is awarded unfairly: sometimes it goes to biologists for chemistry, sometimes to computer scientists for physics. Indeed, David Baker is a bioinformatician and biochemist who has devoted almost his entire life to designing proteins and predicting their tertiary structure, while the contributions of Demis Hassabis and John Jumper lie more in data science and computational biology and chemistry; all of them received Nobel Prizes for creating neural-network-based tools and applying them. Should the public be outraged? I’d say it is something to be happy about. The huge amount of cross-disciplinary research, and the fact that it is becoming difficult to draw clear lines between physics, chemistry, biology, medicine and computer science, says more about how deeply we have penetrated the essence of the world around us than about any bias of the Nobel committee.

The main conclusion to draw from this year’s awards is that the Nobel Committee, and the scientific world as a whole, have recognized a new reality in which neural networks are as much a tool in the researcher’s hands as a microscope or a particle accelerator. Science used to accumulate experimental facts and theories that allowed us to interpret reality and had predictive power. Now the artifacts of science have been joined by machine learning models, and this is noteworthy: models are entities that have predictive power but are not interpretable.

