Yes, again for an AI model
The Nobel Prize in Chemistry was awarded to John Jumper and Demis Hassabis from Google DeepMind, as well as bioinformatician David Baker:
John Jumper and Demis Hassabis learned to predict protein structure using the AlphaFold 2 AI model.
David Baker succeeded in another area: the creation of new kinds of proteins (computational protein design). In other words, Baker devises amino acid sequences that produce artificial proteins.
Scientists were able to solve a 50-year-old problem: predicting the three-dimensional structure of a protein from its amino acid sequence. For many years this was considered impossible. Creating new artificial proteins that had never existed in nature also seemed out of reach.
The role of proteins and why they are difficult to study
Before moving on to the laureates' work, we need to define the object of their research: proteins. What are they? What is their role? Why are they difficult to study?
What are proteins and amino acids
Proteins control all the chemical reactions at the core of life, which is why they are sometimes called “the chemical instruments of life.” They are large biological molecules made up of a chain of amino acid residues joined in a particular order and folded into a particular shape, and they perform thousands of functions inside every cell of a living organism.
Essentially, amino acids are the building blocks of life. The shape that a chain of amino acids takes determines the biological function of the protein: whether it will be an enzyme, a transport protein or, for example, a regulator. The shape also determines how the protein interacts with other proteins.
Understanding and controlling protein folding is perhaps the most important task for the fundamental and applied sciences. Many diseases begin with a malfunction of proteins. For example, the SARS-CoV-2 virus acts on several targets in the body at once. If they are studied and blocked in time, the virus can be prevented from multiplying. The problem is that finding out the shape of a protein is quite difficult: it takes long and expensive research. We'll talk more about this in the next section.
What shapes a protein takes: hierarchical structure and the difficulty of determining it
Proteins have four levels of organization. The link between amino acid sequence and three-dimensional structure was postulated in Anfinsen's dogma, one of the key ideas in this field.
Primary structure. The simplest level is a chain of amino acid residues connected in a specific sequence, also called a polypeptide. The chain is built from 20 standard amino acids, so the sequence can be written as a string of letter codes, either three-letter or one-letter. More than twenty amino acids exist, but most proteins make do with this set.
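To make the two notations concrete, here is a minimal Python sketch that converts a sequence written in three-letter codes into one-letter codes. The mapping is the standard one for the 20 common amino acids; the function name to_one_letter and the dash-separated input format are assumptions made for this illustration.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_seq: str) -> str:
    """Convert e.g. 'Met-Lys-Thr-Ala' into 'MKTA'.

    The dash-separated input format is an assumption of this sketch.
    Raises KeyError for a code outside the 20 standard residues.
    """
    residues = three_letter_seq.split("-")
    # capitalize() normalizes 'MET' or 'met' to 'Met' before the lookup.
    return "".join(THREE_TO_ONE[r.strip().capitalize()] for r in residues)


if __name__ == "__main__":
    print(to_one_letter("Met-Lys-Thr-Ala"))
```

The one-letter form is the one usually fed to prediction tools, since a whole protein then fits into a single string.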
The next levels determine the shape of the protein, or more precisely its spatial structure.
Secondary structure. Here, consecutive stretches of the amino acid chain form stable local elements held together by hydrogen bonds. The most common elements are α-helices and β-sheets.
Tertiary structure. The final shape that a protein takes after folding, held together by covalent, hydrogen and ionic bonds and other interactions. Proteins can take many different shapes, but they are typically globular or fibrous. The former resemble spheres and dissolve well in water (example: egg white), while the latter form threads and fibers and are insoluble in water (example: hair and muscles).
Quaternary structure. Some proteins have a fourth level of organization: a complex of several molecules, each with its own tertiary structure.
Gold standard for protein structure determination
The structure of a protein can be determined experimentally using X-ray crystallography. The method has been applied to proteins since the 1950s and is regarded as the “gold standard” in this field.
Although the method is considered accurate, it is costly: determining the structure of a single protein can take months and requires expensive equipment.
In addition, crystallography can run into problems at the very first stage, obtaining crystals, because protein crystallization requires specific conditions. For example, even astronauts can be enlisted to grow crystals in zero gravity.
Still, solved protein structures have gradually accumulated. In 1971, Nature published a note announcing the start of a dedicated database, the Protein Data Bank (PDB), in which such structures would be stored. Before the Internet, the PDB was distributed as recordings on tapes, and by the 2000s it had become available to a wider audience. Since then the database has grown exponentially: today the number of solved structures is approaching 200 thousand.
AlphaFold 2, like many other neural networks that predict protein structure, was created based on data from the PDB.
How CASP stimulated the creation of new solutions
Predicting protein structure is such an important and complex task that in 1994 a dedicated competition was created for it: the Critical Assessment of protein Structure Prediction (CASP). In it, scientists apply different algorithms to predict the structures of proteins of varying difficulty and try to outperform their competitors in accuracy.
The models' predictions are compared with structures that crystallographers obtained experimentally in the laboratory. The GDT (global distance test) score, from 0 to 100, shows how closely the modeled structure agrees with the experimental data.
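A simplified version of such a score can be sketched in a few lines of Python. The sketch follows the commonly used GDT_TS definition (the average, over cutoffs of 1, 2, 4 and 8 Å, of the share of residues whose predicted position lies within the cutoff of the experimental one). It assumes one point per residue and that the two structures are already superimposed; the real test also searches over many superpositions, which is omitted here. The function name gdt_ts and the toy coordinates are made up for the example.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS on pre-superimposed structures (an assumption).

    predicted, experimental: equal-length lists of (x, y, z) tuples in
    angstroms, one point per residue. Returns a score from 0 to 100.
    """
    if not predicted or len(predicted) != len(experimental):
        raise ValueError("structures must be non-empty and of equal length")
    # Per-residue deviation between model and experiment.
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # Share of residues within each cutoff, then the average over cutoffs.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    # Toy model: residues are off by 0.5, 1.5, 3.0 and 10.0 angstroms.
    predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]
    print(gdt_ts(predicted, experimental))  # 56.25
```

In the toy example one, two, three and three of the four residues fall within the 1, 2, 4 and 8 Å cutoffs, which gives (25 + 50 + 75 + 75) / 4 = 56.25.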
Until 2018, the accuracy of CASP winners could not exceed 40%. A breakthrough came from the DeepMind team with the AlphaFold 1 AI model, which reached almost 60%. At the next competition, CASP 14, the DeepMind team again took first place, raising its prediction accuracy to 92% with AlphaFold 2. According to one of the creators of CASP, this result is comparable to experimental data: at this level it is already difficult to say who is right, the model or the experimental biologist, and the difference is within the margin of error.
David Baker: Computer-aided protein design thanks to Rosetta algorithms
Baker worked on creating new proteins that do not exist in nature. In 1998, David Baker and his team took part in CASP 3 with the Rosetta algorithm, and they continued to refine it until 2003.
The Rosetta algorithm helps researchers design proteins with specific shapes and functions: it starts from the desired three-dimensional structure and works backwards to calculate a corresponding amino acid sequence. To search for and evaluate candidate structures, Rosetta uses an optimized Monte Carlo method.
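The idea of a Monte Carlo search over sequences can be illustrated with a toy Python sketch. This is not Rosetta's code or its energy function: the "energy" here simply counts positions where a residue's class (hydrophobic or polar) disagrees with a target pattern, and the hydrophobic set, the pattern format and all function names are assumptions made for the example. What it does show is the Metropolis rule: a random mutation is always accepted if it lowers the energy and is sometimes accepted if it raises it.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFW")  # simplified classification (assumption)


def toy_energy(sequence, target_pattern):
    """Number of positions whose class ('H' or 'P') misses the target."""
    mismatches = 0
    for residue, wanted in zip(sequence, target_pattern):
        actual = "H" if residue in HYDROPHOBIC else "P"
        mismatches += actual != wanted
    return mismatches


def design_sequence(target_pattern, steps=2000, temperature=0.5, seed=0):
    """Metropolis Monte Carlo search for a sequence matching the pattern.

    Returns the best sequence seen and its toy energy.
    """
    if not target_pattern:
        raise ValueError("target_pattern must not be empty")
    rng = random.Random(seed)
    current = [rng.choice(AMINO_ACIDS) for _ in target_pattern]
    current_e = toy_energy(current, target_pattern)
    best, best_e = current[:], current_e
    for _ in range(steps):
        # Propose a single random point mutation.
        candidate = current[:]
        candidate[rng.randrange(len(candidate))] = rng.choice(AMINO_ACIDS)
        candidate_e = toy_energy(candidate, target_pattern)
        delta = candidate_e - current_e
        # Metropolis criterion: accept downhill moves always,
        # uphill moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = current[:], current_e
    return "".join(best), best_e


if __name__ == "__main__":
    print(design_sequence("HPPHHPPH"))
```

Accepting an occasional uphill move is what lets the search escape local minima; a real design tool would replace toy_energy with a physically motivated scoring function.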
The real breakthrough for Baker and his team came when they managed to create a new artificial protein, Top7. It could fold into a three-dimensional structure on its own and did not resemble any natural protein, although it had no useful function either.
Echoes of Rosetta in the future
In 2005, the Rosetta@home project was launched. This volunteer distributed-computing project helped to get around the shortage of computing power needed to compute the three-dimensional structures of proteins.
In 2008, the Foldit project grew out of Rosetta@home. It is a puzzle game in which players compete to fold proteins. The most successful solutions have become the subject of academic papers.
In an MIT note, Demis Hassabis said that he had played Foldit. It can be said that this puzzle indirectly led DeepMind to the problem of protein folding and to the development of AlphaFold. And if you would like to try yourself as a scientist too, all you need is time and desire.
Impact of computer-aided protein design on the global community
This breakthrough has allowed scientists to create proteins and drugs with new properties, both therapeutic (proteins that can block the SARS-CoV-2 spike protein) and environmental (sensor proteins that detect opioids).
John Jumper and Demis Hassabis: Protein structure prediction using AI
AlphaFold 1: DeepMind has made a breakthrough in protein structure prediction
AlphaFold 1 was trained on several publicly available datasets:
Protein Data Bank (PDB) is a database containing three-dimensional structures and amino acid sequences of almost all proteins whose structure has been determined by mankind.
Another database, UniProt, contains the amino acid sequences (without structures) of another 200 million proteins.
AlphaFold 1 is based on a convolutional neural network (ConvNet, CNN). Such networks are widely used for image recognition in computer vision, and AlphaFold 1 applies the same strategies that CNNs use to identify images. Its input is derived from a multiple sequence alignment (MSA): a two-dimensional matrix (row: organism, column: position in the alignment, cell: amino acid code) from which hierarchical patterns can be extracted.
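To make the matrix picture concrete, here is a minimal Python sketch of an MSA and of one very simple pattern that can be read from it: how conserved each column is. The four aligned sequences and the organism labels are invented for illustration, and column_conservation is a hypothetical helper, not something taken from AlphaFold.

```python
from collections import Counter

# Toy alignment: one row per organism, one column per aligned position.
# '-' marks a gap. The sequences are invented for this example.
MSA = {
    "human":   "MKTAYIAK",
    "mouse":   "MKTAYLAK",
    "chicken": "MRTAYIAK",
    "fish":    "MKSAY-AK",
}


def column_conservation(rows):
    """Share of rows that carry the most common symbol in each column."""
    if not rows:
        raise ValueError("alignment must contain at least one row")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows of an alignment must have equal length")
    scores = []
    for column in zip(*rows):  # iterate over the matrix column by column
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(rows))
    return scores


if __name__ == "__main__":
    print(column_conservation(list(MSA.values())))
    # [1.0, 0.75, 0.75, 1.0, 1.0, 0.5, 1.0, 1.0]
```

Fully conserved columns hint at positions that matter for the protein, while pairs of columns that change together across organisms hint at residues that sit close to each other in the folded structure.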
It was these patterns that the Google DeepMind team learned to exploit, winning CASP 13 with an accuracy of almost 60%. But this was not enough: scientists could rely on the neural network in their work only if its accuracy exceeded 90%.
AlphaFold 2: the role of Transformer architecture in protein structure prediction
From CNN to Transformer
In 2020, the AI model was rebuilt and improved. Instead of a CNN, AlphaFold 2 uses the more advanced Transformer architecture, the same one that underlies, for example, the GPT models and BERT. One of the key innovations of AlphaFold 2 is its use of the attention mechanism, which lets the AI model focus on the most significant parts of the protein sequence and structure when making predictions. Attention allows the system to better capture interactions between different parts of a protein that are critical to its folding and function.
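The core of the attention mechanism fits into a short Python sketch. It implements generic scaled dot-product attention as used in Transformers in general; AlphaFold 2 builds more specialized attention variants on top of this idea, which are not reproduced here. The function names and the toy vectors are assumptions for the example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain Python lists.

    queries: list of vectors of size d; keys: list of vectors of size d;
    values: one vector per key. Returns (outputs, weights).
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)  # how much to "look at" each position
        # Output is the weighted average of the value vectors.
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    outputs, weights = attention(
        queries=[[1.0, 0.0]],
        keys=[[1.0, 0.0], [1.0, 0.0]],
        values=[[2.0], [4.0]],
    )
    print(outputs, weights)  # [[3.0]] [[0.5, 0.5]]
```

Because every position can attend to every other position, two residues that are far apart in the chain can influence each other directly, which is what matters for folding.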
How AlphaFold 2 works
AlphaFold 2 learns from known protein structures and sequences and uses this data to make its predictions:
Accepts a protein sequence;
Extracts features from the sequence, including information about the distances between pairs of amino acids and the angles between bonds;
Models the folding process, predicting the most likely 3D protein structure, as well as the distances and angles between all pairs of amino acids in the protein;
Refines the initial structure prediction by adjusting the angles and distances between amino acids, and compares predictions with real data for other proteins;
Constructs the 3D structure of a protein as a set of coordinates for each amino acid in the chain.
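The steps above work with two views of the same structure: coordinates per amino acid and distances between pairs of amino acids. A small Python sketch shows how one follows from the other. It assumes a single representative point per residue, which is a simplification, and the coordinates and the function name distance_matrix are made up for the example.

```python
import math


def distance_matrix(coords):
    """Pairwise distances (in angstroms) between residue positions.

    coords: list of (x, y, z) tuples, one point per residue
    (a simplification assumed for this sketch).
    """
    return [[math.dist(a, b) for b in coords] for a in coords]


if __name__ == "__main__":
    # Four residues placed on the corners of a 3.8 angstrom square (toy data).
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
    for row in distance_matrix(coords):
        print([round(d, 2) for d in row])
```

The resulting matrix is symmetric with zeros on the diagonal; going the other way, from predicted distances and angles to coordinates, is the hard part that the model has to learn.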
The main work of AlphaFold 2 is done in two modules: Evoformer and the Structure Module.
Evoformer works on two representations in parallel: it takes the multiple sequence alignment (MSA) representation and the pair representation as input and returns improved versions of both as output.
The AlphaFold 2 structure module receives the updated pair representation and MSA representation from Evoformer. First, it turns them into the backbone of a 3D structure. Then it completes the model by placing the amino acid side chains and refining their positions. After that, AlphaFold 2 performs an iterative process called “recycling”: the resulting structure is fed back into Evoformer, and the cycle is repeated until the modeled structure acquires the desired characteristics.
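The control flow of recycling can be sketched in Python as a loop around the two modules. This shows only the shape of the loop, not AlphaFold 2's code: evoformer, structure_module and has_converged are hypothetical callables passed in by the caller, and the convergence test and the cap on the number of cycles are assumptions of this sketch. The toy stand-ins below replace the neural networks with simple arithmetic so that the loop can actually run.

```python
def run_with_recycling(msa_repr, pair_repr, evoformer, structure_module,
                       has_converged, max_cycles=10):
    """Repeat Evoformer -> structure module, feeding the structure back.

    All three callables are hypothetical placeholders for this sketch.
    Returns the final structure and the number of cycles performed.
    """
    structure = None
    for cycle in range(1, max_cycles + 1):
        # The previous structure (None on the first pass) is recycled.
        msa_repr, pair_repr = evoformer(msa_repr, pair_repr, structure)
        new_structure = structure_module(msa_repr, pair_repr)
        if structure is not None and has_converged(structure, new_structure):
            return new_structure, cycle
        structure = new_structure
    return structure, max_cycles


# Toy stand-ins: the "representation" is a single number moving toward 1.0.
def toy_evoformer(msa_repr, pair_repr, previous_structure):
    # previous_structure is unused in this toy; a real model would embed it.
    return msa_repr, pair_repr + 0.5 * (1.0 - pair_repr)


def toy_structure_module(msa_repr, pair_repr):
    return pair_repr


def toy_has_converged(old, new):
    return abs(new - old) < 0.1


if __name__ == "__main__":
    print(run_with_recycling(0.0, 0.0, toy_evoformer,
                             toy_structure_module, toy_has_converged))
    # (0.9375, 4)
```

In the toy run the structure changes by 0.25, then 0.125, then 0.0625 between cycles, so the loop stops on the fourth cycle.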
AlphaFold 3: a new architecture for generating 3D structures of proteins, DNA and RNA with atomic precision
In May 2024, a few months before the Nobel Prize was awarded for AlphaFold 2, Google DeepMind announced the third version of the AI model, AlphaFold 3. The new model makes the most complex predictions yet: how composite biological structures made up of proteins, nucleic acids, ions and other components will look and interact. The third version of AlphaFold moves away from the pure Transformer design: its structure generation is based on a diffusion model.
Impact of AlphaFold 2 on the global community
More than 2 million scientists from 190 countries use AlphaFold 2 for fast and accessible studies of protein structure. This helps to develop new drugs and to advance science and medicine.
Sources
https://www.nobelprize.org/prizes/chemistry/2024/popular-information/
https://www.nobelprize.org/prizes/chemistry/2024/press-release/
https://www.nobelprize.org/uploads/2024/10/popular-chemistryprize2024-3.pdf
https://www.nobelprize.org/uploads/2024/10/advanced-chemistryprize2024.pdf
https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology