FREED++. Accelerating the search for new drugs using neural networks

Hello! My name is Alexander Telepov, I am a researcher at AIRI. Our team is focused on applying deep learning to the life sciences. Our interests include tasks such as materials design, solubility analysis, and drug discovery. I would like to talk about the latter in more detail.

Many have heard that neural networks are used today to search for new compounds. Take, for example, the much-discussed AlphaFold 3 from DeepMind, which predicts the three-dimensional structure of various molecular complexes. There are other problems where neural networks have overtaken classical numerical methods, and one of the most striking examples is the generation of drug molecules. Among the most notable approaches to this problem is FREED, a framework for generating drug molecules based on reinforcement learning. But it, too, turned out to be far from ideal.

Recently, our research team reproduced, thoroughly studied, and significantly improved FREED. Our results will appear in the journal TMLR (the article is available on arXiv). Here I will briefly talk about FREED itself and its problems, as well as the essence of our fixes to this approach.

Drug development

It is difficult to find a person today who has never taken any medicine. It is equally difficult to overestimate the importance of pharmacology and the science of drug development. For the sake of clarity, we will call drugs substances used in the prevention, diagnosis and treatment of diseases.

Typically, such development occurs according to the drug-target scheme in three stages:

  1. First, a target protein is identified—a key molecule involved in a specific metabolic or signaling pathway associated with a particular disease state.

  2. The three-dimensional structure of the protein is determined (usually by crystallographic methods).

  3. A molecule with the desired set of properties is found that selectively interacts with the target protein.

The latter task is the subject of medicinal chemistry. In its simplest form, the main task of medicinal chemistry can be formalized as follows:

Suggest a chemical compound (small molecule, ligand) that, when bound to a protein of interest, significantly changes its functional state.

Quantitatively, this effect is expressed by the change in Gibbs energy of the protein-ligand complex, \Delta G_{complex}. Estimating this change is a separate, complex problem solved by molecular modeling methods, usually by molecular docking. It should be noted that the ability to significantly change the state of a specific target protein is a key, but not the only, property that determines whether a substance can be used as a drug. When developing drugs, many other properties must be taken into account: selectivity of the interactions formed (the molecule must interact selectively with the target protein, since effects on other proteins in the body can cause undesirable side effects), toxicity, reactivity, ease of synthesis, and so on.

The multifactorial nature of the problem led researchers to the idea of applying deep learning methods here. In this setting, the task is naturally formulated in terms of generating a molecular graph whose nodes are atoms and whose edges correspond to chemical bonds. One of the DL approaches to this construction is to reduce the problem to reinforcement learning, provided we can somehow evaluate the quality of the generated graph. This approach allows us to directly optimize the desired properties of the molecule, which agrees well with the problem statement from the previous paragraph. Let's talk about RL in more detail.

Reinforcement learning

Reinforcement learning is a framework for optimizing the behavior of an agent in an environment. The agent interacts with the environment in a closed loop. At each iteration of the interaction, the agent, being in the state s_t \in \mathcal{S}, performs the action a_t \in \mathcal{A}. After the action is performed, the environment transfers the agent to the state s_{t+1} (s_{t+1} \sim \mathcal{P}(s_t, a_t)) and issues a reward r_t \in \mathbb{R} (r_{t} \sim \mathcal{R}(s_t, a_t, s_{t+1})). The sequence (s_0, a_0, r_0, s_1, a_1, r_1, ...) is called a trajectory and is denoted by \mathcal{T}. The agent's task is to find a policy \pi(s_t) (the rule by which actions are sampled, a_t \sim \pi(s_t)) that maximizes the expected discounted return over all possible trajectories:

\mathbb{E}_{\mathcal{T} \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right] \xrightarrow[]{} \max_{\pi}.

Formally, to state the learning problem, one needs to define the tuple (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma), where \mathcal{S} is the space of possible states, \mathcal{A} is the action space, \mathcal{P} is the transition dynamics function, \mathcal{R} is the reward function, and \gamma is the discount factor.
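
For readers less familiar with RL, here is a minimal sketch of the interaction loop and the discounted return defined above. It assumes a generic Gymnasium-style environment and a hypothetical `policy` callable; the actual training in FREED (with SAC) is considerably more involved.

```python
# A minimal sketch of the agent-environment loop, assuming a Gymnasium-style
# environment and a hypothetical `policy` callable (not the FREED training code).
import gymnasium as gym


def rollout(env: gym.Env, policy, gamma: float = 1.0) -> float:
    """Collect one trajectory and return its discounted return."""
    state, _ = env.reset()
    done, ret, discount = False, 0.0, 1.0
    while not done:
        action = policy(state)                                       # a_t ~ pi(s_t)
        state, reward, terminated, truncated, _ = env.step(action)   # s_{t+1}, r_t
        ret += discount * reward
        discount *= gamma
        done = terminated or truncated
    return ret
```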

This problem has been well studied, and today there are a huge number of methods for solving it. For high-dimensional state spaces and spaces with complex structure, deep reinforcement learning methods have proven to be the most effective; in these methods, the state (or state-action) value function and/or the agent's policy are parameterized by neural networks.

Directed generation of molecular graphs

As I noted above, the generation of ligands with desired properties can be formulated within a reinforcement learning framework. For this, the state space is defined as the set of all possible molecular graphs (note that, formally, the state must also include the time remaining until the end of the episode), and an action typically consists of adding or removing an atom or a molecular fragment.

The reward function can be introduced in several ways:

r_{t+1} = J(s_{t+1}) - J(s_t)

or

r_{t+1} = \begin{cases} 0, & \text{if } s_{t+1} \text{ is a non-terminal state}; \\ J(s_{t+1}), & \text{if } s_{t+1} \text{ is a terminal state}. \end{cases}

where J is the property to be maximized. In the simplest case, J can be an estimate of the binding energy to a specific protein (obtained, for example, by molecular docking). Generation can begin with a random, fixed, or empty fragment. For both reward options, solving the reinforcement learning problem is equivalent to finding a ligand with maximum affinity (when \gamma = 1).
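
To make the two definitions concrete, here is a minimal sketch assuming a hypothetical `score` function that returns J for a molecule (for example, a negated docking score). It is an illustration, not the reward code used in FREED.

```python
# Two ways to define the reward, assuming a hypothetical score(mol) -> J.

def dense_reward(prev_mol, next_mol, score) -> float:
    # r_{t+1} = J(s_{t+1}) - J(s_t). With gamma = 1 the rewards telescope,
    # so the return equals J(final molecule) - J(initial fragment).
    return score(next_mol) - score(prev_mol)


def sparse_reward(next_mol, is_terminal: bool, score) -> float:
    # r_{t+1} = J(s_{t+1}) at the terminal step, 0 otherwise.
    return score(next_mol) if is_terminal else 0.0
```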

A schematic representation of this approach is shown in Fig. 1.

Fig. 1. Schematic of generative methods based on reinforcement learning. The agent takes the current state as input and selects an action. The action typically consists of attaching a molecular fragment to the currently assembled state.

It should be noted that until the advent of FREED (i.e., until the end of 2021), the vast majority of works on molecular generation (CVAE, ORGAN, REINVENT, GraphVAE, MolGAN, JT-VAE, GCPN) considered the problem of unconditional generation, in which the target protein is not taken into account (although there were exceptions, such as LiGAN). Such models are suboptimal, since the therapeutic effect of a molecule is largely determined by its binding energy to the target protein.

FREED Method

At the end of 2021, FREED, a promising approach to generating drug molecules based on reinforcement learning, was published at the NeurIPS conference. Unlike most previous works, FREED takes the biological target protein into account when generating ligands. In addition, the method directly optimizes the binding energy of the ligand to the protein. Let us note several more useful properties of FREED:

  1. Fragment-based generation significantly reduces the space of possible actions and states, allowing for efficient training of the RL agent;

  2. FREED works on an extended molecular graph, i.e., a graph with special nodes called attachment points (these nodes mark the places where molecular bonds were broken during fragmentation; see the sketch below). If fragments are attached only at these points, the assembled molecule automatically satisfies the valence rules, provided that the missing hydrogens are added to the molecule after the terminal step.
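
To illustrate what attachment points look like in practice, here is a sketch that builds a small fragment vocabulary with RDKit's BRICS decomposition; the dummy atoms (written as [1*], [5*], etc. in SMILES) mark the broken bonds. This is only an illustration: the exact fragmentation procedure used in FREED may differ.

```python
# A sketch of building a fragment vocabulary with explicit attachment points
# using RDKit's BRICS decomposition (illustrative; FREED's procedure may differ).
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # paracetamol, as an example
fragments = sorted(BRICS.BRICSDecompose(mol))   # set of fragment SMILES

for smi in fragments:
    # Each fragment contains dummy atoms such as [1*] that mark the
    # attachment points where bonds were broken during fragmentation.
    print(smi)
```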

The results described in the FREED paper are truly impressive. However, after conducting a large number of experiments and carefully analyzing the source code accompanying the publication, we found that the proposed method has significant limitations and problems. Namely:

  • the current implementation contains multiple bugs;

  • the comparison protocol is inconsistent between the proposed model and the baselines;

  • the number of selected target proteins is insufficient for quality assessment;

  • the model is overcomplicated and the effects of its individual parts are under-researched.

From FREED to FFREED and FREED++

To overcome the limitations of FREED, we carefully examined the source code of the publication, corrected errors, performed additional ablations, and simplified the proposed model.

Before describing the changes made to the original model, it is necessary to describe the model itself in more detail. However, describing the original directly could mislead the reader, since it contains bugs and inconsistent blocks. Therefore, I will first describe how the corrected FFREED (Fixed FREED) model is structured, and then give an example of one of the bugs found in the original model.

FFREED

Let us consider how the action selection (policy architecture, actor) and the state-action value function (critic) are structured in FFREED.

An action is a composition of three components. The first component is an attachment point on the molecule assembled up to the current step; the new fragment will be attached at this point. The second component is the fragment to be attached, chosen from a fixed vocabulary of fragments obtained during the fragmentation procedure. Finally, the third component is an attachment point on the fragment itself.

The action selection process is shown in Fig. 2a.

Step 1: the current state s is processed by a graph network, and its embedding is combined with the embeddings of the attachment points. Then one of the attachment points is selected as a_1, and its embedding \tilde{a}_1 (more precisely, an embedding conditioned on the state embedding) is used in the next step.

Step 2: \tilde{a}_1 is combined with the embeddings of the available fragments. One of the fragments is selected as a_2, and its embedding \tilde{a}_2 is used in the next step (in fact, the selection of a fragment is arranged somewhat more intricately; this is described in more detail in our article).

Step 3: the attachment point embeddings of the selected fragment are merged with \tilde{a}_2. Finally, one of the fragment attachment points is chosen as a_3.

Critic: The embeddings of all actions are concatenated with the state embedding and processed by the critic (Figure 2b).

Fig. 2. Schematic architectures of actor and critic in the FFREED model.

Below we will describe one of the most significant bugs related to the parameterization of the critic (state-action value function).

In the original implementation, the critic is parameterized by a multilayer perceptron that takes as input a concatenation of 4 vectors:

1) the molecule embedding obtained by a graph neural network;

2) the vector of probabilities associated with the attachment points on the molecule;

3) the one-hot representation of the selected fragment;

4) the vector of probabilities associated with the attachment points on the selected fragment.

There are two significant problems with this implementation. First, the critic does not receive any information about the chosen attachment points, since it only sees a probability distribution over these points. Second, the order in which the probabilities are fed to the critic is determined by the internal representation of the current state, and this order may change as new fragments are attached along the trajectory. As a result, the critic cannot associate high rewards with the choice of particular attachment points, and learning any reasonable policy for choosing attachment points becomes intractable. In our implementation, instead of probability vectors, we feed the critic the embeddings of the chosen attachment points.
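
To make the fix concrete, below is a simplified sketch of the corrected critic: it receives the embeddings of the chosen attachment points and of the chosen fragment, so Q(s, a) actually depends on the chosen action. This is an illustration with made-up names and dimensions, not the authors' exact code.

```python
# Simplified sketch of the FFREED critic input (illustrative, not the exact code).
import torch
import torch.nn as nn


class Critic(nn.Module):
    def __init__(self, d_state: int, d_att: int, d_frag: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_state + d_att + d_frag + d_att, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state_emb, mol_att_emb, frag_emb, frag_att_emb):
        # FFREED: embeddings of the *chosen* attachment points and fragment.
        # (The original FREED fed probability vectors over attachment points
        # here, so the chosen points were invisible to the critic.)
        x = torch.cat([state_emb, mol_att_emb, frag_emb, frag_att_emb], dim=-1)
        return self.mlp(x)
```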

After fixing all the bugs, the model successfully optimizes the reward, which is clearly visible in the training curve; see Fig. 3.

Fig. 3. Average reward (docking score) per episode during training for the protein USP7. The shaded region denotes the 95% confidence interval.

In addition, in ablations we investigated how fixing each individual bug in the original FREED model affects the quality of generation. It turned out that fixing any single bug in isolation does not lead to a significant improvement in quality; all the errors must be fixed together.

FREED++

In the next step, we significantly accelerated the FFREED model and reduced its number of trainable parameters. We called the resulting model FREED++. Let me describe, in order, what exactly we did.

Policy architecture. Let us consider in more detail how the policy is structured in the FREED and FFREED models. Action selection is autoregressive, and each component of the action a_i is selected by a separate neural network. These networks are arranged similarly; their scheme is shown in Fig. 4.

Fig. 4. Scheme for calculating the conditional probability of actions for FFREED and FREED++.

Each policy component computes a probability distribution over the set of available options, subject to some condition. For example, for a_1 it computes the probability of choosing each attachment point on the molecule assembled up to the current step. First, the embedding associated with each option is merged with the embedding of the condition, e.g., the embeddings are concatenated (in the diagram, the merge function is denoted by f_{\phi}). Then the embedding of the conditioned option is processed by a multilayer perceptron f_{\psi}: \mathbb{R}^{d} \xrightarrow{} \mathbb{R} to obtain the corresponding logit. Finally, the conditional distribution over the options is computed with the SoftMax operator (in practice, the Gumbel-Softmax trick is used, since the agent is trained with SAC).
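
Below is a schematic sketch of one such policy component: each option embedding is merged with the condition embedding, a small MLP produces a logit per option, and Gumbel-Softmax yields a differentiable (near) one-hot choice. The names `merge` (for f_{\phi}) and `mlp` (for f_{\psi}), as well as the shapes, are illustrative assumptions rather than the actual implementation.

```python
# Schematic sketch of one policy component (illustrative names and shapes).
import torch
import torch.nn.functional as F


def select_option(option_embs, cond_emb, merge, mlp, tau: float = 1.0):
    # option_embs: (N, d_opt); cond_emb: (d_cond,)
    cond = cond_emb.expand(option_embs.size(0), -1)          # repeat condition per option
    merged = merge(option_embs, cond)                        # f_phi: (N, d)
    logits = mlp(merged).squeeze(-1)                         # f_psi: (N,)
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)   # differentiable sample
    chosen_emb = one_hot @ option_embs                       # embedding of the chosen option
    return one_hot, chosen_emb
```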

Merge functions. A merge function is needed to combine several embeddings into one. In the FREED and FFREED models, Multiplicative Interactions (MI) are used as merge functions. Let x \in \mathbb{R}^{d_1}, z \in \mathbb{R}^{d_2} be the embeddings to be combined. Then the MI layer applied to x and z looks like this:

\mathbf{f}^{MI}_{\phi_i}(x, z) = z^T \mathbf{W}_i x + \mathbf{U}_i z + \mathbf{V}_i x + \mathbf{b}_i,

where \mathbf{W} \in \mathbb{R}^{d_2 \times d_3 \times d_1} is a trainable three-dimensional tensor, \mathbf{U} \in \mathbb{R}^{d_3 \times d_2} and \mathbf{V} \in \mathbb{R}^{d_3 \times d_1} are trainable matrices, and \mathbf{b} \in \mathbb{R}^{d_3} is a trainable bias vector.

Note that the MI layer involves computing a bilinear form, which is expensive in both time and memory.
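
For illustration, here is a minimal PyTorch sketch of such an MI layer (dimensions and initialization are illustrative): the einsum implements the bilinear term z^T \mathbf{W} x, whose weight tensor alone has d_1 d_2 d_3 parameters.

```python
# Minimal sketch of a Multiplicative Interactions merge layer (illustrative).
import torch
import torch.nn as nn


class MultiplicativeInteraction(nn.Module):
    def __init__(self, d1: int, d2: int, d3: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d2, d3, d1) * 0.01)  # 3D bilinear tensor
        self.U = nn.Linear(d2, d3, bias=False)
        self.V = nn.Linear(d1, d3, bias=False)
        self.b = nn.Parameter(torch.zeros(d3))

    def forward(self, x, z):
        # x: (B, d1), z: (B, d2) -> (B, d3); the einsum is the bilinear form z^T W x.
        bilinear = torch.einsum("bi,ijk,bk->bj", z, self.W, x)
        return bilinear + self.U(z) + self.V(x) + self.b
```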

FFREED speedup. Profiling the computation time of different parts of the policy, we found that most of the time is spent on computing the probability distribution over the set of fragments. This is explained by three factors:

1) MI layers are used as merge functions;

2) input embeddings have high dimensionality (1024);

3) the merge function has to be computed N times, where N is the size of the fragment vocabulary (N = 91 in the original article).

If the set of options is fixed, the distribution over the options can be obtained in an alternative way. If N is the number of options, the logits can be obtained directly by applying a multilayer perceptron f_{\psi}: \mathbb{R}^{d} \xrightarrow{} \mathbb{R}^{N} to the condition embedding. In this case, the merge function does not need to be computed at all, and f_{\psi} is evaluated only once instead of the original N times.

In the FREED++ model, we replaced the MI layer with a computationally cheaper and commonly used combination of concatenation and a linear layer:

\mathbf{f}^{CAT}_{\phi_i}(x, z) = \mathbf{Q}_i [x; z] + \mathbf{b}_i,

where [x; z] denotes the concatenation of x and z, and \mathbf{Q}_i is a trainable matrix.
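
A sketch of the two simplifications discussed above is given below: the concatenation-based merge and, for the fixed fragment vocabulary, a single MLP that outputs all N logits at once. Names and sizes are illustrative.

```python
# Sketch of the FREED++ simplifications (illustrative names and sizes).
import torch
import torch.nn as nn


class ConcatMerge(nn.Module):
    # f^CAT(x, z) = Q [x; z] + b
    def __init__(self, d1: int, d2: int, d3: int):
        super().__init__()
        self.proj = nn.Linear(d1 + d2, d3)

    def forward(self, x, z):
        return self.proj(torch.cat([x, z], dim=-1))


# Fragment selection without any merge: one MLP call instead of N merge calls.
n_fragments, d_cond = 91, 1024
fragment_head = nn.Sequential(
    nn.Linear(d_cond, 256), nn.ReLU(), nn.Linear(256, n_fragments)
)
cond_emb = torch.randn(1, d_cond)       # condition embedding (e.g., from step 1)
logits = fragment_head(cond_emb)        # (1, N) logits over the fragment vocabulary
```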

The final FREED++ model obtained after all these simplifications turned out to be 8.5 times faster and 22 times less memory-intensive than FFREED. Importantly, the model was accelerated without any loss in generation quality.

Table 1. Comparison of FFREED and FREED++ speed. Time denotes the number of seconds for one step of gradient descent on a batch of size 100.

Comparison with baselines

After fixing the errors in the source code and optimizing the model, we compared our result with other existing models on the problem of generating high-affinity molecules. In the original work, the comparison protocol between the baselines and the FREED model is inconsistent, and the set of test proteins is too small for a reliable comparison. In our work, we expanded the set of test proteins from 3 to 6, performed a hyperparameter search for the baselines, added two methods for comparison (a combinatorial generator and Pocket2Mol, which is still considered a strong baseline today), and increased the computational budget for the baseline models (in the original paper, the REINVENT and MolDQN models were significantly undertrained).

Table 2. Comparison of FREED, FFREED, and FREED++ with baselines on the affinity optimization problem for the USP7 protein.

Our results show that the FFREED and FREED++ models are capable of generating molecules with higher affinity values than existing approaches.

Conclusion

So, it seems we managed to fix, simplify, and improve the FREED model quite successfully. However, it would be disingenuous to claim that we have reached the limit of perfection in the task of generating drug molecules with deep learning methods.

As I mentioned, RL is just one method. New work in molecular generation is published quarterly, and today the arsenal of tools for drug discovery is quite large.

In my opinion, the most interesting at the moment are diffusion models, which reconstruct the joint distribution of small-molecule atoms and the protein. However, they are limited by existing datasets and are also unable (without additional add-ons) to optimize the binding energy, which is a key property of a drug. In addition, most of them work only with a small part (pocket) of the protein. A more realistic approach should consider the entire protein, as is done, for example, in AlphaFold 3.

The second promising class of methods is generative flow networks (GFlowNets). They can be considered a more efficient version of reinforcement learning that allows generating more diverse molecules with high values of the properties of interest. However, these methods are not without drawbacks: developing a reliable reward function that reflects all the desired properties of a molecule is a difficult task. The most promising direction, in my opinion, is the development of methods that combine these two approaches.

The implementation of FFREED and FREED++ is available on our GitHub; we will be glad to receive your stars and forks!
