Chemical IT Centaur – Chemoinformatics

In the 21st century, we are witnessing the rapid development of multidisciplinary sciences in which information technology plays a key role. One area where these technologies are being actively applied is chemoinformatics. According to the definition given by J. Gasteiger, chemoinformatics is the application of computer science methods to solve chemical problems [1].

Why is chemoinformatics needed?

Currently, according to PubChem, one of the largest chemical databases, more than 100 million organic molecules are known to mankind [2]. That seems like a very large number, but the size of the entire chemical space is estimated at a staggering 10^180 substances that could potentially be synthesized [3]. This is a hundred orders of magnitude greater than the number of atoms in the universe (commonly estimated at around 10^80). Having methods for navigating both known and unknown regions of chemical space is critically important, because the unknown 10^180 substances probably include many useful compounds: new drugs, dyes, agrochemicals, fragrances, materials for electronics, and other materials important to people.

The scale of even the part of chemical space we know requires the use of big data analysis methods. This was one of the reasons for the emergence of a discipline at the intersection of chemistry and IT – chemoinformatics. But what exactly does chemoinformatics do in an applied sense? The main directions are shown in the diagram below.

1. Prediction of properties

Many properties are successfully predicted using machine learning and/or computer modeling methods. These include various physicochemical parameters (solubility, lipophilicity), biological activity, reactivity, and many others [4, 5, 6, 7]. In this article, we will consider toxicity prediction because of the importance of this parameter. Every year, several million new substances are synthesized, but a safety profile is established for only about 10 thousand. This means that in a whole year we establish the toxicity of roughly as many compounds as are synthesized in a single day.

Experimental toxicity testing of new molecules lags seriously behind the rate of their synthesis. But even if all of humanity's scientific and technical potential were directed at toxicological profiling of every new substance, could we do it? Let's look at an example. Registering a pesticide in the United States requires about 80 toxicity tests, with a total cost estimated at more than $20 million [8]. Toxicity is a multifaceted concept whose assessment requires dozens of experiments. The cost of some of them, according to the US Environmental Protection Agency (EPA):

  • Acute toxicity (fish): $17,000

  • Chronic toxicity (daphnia): $180,000

  • Carcinogenicity (rats and mice): $2,100,000

Given these figures, experimentally determining the toxicity of every new compound does indeed seem unrealistic because of the colossal resources the task would require.
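
The figures above can be combined into a rough back-of-the-envelope estimate. The sketch below is only illustrative: it assumes that "several million" means 3 million substances per year, and that the pesticide registration cost of about $20 million [8] can stand in for the cost of a full toxicity profile of any substance. Both are assumptions made for the sake of the arithmetic, not figures from the sources.

```python
# Back-of-the-envelope estimate using figures quoted in the text.
# ASSUMPTION: "several million" new substances per year is taken as 3 million.
# ASSUMPTION: the pesticide test battery cost [8] is used as a proxy for the
# cost of fully profiling any new substance.
NEW_SUBSTANCES_PER_YEAR = 3_000_000   # assumed value, not from the source
PROFILED_PER_YEAR = 10_000            # "about 10 thousand" (from the text)
FULL_PROFILE_COST_USD = 20_000_000    # "more than $20 million" (from the text)


def synthesis_days_covered(profiled: int, synthesized_per_year: int) -> float:
    """Days of synthesis output that one year of profiling covers."""
    synthesized_per_day = synthesized_per_year / 365
    return profiled / synthesized_per_day


def full_profiling_cost(substances: int, cost_per_substance: int) -> int:
    """Cost of running the full test battery on every substance."""
    return substances * cost_per_substance


days = synthesis_days_covered(PROFILED_PER_YEAR, NEW_SUBSTANCES_PER_YEAR)
cost = full_profiling_cost(NEW_SUBSTANCES_PER_YEAR, FULL_PROFILE_COST_USD)
print(f"One year of profiling covers about {days:.1f} days of synthesis")
print(f"Full profiling of one year's output: about ${cost:.1e}")
```

Under these assumptions, a year of profiling covers little more than a day of synthesis, and profiling everything would cost tens of trillions of dollars per year.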

In addition, toxicity is in most cases assessed on laboratory animals, almost 200 million of which are used worldwide each year [9]. There is even a series of animated films dedicated to their protection, of which Ralph the rabbit became particularly well known. It is ethically preferable to avoid testing on animals, a principle enshrined, for example, in the EU REACH regulation [10].

Modern approaches offer alternatives that reduce the cost of profiling by orders of magnitude and do not raise the ethical issues associated with animal experiments: toxicity can be predicted in silico, that is, using computational technologies. Of course, predictive modeling carries a certain error, but at the very least it helps to prioritize research and minimize the number of animals used. Moreover, these methods reduce the risks for volunteers participating in clinical trials.

There are already a number of digital tools that can predict the toxicity of substances, including methods developed by Russian scientists [11, 12]. The Syntelly digital platform, which is based on these methods, can predict more than 40 toxicity parameters [13].
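
To give a feel for what in silico prediction means in practice, here is a deliberately simple sketch of one classic idea: similar molecules tend to have similar properties, so a label can be borrowed from the most similar known compound. This is a nearest-neighbour similarity search, not the method used in [11, 12] or on the Syntelly platform. The fingerprints (sets of "on" bit indices) and the toxicity labels are invented for illustration.

```python
# Toy read-across: predict a toxicity label from the most similar known compound.
# ASSUMPTION: fingerprints are hand-made sets of "on" bit indices and the
# labels are invented; real tools compute descriptors from molecular structure.

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def predict_by_nearest_neighbour(query: set, training: list) -> tuple:
    """Return (label, similarity) of the training compound closest to query."""
    best_label, best_sim = None, -1.0
    for fingerprint, label in training:
        sim = tanimoto(query, fingerprint)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Hypothetical training set: (fingerprint, experimentally known label).
TRAINING = [
    ({1, 4, 7, 9}, "toxic"),
    ({2, 3, 5, 8}, "non-toxic"),
    ({1, 4, 6, 9}, "toxic"),
]

label, similarity = predict_by_nearest_neighbour({1, 4, 7, 8}, TRAINING)
print(label, round(similarity, 2))
```

The similarity score doubles as a crude confidence measure: when even the best match is weak, the query lies outside the chemical space covered by the training data and the prediction should not be trusted.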

2. Design of substances and materials

There are tools that allow you to generate a molecule with specified properties. This has many uses, but in this article we will focus on medicines. Drug development is a complex process that takes an average of 10 to 15 years and can cost up to $3 billion [14]. However, the use of chemoinformatics methods can significantly reduce both time and material costs (Fig. 1). The R&D departments of large pharmaceutical companies can no longer function without specialists such as cheminformaticians, computational chemists, and digital chemists. The reason is that chemoinformatics greatly streamlines the solution of basic pharmaceutical problems: searching for active molecules and determining toxicity, solubility in water, permeability through the cell membrane, metabolism, interaction with blood plasma proteins, and so on. Computer-aided drug design is developing rapidly, and the following two examples are especially impressive:

  • In 2023, Insilico Medicine announced the successful completion of preclinical studies of the molecule INS018-055 for the treatment of idiopathic pulmonary fibrosis. Interestingly, AI not only generated the active molecule but also identified the therapeutic target. To date, the molecule has successfully completed phase I clinical trials, confirming its safety [15]. Phase II clinical trials are ongoing.

  • In 2020, the British pharmaceutical company Exscientia announced the successful completion of preclinical studies of the molecule DSP-1181 for the treatment of obsessive-compulsive disorder (OCD) [16].

“This year was the first year that a drug was developed using AI, but by the end of the decade, all new drugs could be created using algorithms,” says Andrew Hopkins, CEO of Exscientia.

Figure 1. Drug development strategies.

Another notable program is PASS, developed by Russian scientists, which predicts a whole range of parameters characterizing the biological activity of compounds [17, 18]. Also worth noting is the USPEX method, which predicts the crystal structure of substances at different temperatures and pressures from the chemical composition of the material alone [19].

3. Identification of substances

This area deals with processing the results obtained when substances and their mixtures are studied by physicochemical methods of analysis. Strictly speaking, it is called chemometrics and is considered an offshoot of chemoinformatics. For clarity, let's look at an example (Fig. 2). Suppose you want to find out whether the soil at your dacha (country house) contains pesticides. You collect a soil sample and send it to a laboratory, where it undergoes special sample preparation and is analyzed on an HPLC-MS instrument. First, a chromatograph (HPLC) separates the mixture so that the substances enter the next unit of the instrument, a mass spectrometric detector, one after another. The detector provides some information about each substance in the sample in the form of mass spectra. At this stage, it is not yet possible to say what the soil contains: the mass spectra produced by the instrument still need to be interpreted. In some cases, only a qualified specialist can do this, but a more efficient approach is to have a computer automatically compare the mass spectrum of each substance in your sample with annotated mass spectra from a database (which, by the way, requires data curation by cheminformaticians). If the spectra match, the substance has been identified.

Figure 2. Small molecule identification algorithm
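
The comparison step described above can be sketched in a few lines. This is a minimal illustration, assuming that each spectrum is stored as a mapping from an integer m/z value to a peak intensity and that similarity is measured with the cosine score, a common choice for spectral matching. The compound names, peak lists and the 0.8 threshold are invented. Real library search software adds peak alignment tolerances, intensity weighting and more.

```python
import math

# Toy spectral library search. ASSUMPTION: spectra are dicts mapping an
# integer m/z value to intensity; names, peaks and the threshold are invented.

def cosine_similarity(spec_a: dict, spec_b: dict) -> float:
    """Cosine similarity between two spectra treated as sparse vectors."""
    dot = sum(inten * spec_b.get(mz, 0.0) for mz, inten in spec_a.items())
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(query: dict, library: dict, threshold: float = 0.8) -> tuple:
    """Return (name, score) of the best library match, or (None, score)."""
    best_name, best_score = None, 0.0
    for name, reference in library.items():
        score = cosine_similarity(query, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score  # nothing in the library is close enough
    return best_name, best_score


# Hypothetical annotated library and a measured spectrum from the sample.
LIBRARY = {
    "compound_A": {77: 40.0, 105: 100.0, 182: 60.0},
    "compound_B": {43: 100.0, 58: 30.0, 71: 20.0},
}

measured = {77: 35.0, 105: 100.0, 182: 55.0}
name, score = identify(measured, LIBRARY)
print(name, round(score, 3))
```

Returning `None` when no reference spectrum is similar enough corresponds to the common situation in which the substance in the sample is simply absent from the database.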

But there is one problem: databases are not comprehensive. For example, one of the largest, the NIST database, contains mass spectra for 350,000 substances, which is only about 0.35% of the molecules known to us (~10^8) [20]. As a result, we observe a striking lack of knowledge about the world around us:

  • In an analysis of household dust, only 33 substances out of 5,000 were identified [21]

  • In an analysis of wastewater, only 1.2% of substances were identified [22]

In fact, we still don't really know what surrounds us. However, there is hope of getting closer to solving this problem. A number of chemoinformatic tools have been developed that help determine the structure of a substance from its mass spectrum [23, 24, 25]. It should be said that the result is only a prediction, and in some cases it is fundamentally impossible to determine the structure from the mass spectrum alone. Nevertheless, it is a useful tool when there are no alternatives. There is also a reverse approach, which involves predicting mass spectra and thereby expanding the databases against which analyzed samples are compared. This problem can be solved using methods developed by Russian scientists, which also make it possible to predict other spectral data that are no less important for identifying substances (NMR, IR) [13].

Another example that matters to all of us is medicines. How can we tell whether the actual composition of a drug corresponds to the declared one? In practice, IR spectroscopy is widely used to solve this problem. For rapid analysis, portable IR spectrometers weighing up to 2 kg have been developed [26]. The drug can be analyzed non-destructively, directly in the blister pack, and the analysis takes 5 to 10 seconds. But, as with the mass spectrometry described earlier, the resulting IR spectra need to be interpreted, and here chemometrics comes to the rescue. Such a hybrid technology allows quick and effective quality control of drugs in production, as well as monitoring of counterfeit and substandard drugs in circulation.
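
As a rough illustration of the chemometric step, the sketch below compares a measured spectrum with a reference spectrum of the genuine product using the Pearson correlation coefficient. It assumes that both spectra are absorbance values sampled on the same wavenumber grid. The numbers and the 0.95 acceptance threshold are invented, and real chemometric workflows typically rely on multivariate models with spectral preprocessing rather than a single correlation.

```python
import math

# Toy authenticity check. ASSUMPTION: spectra are absorbance lists sampled on
# the same wavenumber grid; the values and the 0.95 threshold are invented.

def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("spectra must share the same grid (length >= 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        return 0.0  # a flat spectrum carries no shape information
    return sum(a * b for a, b in zip(dx, dy)) / denom


def matches_reference(measured: list, reference: list,
                      threshold: float = 0.95) -> bool:
    """True if the measured spectrum correlates strongly with the reference."""
    return pearson(measured, reference) >= threshold


# Hypothetical spectra: reference product, a genuine tablet, a suspect tablet.
REFERENCE = [0.10, 0.35, 0.80, 0.40, 0.15, 0.60, 0.20]
GENUINE = [0.12, 0.33, 0.78, 0.42, 0.14, 0.58, 0.22]
SUSPECT = [0.50, 0.20, 0.15, 0.70, 0.65, 0.10, 0.45]

print(matches_reference(GENUINE, REFERENCE))
print(matches_reference(SUSPECT, REFERENCE))
```

Because correlation looks at the shape of the spectrum rather than its absolute intensity, small differences in signal level between measurements do not by themselves cause a genuine tablet to be rejected.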

In conclusion, chemoinformatics emerged as a response to the need for efficient analysis of huge volumes of chemical data, which had become impossible using exclusively experimental methods because they are time-consuming, labor-intensive, and expensive. Despite controversy over the accuracy of in silico methods, the achievements of chemoinformatics, especially in the field of drug development, are undeniable. We can expect to see many breakthroughs in the molecular sciences in the near future through the use of predictive modeling techniques.

References

1. Gasteiger, J. (2016). Chemoinformatics: Achievements and challenges, a personal view. Molecules, 21(2), 151.

2. https://pubchem.ncbi.nlm.nih.gov/ (accessed 14.06.2024)

3. Restrepo, G. (2022). Chemical space: limits, evolution and modeling of an object bigger than our universal library. Digital Discovery, 1(5), 568-585.

4. Osipenko, S., Bashkirova, I., Sosnin, S., Kovaleva, O., Fedorov, M., Nikolaev, E., Kostyukevich, Y. (2020). Machine learning to predict retention time of small molecules in nano-HPLC. Analytical and Bioanalytical Chemistry, 412, 7767-7776.

5. Karlov, D. S., Sosnin, S., Fedorov, M. V., Popov, P. (2020). graphDelta: MPNN scoring function for the affinity prediction of protein–ligand complexes. ACS Omega, 5(10), 5150-5159.

6. Dmitriev, A. V., Rudik, A. V., Karasev, D. A., Pogodin, P. V., Lagunin, A. A., Filimonov, D. A., Poroikov, V. V. (2021). In silico prediction of drug–drug interactions mediated by cytochrome P450 isoforms. Pharmaceutics, 13(4), 538.

7. Sosnina, E. A., Sosnin, S., Fedorov, M. V. (2023). Improving multi-task learning through data enrichment: application for drug discovery. Journal of Computer-Aided Molecular Design, 37(4), 183-200.

8. https://www.epa.gov/pesticide-registration/cost-estimates-studies-required-pesticide-registration (accessed 14.06.2024)

9. Taylor, K., Alvarez, L. R. (2019). An estimate of the number of animals used for scientific purposes worldwide in 2015. Alternatives to Laboratory Animals, 47(5-6), 196-213.

10. Lilienblum, W., Dekant, W., Foth, H., Gebel, T., Hengstler, J. G., Kahl, R., et al. (2008). Alternative methods for safety studies in experimental animals: role in the risk assessment of chemicals under the new European Chemicals Legislation (REACH). Archives of Toxicology, 82, 211-236.

11. Sosnin, S., Karlov, D., Tetko, I. V., Fedorov, M. (2018). Comparative study of multitask toxicity modeling in a broad chemical space. Journal of Chemical Information and Modeling, 59(3), 1062-1072.

12. Sosnin, S., Misin, M., Fedorov, M. (2017). Predicting bioaccumulation using molecular theory: A machine learning approach. arXiv preprint arXiv:1710.08174.

13. https://syntelly.ru/ (accessed 14.06.2024)

14. Wouters, O. J., McKee, M., Luyten, J. (2020). Estimated research and development investment needed to bring a new medicine to market, 2009-2018. JAMA, 323(9), 844-853.

15. https://insilico.com/blog/first_phase2 (accessed 14.06.2024)

16. https://www.frontierip.co.uk/portfolio-companies/exscientia-worlds-first-trials-of-new-drug-candidate-created-by-artificial-intelligence (accessed 14.06.2024)

17. Filimonov, D. A., Lagunin, A. A., Gloriozova, T. A., Rudik, A. V., Druzhilovskii, D. S., Pogodin, P. V., Poroikov, V. V. (2014). Prediction of the biological activity spectrum of organic compounds using the PASS online web resource. Chemistry of Heterocyclic Compounds, 50, 444-457.

18. Rudik, A. V., Dmitriev, A. V., Lagunin, A. A., Filimonov, D. A., Poroikov, V. V. (2019). PASS-based prediction of metabolites detection in biological systems. SAR and QSAR in Environmental Research, 30(10), 751-758.

19. https://uspex-team.org/ru (accessed 14.06.2024)

20. https://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:start (accessed 14.06.2024)

21. Rager, J. E., Strynar, M. J., Liang, S., McMahen, R. L., Richard, A. M., Grulke, C. M., Sobus, J. R. (2016). Linking high resolution mass spectrometry data with exposure and toxicity forecasts to advance high-throughput environmental monitoring. Environment International, 88, 269-280.

22. Schymanski, E. L., Singer, H. P., Longrée, P., Loos, M., Ruff, M., Stravs, M. A., et al. (2014). Strategies to characterize polar organic contamination in wastewater: exploring the capability of high resolution mass spectrometry. Environmental Science & Technology, 48(3), 1811-1818.

23. Kangas, L. J., Metz, T. O., Isaac, G., Schrom, B. T., Ginovska-Pangovska, B., Wang, L., Miller, J. H. (2012). In silico identification software (ISIS): a machine learning approach to tandem mass spectral identification of lipids. Bioinformatics, 28(13), 1705-1713.

24. Krettler, C. A., Thallinger, G. G. (2021). A map of mass spectrometry-based in silico fragmentation prediction and compound identification in metabolomics. Briefings in Bioinformatics, 22(6), bbab073.

25. Kostyukevich, Y., Sosnin, S., Osipenko, S., Kovaleva, O., Rumiantseva, L., Kireev, A., Zherebker, A., Fedorov, M., Nikolaev, E. N. (2022). PyFragMS: A Web Tool for the Investigation of Collision-Induced Fragmentation Pathways. ACS Omega, 7(11), 9710-9719.

26. Balyklova, K. S., Rodionova, O. E., Titova, A. V., Sadchikova, N. P. (2015). Examination of tablets using a portable and laboratory NIR spectrometer. Bulletin of Roszdravnadzor, (4), 65-71.
