Artificial intelligence will create the first corpus of ancient Slavic manuscripts
“In days of doubt, in days of painful reflection on the fate of my homeland”, days that have been especially hard in recent weeks, what is our support and mainstay? 🙂 That’s right: the great and mighty Russian language. And while exchange rates and the pandemic keep an unrelenting grip on public attention, scientists do not stop working. Who is creating the corpus, a unique “DBMS” of ancient Slavic manuscripts, and why: read on in our news.
A collaboration of scientists from NUST “MISiS”, the V. V. Vinogradov Russian Language Institute of the Russian Academy of Sciences and HSE, supported by the Commission for Work with Universities and the Scientific Community under the Diocesan Council of Moscow, has launched a large-scale project to build a unique database of ancient Slavic manuscripts, a corpus, using artificial intelligence and machine learning technologies. The Old Slavic corpus will give linguists and historians a powerful tool for studying all modern national Slavic languages and cultures and will become a unique key to understanding their heritage.
A corpus is a structured database of a language: an information and reference system built on a collection of texts in that language in electronic form. It is a selected and specially processed (annotated) set of texts used as the basis for studying the vocabulary and grammar of the language.
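To make the idea of an annotated text collection concrete, here is a minimal sketch in Python. It is not the project's actual data model: the `Token` and `Document` classes, the `find_by_lemma` helper, the tag names and the two sample documents are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # the word as it is written in the text
    lemma: str  # its dictionary form
    pos: str    # a part-of-speech tag (tag set is an assumption)


@dataclass
class Document:
    title: str
    century: int
    tokens: list[Token]


def find_by_lemma(corpus: list[Document], lemma: str) -> list[tuple[str, str]]:
    """Return (document title, written form) for every token with this lemma."""
    return [
        (doc.title, tok.form)
        for doc in corpus
        for tok in doc.tokens
        if tok.lemma == lemma
    ]


# Invented sample data, for illustration only.
SAMPLE_CORPUS = [
    Document("Sample manuscript A", 11, [
        Token("слово", "слово", "NOUN"),
        Token("бѣ", "быти", "VERB"),
    ]),
    Document("Sample manuscript B", 14, [
        Token("словеса", "слово", "NOUN"),
    ]),
]

hits = find_by_lemma(SAMPLE_CORPUS, "слово")
```

Because every written form carries its lemma, a single query finds both spellings of the same word across documents, which a plain collection of scanned pages could not do.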
Ancient Slavic texts comprise a variety of manuscripts from the 11th to 17th centuries and form the foundation of all modern national Slavic languages and cultures. Building a systematic corpus of the language involves laborious, delicate and painstaking work that requires the combined efforts of professionals from various fields and, according to the scientists, is a task of national importance.
Hieromonk Rodion (Larionov), Deputy Chairman of the Commission for Work with Universities and the Scientific Community under the Diocesan Council of Moscow:
“At present there is no corpus of handwritten Slavic texts, and scholars from various disciplines regard its creation as an important task. The bulk of the Old Slavic texts (Old Russian, Bulgarian and Serbian) of the 11th to 17th centuries that have come down to us consists of several thousand liturgical manuscripts. Language changes from century to century. It is important for scholars to understand, first, why these changes occur, what drives them and what influences them, and second, what these changes led to. Analyzing and systematizing by hand the volume of data that ancient Slavic manuscripts represent would be an astronomical task stretching over centuries, especially since very few professionals are capable of doing this work. Text recognition and digitization technologies, machine translation and AI will make it possible to carry out this important work in the foreseeable future.”
Artificial intelligence will take in this entire gigantic array of data, systematize it and produce algorithms for applying linguistic markup (annotation), the main characteristic of a corpus. It is this markup that distinguishes a corpus from a simple library.
Projects that apply digital approaches to the analysis of cultural heritage are being actively developed in European countries and are an excellent example of interdisciplinary collaboration.
With regard to the written monuments of a language, two principal areas of work can be identified: converting scanned images into “machine-readable” form, and building language models that simplify the analysis and understanding of texts. For Slavic texts, whose letterforms (graphemes) are ornate and make extensive use of diacritics, no such systematic work has yet been undertaken.
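One small illustration of why diacritics complicate machine-readable text: the same word can be stored with or without combining marks, so a search index usually needs a normalized key. The sketch below uses only Python's standard `unicodedata` module; the `strip_diacritics` helper is a hypothetical name, and nothing here is claimed to be the method the project uses.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Build a search key by removing combining marks (Unicode category Mn)."""
    # NFD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except nonspacing combining marks.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into the canonical composed form.
    return unicodedata.normalize("NFC", kept)


# U+0483 is COMBINING CYRILLIC TITLO, the mark written over abbreviated words.
abbreviated = "бг\u0483ъ"
key = strip_diacritics(abbreviated)
```

Such a key is lossy by design: it also turns “й” into “и”, because the breve is a combining mark too. It is therefore suitable for matching and lookup, while the original spelling has to be stored alongside it.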
Andrei Ustyuzhanin, leading expert at the MegaScience Center for Infrastructure Interaction and Partnership at NUST “MISiS” and head of the Research and Training Laboratory for Big Data Analysis Methods at the Higher School of Economics:
“Natural language is a key testing ground for the development of AI technologies. Thanks to these technologies, machine translation, the construction of dialogue systems and the interpretation of natural-language texts have recently received a powerful boost. In a sense, this project is a bridge from the culture of the past to the technologies of the future. In our experience of interdisciplinary projects, getting the most advanced technology matters less than laying the foundations for people to communicate with each other: language specialists with artificial intelligence specialists.”
The first stage of the project will be the digitization and annotation of the body of Old Slavic Menaia of the 11th to 17th centuries in Old Russian, Bulgarian and Serbian. Menaia are liturgical service books containing the order of services for every day of the church year; their manuscripts are held in the collections of the State Historical Museum, the Russian National Library and the Russian State Library, the Russian State Archive of Ancient Acts, and the Holy Trinity St. Sergius Lavra.