Scientists for the First Time Create an Interactive Database of Ancient Slavic Texts Using Artificial Intelligence Technologies

March 18,2020 yearOther, Science, Cooperation

A collaboration of scientists from the V.V. Vinogradov Russian Language Institute of the Russian Academy of Sciences, NUST MISIS, HSE, with the support of the Commission for Work with Universities and the Scientific Community at the Diocesan Council of Moscow, has launched a large-scale project to create a unique base of ancient Slavic manuscript texts — the corpus — using artificial intelligence and machine learning technologies. Creating a corpus of the ancient Slavic language will give linguistic researchers and historians a powerful tool for studying all modern national Slavic languages and cultures and will be a unique key to understanding their heritage.

A corpus is a structured language database, an information and reference system based on a collection of texts in a particular language in electronic form. It is a selected and specially processed (marked out) set of texts that are used as the basis for the study of the vocabulary and grammar of the language.

Ancient Slavic texts are a variety of manuscript monuments of the XI — XVII centuries, the foundation of all modern national Slavic languages and cultures. The creation of the system corpus of the language is associated with laborious, delicate and painstaking work, requiring the combined efforts of professionals from various fields and, according to the scientists, is a task of a national nature.

Hieromonk Rodion (Larionov), Deputy Chairman of the Commission for Work with Universities and the Scientific Community at the Diocesan Council of Moscow:

“At present, there is no corpus of handwritten Slavic texts, and its creation is considered as an important task by scholars of various disciplines. The main volume of the Old Slavic is the Old Russian, Bulgarian, Serbian texts of the XI — XVII centuries that have been preserved to this day. There are several thousand liturgical manuscripts. Language changes from century to century. Scientists need to understand why these changes occur, what is the cause for them, what affects their occurrence, and also, what these changes entailed. If we analyze and systematize all the data that ancient Slavic manuscripts represent with only human resources, it is an enormously hard work that would last for centuries, especially considering that there are very few professionals who are capable of doing this work. Recognition and digitization technologies for texts, machine translation, and AI will allow this important work to be carried out in the foreseeable future.”

Artificial intelligence will cover this entire giant data array, systematize and create algorithms for arranging linguistic markup — the main characteristic of the corpus. It is what distinguishes the corpus from a simple library.

Projects on the application of digital approaches to the analysis of cultural heritage are actively developing in European countries and are an excellent example of interdisciplinary interaction. Concerning linguistic monuments, two principal areas of work can be noted: the conversion of the scanned images into a “machine-readable” form and the creation of language models that simplify the analysis and understanding of texts. Such systemic developments have not yet been undertaken for the Slavic texts, which are characterized by the florid spelling of letters (graphemes) and widespread use of diacritics.

Andrei Ustyuzhanin, leading expert at the NUST MISIS MegaScience Center for Infrastructure Interaction and Partnership, head of the Research and Training Laboratory for Big Data Analysis Methods at the Higher School of Economics:

“Natural language is a key training ground for the development of AI technology. It is thanks to these technologies that the machine translation problems, the construction of the dialogue systems, and the tasks of text interpretation in a natural language have received a powerful impetus recently. In a sense, such a project is a bridge from the culture of the past to the technologies of the future. In our experience of interdisciplinary projects, the most important part is not to get the most advanced technology, but to lay the foundations for people to communicate with each other — language specialists with artificial intelligence specialists.”

The first stage of the project will be the digitization and marking of the complex of ancient Slavic liturgical church books of the XI-XVII centuries in Old Russian, Bulgarian and Serbian containing the schedule of services for all days of the church year, manuscripts of which are stored in the collections of the State Historical Museum, the Russian National and State Libraries, the Russian State Archive of Ancient Acts, Holy Trinity St.Sergius Lavra.