Last fall, after a long break, I taught the course “Introduction to Information Retrieval” again, this time for students of the JetBrains academic programs at HSE St. Petersburg and ITMO. The comeback continued: in the winter I gave a mini-course for Tinkoff employees, with an overview of information retrieval models and approaches to evaluation, and in the spring I gave an overview lecture on information retrieval as part of a course on natural language processing. In this article, I will briefly describe the course and its “historical background”.
In the 2000s, the Web developed rapidly, and search engines along with it. Information retrieval received a powerful impetus in the form of challenging problems, money, and an influx of specialists from related fields. At that point, the main textbooks on information retrieval were “Managing Gigabytes” by Ian Witten et al. and “Modern Information Retrieval” by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, both released in the 1990s. An excellent modern textbook appeared in 2008: “Introduction to Information Retrieval” by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze (the second edition was published in 2009). Together with Ilya Segalovich, I took part in translating the book into Russian and editing the translation, which was released in 2010 with the support of Yandex. By then, an active community of people interested in information retrieval had formed around the ROMIP workshop and the RuSSIR schools, and many of them helped translate the terminology. Working on the book helped me understand the discipline better, and since 2009 I have taught a course of the same name at ShAD (the Yandex School of Data Analysis) and Ural University. In the 2014/15 academic year, thanks to a grant from the Dynasty Foundation, I taught the course at ITMO and St. Petersburg State University; then came a long break.
When I was invited to teach the course, I decided to refresh my knowledge and looked through a lot of teaching materials on the topic. An excellent overview of resources can be found in the article “What Should We Teach in Information Retrieval?” by Ilya Markov and Maarten de Rijke. (By the way, Ilya Markov taught a course on information retrieval in St. Petersburg before me, in 2017 and 2018.) In the end, I took as a basis the materials of the Stanford course, whose logic and content I know well (and whose main textbook is the same “Introduction to Information Retrieval”). The course introduces students to basic concepts such as data structures and information retrieval models, indexing and search optimization, and scoring techniques, as well as more advanced topics such as learning to rank, analysis of the Web's link structure, and question answering. As “additions” I included, for example, Russian morphology and snippet generation. The practical part consisted of three assignments: processing a large text collection, indexing and searching the collection with Elasticsearch, and training a ranking function.
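To give a flavor of the "indexing and searching" part, here is a toy inverted index with Boolean AND search in plain Python. It is only an illustrative sketch, not the course assignment (which used Elasticsearch): the tokenizer is deliberately naive, and the three sample documents and all function names are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and extract word characters (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each term to the sorted list of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical mini-collection, invented for the example.
docs = {
    1: "Information retrieval models and evaluation",
    2: "Indexing a large text collection",
    3: "Evaluation of retrieval systems on a text collection",
}
index = build_index(docs)
print(search_and(index, "text collection"))       # [2, 3]
print(search_and(index, "retrieval evaluation"))  # [1, 3]
```

A real search engine adds much more on top of this structure (positional postings, compression, ranking), but the term-to-documents mapping is the core idea.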
The practical assignments were based on the ROMIP By.Web collection and the data from “Internet Mathematics 2009”. This is a good example of how open data helps research and education, even many years after it was created. ROMIP (the Russian Information Retrieval Evaluation Seminar; the acronym comes from the Russian name, “Rossiiskii seminar po Otsenke Metodov Informatsionnogo Poiska”) was organized in 2003. In many ways, ROMIP was modeled on the Text REtrieval Conference (TREC) and saw its task as conducting an open evaluation of information retrieval systems on Russian-language data. By.Web is the largest ROMIP collection: it contains approximately 1.5 million pages from the Belarusian domain .by and, most importantly, relevance judgments for approximately 1,500 search queries. In February 2020, the collection became freely available. The number of labeled queries and the simple procedure for obtaining the data are significant advantages over TREC collections.
Participants of the 2004 ROMIP seminar in Pushchino. Bottom row: Marina Nekrestyanova, Ilya Segalovich, Mikhail Maslov, Igor Nekrestyanov, Max Gubin, Vladimir Pleshko. Top row: Dmitry Pankratov, Lev Gershenzon, Vladislav Shabanov, Alexander Antonov, Andrey Fedorovsky, Boris Dobrov, Elena Kozerenko, Mikhail Ageev.
For the assignment on training a ranking function, we used data from the Internet Mathematics 2009 competition organized by Yandex. The set contains vectors of 245 features and relevance scores for approximately 20,000 query-document pairs. While the course was being prepared, this data, which had disappeared in a series of redesigns of Yandex corporate sites, was found and made available again (special thanks to Natasha Ostasheva). In 2010, Yahoo! held a similar competition with a larger dataset, but that data, again, is not easy to access.
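The shape of such data (a feature vector and a relevance score per query-document pair) can be illustrated with a minimal pointwise learning-to-rank sketch. Everything here is an assumption made for illustration: the data is synthetic, there are 3 features instead of 245, and a linear model fitted by stochastic gradient descent stands in for whatever method a student might actually choose. It is not the competition's data format or baseline.

```python
import random


def predict(w, x):
    """Linear scoring function: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear(rows, n_features, epochs=200, lr=0.05):
    """Pointwise approach: fit weights by SGD on squared error per (query, doc) pair."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for _qid, x, y in rows:
            err = predict(w, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def rank(rows, w):
    """Group pairs by query; return each query's true labels ordered by predicted score."""
    by_query = {}
    for qid, x, y in rows:
        by_query.setdefault(qid, []).append((predict(w, x), y))
    return {
        qid: [y for _score, y in sorted(pairs, key=lambda p: p[0], reverse=True)]
        for qid, pairs in by_query.items()
    }


# Synthetic data (an assumption): 5 queries x 8 documents, 3 features each,
# with relevance generated as a noise-free linear function of the features.
rng = random.Random(0)
true_w = [2.0, 1.0, 0.0]
rows = []
for qid in range(5):
    for _ in range(8):
        x = [rng.random() for _ in true_w]
        rows.append((qid, x, predict(true_w, x)))

w = train_linear(rows, n_features=len(true_w))
ranking = rank(rows, w)
print([round(wi, 3) for wi in w])
print(ranking[0])  # labels of query 0, best-scored document first
```

With real graded labels one would typically evaluate the resulting orderings with a ranking metric and compare pointwise, pairwise, and listwise objectives; the sketch only shows the basic loop of scoring documents and sorting them within each query.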
Additional materials and links can be found on the course page. I would like to thank Vladislav Korablinov, who was the teaching assistant for the course; Ilya Markov, for the opportunity to look at the materials of the previous information retrieval course; and Svyatoslav Demidov (auto.ru), for a guest lecture. Thanks to Yandex, and personally to Mikhail Ageev, Max Gubin, and Igor Nekrestyanov, for maintaining the collections and providing access to them. And, of course, thanks to the students for their curiosity and perseverance.