From time to time we write on Medium about the projects that participants create in our educational programs, for example, how to build a spoken oracle… Today we are ready to share the results of the spring 2020 semester course.
Some data and analytics
This year we broke all records for the number of course participants: at the beginning of February there were about 800 people… To be honest, we were not ready for so many participants, so we worked out many things on the go together with them. But we will write about this next time.
Let’s go back to the participants. Did everyone finish the course? The answer is, of course, obvious. With each new assignment, fewer and fewer people were willing to continue. As a result, whether because of quarantine or for other reasons, only half remained by the middle of the course. Then it was time to choose projects. The participants announced seventy projects. The most popular one was Tweet sentiment extraction: nineteen teams tried to complete this task on Kaggle…
More about the projects presented
Last week we held the final session of the course, where several teams presented their projects. If you missed the open seminar, we have prepared a recording… Below we briefly describe the implemented cases.
Kaggle Jigsaw: Multilingual Toxic Comment Classification
Roman Shchekin (QtRoS), Denis Grushentsev (evilden), Maxim Talimanchuk (mtalimanchuk)
This competition is a continuation of the popular Jigsaw competition on detecting toxic text. In this case, however, training takes place on English data and testing on multilingual data (including Russian). Submissions are scored with the ROC AUC metric. The team took bronze (132nd place out of 1621) with a ROC AUC of ~0.9463. The final model was an ensemble of classifiers:
- XLMRoberta large
- Naive Bayes
- BERT base
- BERT base multilingual
- USE multilingual
XLMRoberta large with a 1024 × 1 linear layer was trained on the basic dataset with the AdamW optimizer. The multilingual model was used in its basic version (trained on 16 languages) without additional training. Using BERT base was possible thanks to automatic translation of the test dataset into English. The training set was expanded with additional datasets.
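To make the metric and the ensembling idea concrete, here is a minimal sketch in plain Python. It computes ROC AUC as the share of (toxic, non-toxic) pairs that a model ranks correctly, and blends two models by a weighted average of their scores. The function names, the equal weights, and the toy scores are assumptions made for illustration; the article does not say how the team actually combined its classifiers.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC: probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


def blend(predictions: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted average of per-model scores (hypothetical blending scheme)."""
    total = sum(weights)
    n_examples = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_examples)
    ]


# Toy data: 1 = toxic, 0 = non-toxic; scores are invented for illustration.
labels = [1, 0, 1, 0, 1]
model_a = [0.9, 0.4, 0.3, 0.2, 0.8]
model_b = [0.6, 0.7, 0.9, 0.1, 0.9]
blended = blend([model_a, model_b], weights=[0.5, 0.5])

auc_a = roc_auc(labels, model_a)
auc_b = roc_auc(labels, model_b)
auc_blend = roc_auc(labels, blended)
print(auc_a, auc_b, auc_blend)
```

On this toy data each model alone misranks one pair (ROC AUC of 5/6), while the blend ranks every pair correctly, because the two models make their mistakes on different comments. That is the usual motivation for ensembling models that are as different as the ones in the list above.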
On BERT distillation
As you know, models based on the BERT architecture achieve impressive quality scores but still lag far behind in speed. This is because BERT is a model with a large number of weights. There are several ways to shrink a model, and one of them is distillation. The idea behind distillation is to create a smaller “student” model that mimics the behavior of the larger “teacher” model. The Russian student model was trained on a news dataset for 100 hours on four 1080 Ti cards. The resulting student model is 1.7 times smaller than the original model… The quality of the student and teacher models was compared on the Mokoron dataset for determining text sentiment. The student model performed comparably to the teacher model. The training script was written using the catalyst package… You can read more about the project on Medium…
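The "student mimics teacher" idea is often implemented as a loss that pulls the student's output distribution toward the teacher's. The sketch below shows one common variant, the temperature-softened KL divergence, in plain Python. It is an assumption that a loss of this kind applies here: the article does not state which distillation objective the project used, and the function names, the temperature value, and the toy logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature; higher values give a softer distribution."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )
    return kl * temperature ** 2


# Toy logits for a three-class problem (invented values).
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]
far_student = [-1.0, 0.5, 2.0]

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

A student whose logits are close to the teacher's gets a loss near zero, and one that disagrees gets a larger loss. In practice this term is usually combined with the ordinary supervised loss on the labeled data.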
Open Data Science Question Answering
Ilya Sirotkin, Yuri Zelensky, Ekaterina Karpova
It all started with a post in ODS by Ekaterina Karpova. The idea was quite ambitious: to create an automatic question answering system for the ODS Slack community based on the collected Q&A dataset. However, preliminary analysis revealed that most of the questions are quite unique, and creating a labeled test sample for assessing quality is a rather laborious task. Therefore, it was decided to start with a classifier that determines which ODS Slack channel a question belongs to. It would help newcomers to ODS ask their questions in the relevant channel. The pwROC-AUC metric was chosen to assess quality.
The project included a comparative analysis of popular text classification models. The best of them, a RuBERT-based model from DeepPavlov, reached 0.995 pwROC-AUC. Such high scores indicate a high degree of separation (and separability) in the original data. The only channel that was problematic for all the tested models is _call_4_colaboration. Why exactly has not yet been determined.
Having dealt with this task, the team has not given up hope of returning to the original task of answering questions from ODS users.
Russian Aspect-Based Sentiment Analysis
In this project, the team solved the problem of determining sentiment toward a given object in a text (problem C from the Dialogue Evaluation 2015 competition). Both Russian and English datasets were used. The main comparison was between modern models based on the ELMo (from the RusVectores package) and BERT (from DeepPavlov) architectures. The ELMo + CNN model for Russian showed quality comparable to the best model from the competition, despite the small training sample and strong data imbalance.
Kaggle: Tweet Sentiment Extraction
According to the competition rules, the task was to extract from the tweet text a keyword or phrase that determines the sentiment of the tweet. The word-level Jaccard score was used as the quality metric. In this competition, all competitors faced noisy data and ambiguous labeling. The team used a public notebook model built on RoBERTa-base as the baseline. This model uses a reading comprehension approach, in which the start and end of the key phrase are predicted (with the mandatory condition that the end cannot precede the start). As usual, an ensemble of various models performed better than the individual models. The result was bronze (135th place out of 2100)… In the experience of the competition winner, two-level annotation gives even better scores.
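Both the metric and the span constraint are easy to show in a few lines of Python. The sketch below computes a word-level Jaccard score and then picks the start and end token positions with the highest joint probability, allowing only pairs where the end index is not smaller than the start index. The function names, the handling of two empty strings, and the toy probabilities are assumptions for illustration and are not taken from the team's notebook.

```python
from typing import List, Sequence, Tuple


def jaccard(text_a: str, text_b: str) -> float:
    """Word-level Jaccard score: shared unique words divided by all unique words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # assumption: two empty strings count as a perfect match
    shared = words_a & words_b
    return len(shared) / (len(words_a) + len(words_b) - len(shared))


def best_span(start_probs: Sequence[float], end_probs: Sequence[float]) -> Tuple[int, int]:
    """Return (start, end) maximizing start_probs[start] * end_probs[end] with end >= start."""
    best = (0, 0)
    best_score = -1.0
    for start, p_start in enumerate(start_probs):
        for end in range(start, len(end_probs)):
            score = p_start * end_probs[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy example: per-token probabilities are invented, not model output.
tokens: List[str] = "my day was so awesome thanks".split()
start_probs = [0.05, 0.05, 0.10, 0.50, 0.25, 0.05]
end_probs = [0.05, 0.05, 0.45, 0.05, 0.40, 0.00]

start, end = best_span(start_probs, end_probs)
predicted = " ".join(tokens[start:end + 1])
score = jaccard(predicted, "awesome")
print(predicted, score)
```

Taking the two argmax positions independently would give start 3 and end 2 here, which is not a valid span. The constrained search returns "so awesome" instead, which shares one of its two unique words with the reference phrase "awesome" and therefore scores 0.5.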
Automatic solution of the exam
Mikhail Teterin and Leonid Morozov
The goal of this project is to improve the quality metrics on three tasks from the AI Journey 2019 competition (automatic solution of the exam), namely:
- search for main information in the text;
- determining the meaning of a word in a given context;
- placement of punctuation marks in sentences.
In all three tasks, we managed to surpass the best solution in the competition. Much of the improvement is due to the use of additional training data. The best quality was shown by models based on RuBERT from DeepPavlov…
In this article we tried to cover some of the projects presented at the seminar, but of course there were more.
Thanks to everyone who took an active part in the course and did not give up. For those who are just learning and looking for interesting problems in the field of NLP, we recommend considering the DeepPavlov Contribute project… The future of Conversational AI is in your hands!