Transformers and Hate in Vancouver: How Antiplagiat Went to NeurIPS 2019
At the end of last year, the NeurIPS 2019 conference was held in the Canadian city of Vancouver. A search on Habr returns seven mentions, none of which is a report or a review: a strange gap, given the level and scale of the event in 2019. We at Antiplagiat decided to fill this gap with a story about the impressions of two NeurIPS neophytes in the world of data science haute couture.
Night, Domodedovo, check-in, and then a very short layover in Frankfurt, where it is already clear that the conference will be packed. Hurried people in glasses and corporate hoodies flooded the transit zone, and the boarding queue alone already looked like a good (sorry, nonexistent) Russian conference. Then a ten-hour flight awaited us, which turned into a hackathon: here and there in the cabin, black screens flickered with a terminal or a dark IDE theme. It seems that more code was written in the sky above Greenland than has ever been written on its surface.
The time difference is 11 hours, so upon arrival we immediately faced the brutal reality of jet lag. We settled in not far from the venue (the Vancouver Convention Center, two buildings with a total area of 43,340 sq. m, which is, mind you, almost six football fields) and, having barely held out until the local evening as one is supposed to, fell asleep.
The first day, when our patience was repaid in full.
December 8, the first day of the conference. In a letter sent specially the day before, the organizers stressed that, come hell or high water, you had to register strictly on the first day. We arrived at the agreed 9 am and immediately ran into a queue that starts on the first floor, climbs to the second, folds, curls and folds again, and disappears around the corner. It stretches on and turns another corner, where, after a couple of hours of waiting (the queue for the Anacondaz concert in Moscow, by the way, cleared in just one hour), we get the coveted badges and cool mugs.
Come early, they said … (everyone who registered the next day did so with no trouble at all)
Waving our badges in front of the steadily growing queue, we head to the neighboring building, where Expo Day is scheduled for today: booths and seminars from the big sponsor companies. The seminar rooms are empty and the speakers are trying to hold the attention of what audience remains, while the hall with the company booths is packed. Coffee and sweets are served here, and the industry's leading corporations (Facebook, IBM, Google, Apple, etc.) show themselves off, sign people up on their career sites and generously hand out hats, adapters, socks and invitations to corporate parties. Some seem to be conducting interviews already.
A bag of merch from the sponsors (the bag itself is merch too)
View of the East Center building and the bay
The second day, when everything seemed to be lost.
The next day, things got going. Oleg_Bakhteev and I happily ran off to soak up cutting-edge science. We listened to an excellent talk by Kyunghyun Cho on the imitation learning paradigm, which combines the advantages of RL and classic supervised learning. That, however, was the end of it: the rest of the day was taken up by the now-traditional workshops Black in AI, Women in Machine Learning, LatinX in AI, Queer in AI and New in Machine Learning. Interspersed with these were tutorials, one of three to choose from at a time. Out of Efficient Processing of Deep Neural Network: from Algorithms to Hardware Architectures, Machine Learning for Computational Biology and Health, and Interpretable Comparison of Distributions and Models, we chose efficient deep learning and … lost. The obvious bottlenecks and trade-offs that arise in the pursuit of efficiency were described with enthusiasm and in great detail. For us the day ended with the series of talks Reinforcement Learning: Past, Present, and Future Perspectives, where for almost the whole two hours various computer-simulated stick figures spun around, fell down and got up again on the big screen. It was fun. So much so that we no longer felt like going to the philosophical talk by a psychologist from Berkeley, titled How to Know, with its florid abstract.
The third day, when our minds were filled with hope.
By the time we had despaired of hearing any breakthrough machine learning news from the speakers, knowledgeable people told us that everything cool and real happens at the poster session. Great, it starts today. Let's go listen to the highlights. Highlights are when everyone gathers, sits down and listens to five-minute talks by the authors of the best papers that will appear at the poster session. People desperately try to photograph the presentation and get very upset when the presenter flips past the precious slides. All this, it seems, is meant to keep you from wandering aimlessly among three or four hundred posters and to single out the really interesting ones. After an hour of highlights, we set off to see the posters, confident that there would be plenty of interesting things. The poster session occupies two joined exhibition halls, with a queue stretching along the way to them. Once inside, we scatter to look for related topics and for the material we liked in the highlights. Everything is very good, but to talk with an author you have to stand in line or, having caught the story in the middle, wait for it to start over. Fatigue from the constant queuing and from trying to make out posters over people's heads sets in quickly. Only Schmidhuber, briskly darting about without his cap, gives us strength. In the end, we managed to find and listen carefully to about ten interesting papers. A nice catch compared to the previous days.
The fourth day and the days that followed, when it finally began.
The next day, knowledgeable people again give us a valuable hint: going to the highlights is unnecessary and even counterproductive, because you need to run to the posters while they are still being hung up. There are almost no people, and the authors are already willing to answer questions. So we did. The tactic worked: we talked with colleagues a lot and productively, and saw a large number of interesting papers. We stuck to the same plan from then on, occasionally sampling the talks but always concluding that they were not worth skipping the posters for. The thematic workshops in the last two days of the conference were also pleasingly rich and relevant. The papers, grouped by narrow topic, were hung on the walls of a small room, and there were talks and lively discussions.
Document Intelligence Workshop
We came to NeurIPS 2019 not just for the sake of it, but as participants in the Document Intelligence workshop, which is dedicated to intelligent document processing. The vast majority of the workshop's problems concerned optical text recognition, suppression of artifacts in scanned documents, and extraction of entities from sales receipts or contracts. Oleg_Bakhteev and I presented our work on detecting cross-language borrowings, CrossLang: the system of cross-lingual plagiarism detection, a popular account of which you can read on Habr. Here we will go into more detail, step away from the general impressions of the conference and give a short digest of the workshop papers. The brief and obvious upshot: the past year was the year of BERT in our field. The content of every workshop paper, in (almost) one line each, is below:
- CrossLang: the system of cross-lingual plagiarism detection. Our paper, about a system for detecting translated borrowings. It addresses the problem of finding fragments of a Russian input text that were borrowed from an English-language collection. We used a combination of a machine translator and an encoder-decoder trained in a semi-supervised fashion to compare the translated sentences. The resulting system runs successfully in production, serving a large number of universities.
- Repurposing Decoder-Transformer Language Models for Abstractive Summarization. Addresses abstractive summarization. The authors show that a pre-trained transformer decoder gives good results when the task is treated as language modeling, without beam search or other decoder optimizations, simply decoding greedily.
- From Stroke to Finite Automata: An Offline Recognition Approach. There is an electronic system for teaching Computer Science to students. For studying finite state machines, the authors built a recognition system for hand-drawn diagrams. A dataset for the task is presented.
- Post-OCR parsing: building simple and robust parser via BIO tagging. Splitting the information on receipts into groups. Each token is classified with a Begin-Inside-Outside (BIO) tag using BERT embeddings. The authors built their own dataset for this.
- BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding. The aim is to use both the full page image and the text. With BERT for the text and a CNN for the image, the authors obtain contextual representations of page elements for downstream tasks such as classification. Also applied to receipts.
- Chargrid-OCR: End-to-end Trainable Optical Character Recognition through Semantic Segmentation and Object Detection. OCR is treated as an object segmentation task for very densely packed objects. There is no special preprocessing; raw pixels are fed in. Compared with Tesseract and CNN-RNN.
- SVDocNet: Spatially Variant U-Net for Blind Document Deblurring. Deblurring scanned images with a U-Net.
- Semantic Structure Extraction for Spreadsheet Tables with a Multi-task Learning Architecture. A multi-task framework for working with tables: both the semantics of the cell contents (BERT) and the cell type (CNN) are taken into account.
- Document Enhancement System Using Auto-encoders. Cleaning scanned documents of wear, artifacts and watermarks. The authors took the off-the-shelf Residual Encoder-Decoder Network architecture. The dataset consists of clean documents and their noisy counterparts. The reconstruction error is minimized.
- CORD: A Consolidated Receipt Dataset for Post-OCR Parsing. The authors built a dataset of receipts annotated with regions and their values.
- On recognition of Cyrillic Text. The authors built a dataset for recognizing handwritten text in Cyrillic-script languages.
- Representation Learning in Geology and GilBERT. Search for similar terms in geological documents using BERT.
- Neural Contract Element Extraction Revisited. Extracting entities from contracts: parties, dates, amounts of money, etc. The task is treated as sequence labelling. The authors tried BiLSTM, dilated CNN, transformer and BERT. BiLSTM with a CRF on top worked best. Domain-specific w2v embeddings were used as inputs.
- Doc2Dial: a Framework for Dialogue Composition Grounded in Business Documents. A dialogue agent that responds to a user request based on a collection of documents.
- On Domain Transfer for Intent Predicting in Text. A paper about the situation where public datasets (emails) exist, but we want to apply them to closed datasets (real user letters). The latter may come from a different distribution and violate the basic assumptions of ML. Various techniques for detecting distribution differences are introduced.
- Towards Neural Similarity Evaluators. Addresses summarization and its quality metric. BLEU and ROUGE have many problems, so the authors took the RoBERTa architecture and fine-tuned it on a sentence similarity task. The quality metric is a comparison of the resulting vector representations.
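Several papers in the digest (Post-OCR parsing, Neural Contract Element Extraction Revisited) treat extraction as per-token tagging. As a rough illustration of what happens after the tagger has run, here is a minimal sketch of grouping BIO-tagged tokens into labelled spans. It is not the authors' code: the function name `bio_to_spans`, the labels `ITEM`, `PRICE`, `TOTAL` and the sample receipt are made up for this example, and the tags are assumed to come from some upstream classifier (BERT-based in the paper) that is out of scope here.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans.

    Tags are assumed to look like "B-ITEM", "I-ITEM" or "O".
    An "I-" tag that does not continue the current span starts a new one,
    so slightly inconsistent tagger output is tolerated.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the span being built, if there is one.
        nonlocal current_label, current_tokens
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        current_label = None
        current_tokens = []

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "B" or label != current_label:
            # Explicit span start, or an "I-" tag with a different label.
            flush()
            current_label = label
            current_tokens = [token]
        else:
            current_tokens.append(token)
    flush()
    return spans


# Hypothetical receipt line and tags, for illustration only.
tokens = ["Latte", "x2", "9.00", "TOTAL", "9.00"]
tags = ["B-ITEM", "I-ITEM", "B-PRICE", "O", "B-TOTAL"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

The tagging scheme keeps the model simple (one label per token), and all the grouping logic lives in a small deterministic post-processing step like this one.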
In the end, as expected, conclusions. The conference spends its first two or three days warming up, so if you are going for the science, you can safely skip them or explore Vancouver and its surroundings while recovering from jet lag. If you are going to land a job in industry or academia (and to pick up merch), Expo gives you a chance to find work at a large (or not so large) company. And all the stars of academia and the lab heads are at the conference too, so there is a chance to meet them and chat.
That is how NeurIPS 2019 turned out for us 🙂 We hope the article was interesting and useful for Habr's ML community.