It is April of the final school year, and students are increasingly haunted by the thought that it is time to write a thesis. "Write" here means figuring out how to quickly cook up something at least loosely related to the topic that, apparently, the supervisor approved. And yes, it needs at least 80 pages and has to comply with all sorts of GOSTs … Clearly there is no time to type that much coherent text yourself (and someone might even look into the substance of the work, heaven forbid!). The obvious move is to take a finished thesis that has already been defended: quality work, tested and approved. The situation is familiar to all of us. The only open question is how to get that work through the borrowing check … Internet searches and conversations with fellow sufferers lead the student to the following options:
- Write the work yourself;
- Rephrase the text (expensive and difficult);
- Outwit the system with "technical workarounds."
Let's see what technical workarounds are, how we catch them, and why using them is not a good idea …
Rephrasing can help pass off someone else’s text as your own if it is done well. However, high-quality rephrasing is a very laborious process for which the student most likely has neither the time nor the money. Simple paraphrasing methods (for example, synonym substitution) produce a result that will not only be detected by the Anti-Plagiarism system but will also, most likely, amuse the supervisor and the certification committee.
Thus we come to the most creative and most popular approach among students: technical workarounds, that is, document transformations that leave the appearance of the original document unchanged but alter the text extracted by the checking system.
When dealing with technical workarounds (hereafter simply “workarounds”), the Anti-Plagiarism system has two tasks:
- Detect potential workarounds and notify the user about them;
- Clean the workarounds out of the checked text.
The general scheme for processing workarounds is as follows:
- Detect workarounds and save information about them;
- Clean the workarounds out of the extracted text;
- Determine how “suspicious” the document is based on the workarounds found;
- Show the suspiciousness information to the user and display the workarounds found.
This is how it looks in practice.
Document in docx format:
Checking the document with workaround detection turned off:
The document has one hundred percent originality.
Checking the same document with workaround detection turned on, we see the originality drop to 0.
In addition, the system marks the document as “Suspicious” and shows the user where and what kind of workarounds were detected:
Since the purpose of technical workarounds is to raise the originality score of the document, it is useful to classify them by how they affect the check. The main unit in checking a document for borrowing is the word, so workarounds can be divided into the following types according to their effect on the words extracted from the document:
- Changing a word (the word in the extracted text differs from the word displayed in the source document);
- Adding a word (the word is not visible in the source document but appears in the extracted text);
- Deleting a word (the word is visible in the source document but absent from the extracted text);
- Splitting a word (the word is displayed normally in the source document but is divided into two or more parts in the extracted text);
- Merging words (several words are displayed in the source document but are merged into one word in the extracted text).
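The five types above can be illustrated with a small sketch. It assumes, purely for illustration, that two word lists are already available: one for what the reader sees and one for what the extractor returns. The function name `classify_workarounds` and the labels are hypothetical and are not taken from the Anti-Plagiarism system.

```python
from difflib import SequenceMatcher


def classify_workarounds(displayed, extracted):
    """Label every difference between the displayed and the extracted words.

    Returns a list of (kind, displayed_words, extracted_words) tuples, where
    kind is one of: changed, added, deleted, split, merged.
    """
    findings = []
    matcher = SequenceMatcher(a=displayed, b=extracted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        shown, got = displayed[i1:i2], extracted[j1:j2]
        if tag == "insert":
            kind = "added"      # present only in the extracted text
        elif tag == "delete":
            kind = "deleted"    # present only on the page
        elif len(shown) == 1 and len(got) > 1 and "".join(got) == shown[0]:
            kind = "split"      # one visible word became several pieces
        elif len(got) == 1 and len(shown) > 1 and "".join(shown) == got[0]:
            kind = "merged"     # several visible words became one
        else:
            kind = "changed"
        findings.append((kind, shown, got))
    return findings
```

In practice the "displayed" words are not directly available, which is why the detection methods described later rely on document attributes and rendering rather than on a plain comparison like this one.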
Let's see what workarounds we run into. Let's start from the simple ones and go towards the most interesting ones.
Workarounds of this type are not tied to the document format in any way: they change the string value of words while the words continue to look identical to the originals.
One of the first workarounds we recorded was replacing letters with homoglyphs, symbols that look like the original letters but have different code points. Homoglyph substitution has been used since the earliest days of the Anti-Plagiarism system, and although we have been catching it for a long time, we still encounter it in student work.
Homoglyphs are easy to find and clean when the language of each word is known. We can determine the language of each word quite reliably, even when the text mixes several languages and contains a large amount of “garbage” (homoglyphs and other extraneous characters). How we do that is a topic for a separate article. Given the language of the word and a list of possible homoglyphs for that language, we restore the letters of the original language and save information about the homoglyphs found.
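The restoration step can be sketched roughly as follows. This is a minimal illustration that assumes the language of the word has already been determined elsewhere; the lookup table is a deliberately tiny, hypothetical sample, and `restore_homoglyphs` is an illustrative name rather than the system's actual function.

```python
# Hypothetical sample table: Latin letters that look like Cyrillic ones.
# Cyrillic letters are written as escapes so the two alphabets cannot be confused.
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
}

# For each language: foreign look-alike character -> native letter.
HOMOGLYPHS = {
    "ru": LATIN_TO_CYRILLIC,
    "en": {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()},
}


def restore_homoglyphs(word, language):
    """Return (restored_word, found), where found lists (position, character)
    for every foreign look-alike that was replaced by a native letter."""
    table = HOMOGLYPHS[language]
    restored, found = [], []
    for position, char in enumerate(word):
        native = table.get(char)
        if native is None:
            restored.append(char)
        else:
            restored.append(native)
            found.append((position, char))
    return "".join(restored), found
```

The saved positions are what would later allow a report to point at the exact characters that were substituted.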
Another way to change the string value of words without noticeably changing how they look is to use invisible or barely visible Unicode characters. Inserting such a character into a word changes the string value of the word while leaving its display practically unchanged.
Many of these characters belong to the Unicode general categories “Other, Control” (Cc) and “Mark, Nonspacing” (Mn).
The system simply deletes these characters and, when there are many of them, flags the document as suspicious and displays the removed non-printing characters in the report.
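A minimal sketch of this cleaning step, using Python's standard `unicodedata` module, might look like this. The inclusion of the "Other, Format" (Cf) category, the whitelist of ordinary whitespace controls, and the threshold value are assumptions made for the example, not details from the system. Note also that removing nonspacing marks blindly would damage legitimately decomposed text, so a real implementation would probably normalize the text first.

```python
import unicodedata

# Cc and Mn come from the description above; Cf (zero-width spaces,
# soft hyphens) is an assumption added for this example.
SUSPICIOUS_CATEGORIES = {"Cc", "Mn", "Cf"}
ORDINARY_WHITESPACE = {"\n", "\r", "\t"}
SUSPICION_THRESHOLD = 10  # assumed value, purely illustrative


def strip_invisible(text):
    """Return (cleaned_text, removed), where removed lists (position, code point)."""
    kept, removed = [], []
    for position, char in enumerate(text):
        is_hidden = (
            char not in ORDINARY_WHITESPACE
            and unicodedata.category(char) in SUSPICIOUS_CATEGORIES
        )
        if is_hidden:
            removed.append((position, f"U+{ord(char):04X}"))
        else:
            kept.append(char)
    return "".join(kept), removed


def looks_suspicious(removed):
    """Flag the document when many hidden characters were removed."""
    return len(removed) > SUSPICION_THRESHOLD
```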
As we said earlier, our key format for processing documents is pdf. We convert all other document types to pdf so that the basic processing logic is the same for every supported format. The workarounds that can be implemented in pdf documents are therefore of particular interest to us.
One of the first workarounds that comes to mind is to make something small and invisible. Such text is not visible when viewing the original document, but the system still extracts it. The implementation is very simple: set the minimum font size for the text and change its color. Catching workarounds of this type is just as simple: check the font size of the text and the geometric dimensions of individual words. Because such text takes up so little space, students often add whole paragraphs of it to the page:
Display of a detected workaround attempt:
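As a rough sketch of the size check described in this section, assume each extracted word comes with its font size and the height of its bounding box, both in points. The `Word` structure and both thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

MIN_FONT_SIZE = 4.0   # points, assumed threshold
MIN_WORD_HEIGHT = 3.0  # points, assumed threshold


@dataclass
class Word:
    text: str
    font_size: float  # declared font size in points
    height: float     # height of the word's bounding box in points


def find_tiny_words(words):
    """Return the words whose font size or rendered height is suspiciously small."""
    return [
        word for word in words
        if word.font_size < MIN_FONT_SIZE or word.height < MIN_WORD_HEIGHT
    ]
```

Checking the geometric height in addition to the declared font size catches text that keeps a normal font size but is shrunk by other means.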
Changing the text color to match the background
Although this method is often combined with the previous one, its use on its own is more interesting. To detect and clean a workaround, it is enough to establish that at least one parameter of a word or symbol has a “suspicious” value. And while detecting an abnormally small word is trivial, detecting text whose color matches the background is a more complicated procedure.
Detecting invisible text is complicated by the following circumstances:
- It is not always possible to get the color of a particular character from the pdf;
- The background behind the word may not be white, and the word may even sit on top of an image;
- Words and symbols can overlap each other.
To get around the first two difficulties, we determine whether text is “invisible” by analyzing the rendered image of the document page:
- We locate the area of the page containing the word;
- We calculate the variance of the pixel values in that area. If the variance is below a certain threshold, the area has a uniform color and no letters are visible, so we are looking at an attempt to bypass the system.
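The two steps above can be sketched as follows. The example assumes the page has already been rendered to a grayscale bitmap, represented here as a list of rows with values from 0 to 255; the function name and the threshold are illustrative assumptions.

```python
from statistics import pvariance

VARIANCE_THRESHOLD = 25.0  # assumed value; would need tuning on real pages


def region_is_uniform(page, box, threshold=VARIANCE_THRESHOLD):
    """Return True if the word's area on the rendered page has (almost) one color.

    page: grayscale image as a list of rows, each a list of 0..255 values.
    box:  (left, top, right, bottom) in pixels, right and bottom exclusive.
    """
    left, top, right, bottom = box
    pixels = [page[y][x] for y in range(top, bottom) for x in range(left, right)]
    if not pixels:
        raise ValueError("the box does not cover any pixels")
    return pvariance(pixels) < threshold
```

Because the check looks only at the rendered pixels, it does not need to know the character color or the background color, and it also works when the word sits on top of an image.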
Words and symbols hidden behind one another
Invisible characters cannot be detected by analyzing the area they occupy if they are hidden behind other, “visible” characters. To detect such “hidden” characters we therefore have a separate procedure that analyzes the intersections of symbol areas and marks the characters that are largely covered by others.
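A simplified sketch of such an intersection check is shown below. It assumes that character bounding boxes are given in drawing order, so that later characters are painted on top of earlier ones, and it compares each character only with the single character that covers it most. Both simplifications, as well as the 80% threshold, are assumptions for the example.

```python
def _area(box):
    left, top, right, bottom = box
    return max(0.0, right - left) * max(0.0, bottom - top)


def _intersection(a, b):
    # The result may have negative width or height; _area treats it as empty.
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def find_covered_chars(boxes, min_covered=0.8):
    """Return indices of characters largely covered by a later-drawn character.

    boxes: list of (left, top, right, bottom) tuples in drawing order.
    """
    covered = []
    for index, box in enumerate(boxes):
        own_area = _area(box)
        if own_area == 0:
            continue
        best_overlap = max(
            (_area(_intersection(box, later)) for later in boxes[index + 1:]),
            default=0.0,
        )
        if best_overlap / own_area >= min_covered:
            covered.append(index)
    return covered
```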
Text as Images
What happens if we replace part of the text with images containing that text? Done carefully enough, the document looks unchanged, but when the text layer is extracted, the words in the pictures are naturally not extracted. To close this gap, we use optical character recognition.
Workarounds using docx to pdf conversion features
Converting documents to pdf is not a trivial task. You can read about how we chose the most suitable solution for us here (https://habr.com/ru/company/antiplagiat/blog/458842/). Unfortunately, even the best of the options we analyzed does not convert documents to pdf perfectly. Some "features" of the conversion are actively exploited in attempts to bypass the system.
Formulas and a number of other objects containing text are “lost” after conversion to pdf. This makes it possible to hide a whole paragraph of text or, for example, every second word:
When converting to pdf, we get the following result:
To detect and clean this and other workarounds that exploit the quirks of docx-to-pdf conversion, we analyze and clean the source docx file. In particular, if a significant number of formulas are found in a document, we replace them with plain text, which survives the conversion to pdf. We also remember the positions of the formulas we processed and, if necessary, inform the user that the document is suspicious and highlight the text we restored from the formulas.
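Finding formula text in a docx file can be sketched with the standard library alone, since a docx is a zip archive whose main part is `word/document.xml` and formulas are stored as Office Math (`m:oMath`) elements. The sketch below only collects the text hidden inside formulas; replacing formulas with plain text and tracking their positions, as described above, is left out, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET
import zipfile

MATH_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def read_document_xml(docx_path):
    """Read the main document part from a docx file (a zip archive)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def formula_texts(document_xml):
    """Return the text content of every Office Math formula in the document."""
    root = ET.fromstring(document_xml)
    texts = []
    for formula in root.iter(f"{{{MATH_NS}}}oMath"):
        parts = [node.text or "" for node in formula.iter(f"{{{MATH_NS}}}t")]
        texts.append("".join(parts))
    return texts
```

A document in which formulas carry ordinary words rather than mathematical notation, or in which there are unusually many formulas, would then be a candidate for the "suspicious" mark.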
Scale, small character / line spacing
A number of text properties are not taken into account when converting to pdf: scale, character spacing, and line spacing. This makes it possible to add text that is invisible in the source document (for example, because it has a very small scale) but becomes normal, unremarkable text in the pdf. Workaround implementation (docx):
The result of the conversion to pdf (we changed the color ourselves):
The only way to catch this text is to find it in docx and save information about it. If we find a lot of such text in the document, we mark the document as suspicious and show the user where we found text with suspicious attributes in the document.
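A search for such text in the docx markup might be sketched like this. In WordprocessingML, a run's properties (`w:rPr`) can contain `w:w` (text scale in percent) and `w:spacing` (character spacing in twentieths of a point). The thresholds and function names below are assumptions for the example, and paragraph-level line spacing is not covered.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _tag(name):
    return f"{{{W_NS}}}{name}"


def _int_value(element, default):
    """Read the integer w:val attribute of an element, or return the default."""
    if element is None:
        return default
    raw = element.get(_tag("val"), "").rstrip("%")
    try:
        return int(raw)
    except ValueError:
        return default


def suspicious_runs(document_xml, min_scale=20, min_spacing=-40):
    """Return (text, reasons) for runs with abnormal scale or character spacing.

    min_scale is in percent and min_spacing in twentieths of a point;
    both thresholds are assumed values.
    """
    root = ET.fromstring(document_xml)
    findings = []
    for run in root.iter(_tag("r")):
        props = run.find(_tag("rPr"))
        if props is None:
            continue
        reasons = []
        if _int_value(props.find(_tag("w")), 100) < min_scale:
            reasons.append("scale")
        if _int_value(props.find(_tag("spacing")), 0) < min_spacing:
            reasons.append("spacing")
        if reasons:
            text = "".join(node.text or "" for node in run.iter(_tag("t")))
            findings.append((text, reasons))
    return findings
```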
Breaking a word into pieces
An interesting special case of the properties described in the previous section is adding a space inside a word and hiding it. In the original document the word looks normal and unbroken, but after conversion to pdf it is split into two parts because the space becomes full-sized. We catch this trick in much the same way as in the previous section. Workaround implementation (docx):
The result of the conversion to pdf:
Display of the detected workaround:
Under the old chestnut tree, in the light of day, I betrayed you, and you me …
We have covered the basic, but by no means all, technical ways of implementing workarounds. Of course, we are unlikely ever to make the defense absolute. Nevertheless, we keep improving our system, leaving fewer and fewer opportunities to “deceive” it. During exam season we try to close newly found loopholes especially quickly: often only a few days pass between discovering a gap and closing it in production. That is why it is both a little funny and a little sad to read the advertising "promises" of companies that offer to help students raise the originality of their work and guarantee the result, sometimes for as long as 30 days. Student, you will be betrayed! At best, this “guarantee” will get you a refund from the workaround company, but it will do nothing about a failed thesis and potential expulsion from the university …
Create with your own mind!