How building a binary classifier opened a Pandora’s box in English proficiency standards

English proficiency is usually assessed according to the CEFR (Common European Framework of Reference for Languages), which consists of six levels: A1 is for beginners, and C2 denotes mastery of a foreign language. The international C2 level is often described as the “level of an educated native speaker”, and obtaining the corresponding certificate is often either a cherished dream or a source of pride for a language teacher.

However, I have not seen evidence in the scientific literature that the C2 level fully corresponds to native proficiency in English. In fact, researchers have not reached a consensus on whether language learners can achieve a level identical to native proficiency at all (here are two articles with almost the same title and opposite conclusions [1; 2]). After conducting a small survey on a social network, I saw that most of my fellow English teachers still believe, deep down, that “there is an abyss between the native level and the C2 level”, although some chose the option that C2 really is the level of an educated native speaker.

So is there a difference or not? I decided to find out, starting with just one aspect of language proficiency: writing. I want to tell you about my experiment, in which artificial intelligence was involved.

First, I created a survey in Google Forms and set 17 Russian-speaking colleagues the following challenge: to determine whether an English text was written by a native speaker (British) or by a Russian-speaking author with C1-C2 English. There were 20 texts in total. I invited experts with extensive experience in marking student essays and reading authentic texts, but the task still proved difficult. Calculating the metrics by hand gives: Accuracy = 0.6617; Precision = 0.6627; Recall = 0.6588; F1 = 0.66. I should note that I also offered this survey to native Britons (only three so far), and the preliminary result is about the same.

I could have stopped there and drawn the reassuring conclusion that native speakers and advanced users write indistinguishable texts, since the experts could not reliably tell them apart.

But something made me dig deeper and apply my modest knowledge of deep learning. The result was a binary classifier based on XLM-RoBERTa, which learned to distinguish texts written by native speakers from essays by Russian-speaking authors at the C1-C2 level. Here are the details.

The first step was to build a corpus of texts. Colleagues who prepare advanced students (often teachers themselves) for international exams donated 160 essays to science, all in a semi-formal newspaper style (articles, essays, reviews and letters). I divided them into training, validation and test sets in the customary proportion of 70% : 15% : 15%.
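A 70/15/15 split like this can be sketched in a few lines of Python. This is only an illustration, not the code used in the experiment: the function name `split_dataset`, the fixed random seed and the placeholder essay identifiers are all assumptions.

```python
import random


def split_dataset(items, train=0.70, val=0.15, seed=42):
    """Shuffle items and cut them into train / validation / test parts.

    The test part receives whatever remains after the train and
    validation parts are taken (15% with the default proportions).
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_part = shuffled[:n_train]
    val_part = shuffled[n_train:n_train + n_val]
    test_part = shuffled[n_train + n_val:]
    return train_part, val_part, test_part


# Hypothetical identifiers standing in for the 160 donated essays.
essays = [f"essay_{i:03d}" for i in range(160)]
train_set, val_set, test_set = split_dataset(essays)
print(len(train_set), len(val_set), len(test_set))
```

With 160 items this yields 112 training, 24 validation and 24 test texts. In a two-class corpus it would be sensible to split each class separately so that every part stays balanced.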

For the corpus of native-speaker media texts, I decided to use a ready-made dataset prepared in Cambridge [3]. I took 160 texts from those used to assess reading skills in the international Cambridge exam CPE (Certificate of Proficiency in English). I assumed these would be authentic texts, but something went wrong: training on texts from the Cambridge English Readability Dataset (2016) gave a very poor result (Accuracy = 0.57).

Again, one could assume that this reflects a lack of difference between the texts and, therefore, between the levels of language proficiency. But a closer look at the Cambridge dataset showed that its texts contain words that the Cambridge Dictionary [4] marks as obsolete (for example, “brouhaha”). The authors of the dataset do not say exactly when the examination texts were written, but it was probably around the 1990s. The texts may also have been edited to suit the exam format or written specifically for it. In addition, most of the texts contain formatting errors, such as missing spaces and punctuation between the title and the body, “glued” sentences (with no space between them) and missing apostrophes (e.g. “concert-goers experience”). All of this could, of course, hinder the training of the neural network.
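Formatting defects such as “glued” sentences can be spotted automatically before training. The sketch below is a rough heuristic of my own, not part of the original experiment: it assumes that a lowercase letter, a sentence-final mark and an uppercase letter with no space between them signal a glued boundary. The names `find_glued` and `fix_glued` and the sample sentence are illustrative.

```python
import re

# Lowercase letter + sentence-final punctuation + uppercase letter, no space.
GLUED = re.compile(r"([a-z][.!?])([A-Z])")


def find_glued(text):
    """Return the three-character snippets where two sentences are glued."""
    return [m.group(0) for m in GLUED.finditer(text)]


def fix_glued(text):
    """Insert the missing space at every glued sentence boundary."""
    return GLUED.sub(r"\1 \2", text)


sample = "The hall was full.Nobody left early. It lasted 3.5 hours."
print(find_glued(sample))
print(fix_glued(sample))
```

The pattern ignores decimals such as “3.5”, but it will misfire on abbreviations like “Ph.D”, so flagged texts are best reviewed by hand rather than repaired blindly.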

Convinced that the Cambridge dataset was flawed, my inner perfectionist told me to try harder and collect a corpus of native-speaker texts myself. I did so using the websites of well-known British publications such as The Independent, The Guardian, Reader’s Digest UK, Vogue UK, The Evening Standard and others. When selecting texts, I took genre and length into account, bearing in mind that the model cannot handle more than 512 tokens at a time anyway. I also decided to drop the headlines, since their mere presence could become a marker for the model.

And the result? With the new corpus, accuracy rose to 0.957. It later improved a little more after some trial-and-error tuning, and the model now works with the following metrics: Accuracy = 0.9782; Precision = 1.0; Recall = 0.9583; F1 = 0.9787. And this is where it starts to get interesting for me as a linguist.

I then ran the same survey that I had offered to my fellow experts through the classifier. It made a mistake on one text out of 20: it mistook a native speaker for a non-native speaker. Overall: Accuracy = 0.95; Precision = 1.0; Recall = 0.9; F1 = 0.947. By the way, none of the expert respondents completed the survey with such accuracy.
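The figures for the 20-text survey can be reproduced from the confusion-matrix counts. The following is a minimal sketch under two assumptions: native-speaker texts are treated as the positive class, and the survey contained 10 native and 10 non-native texts (the split implied by Recall = 0.9 with a single missed native text, although it is not stated explicitly above). The function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Assumed counts: 9 native texts recognised, 1 native text labelled
# non-native, all 10 non-native texts labelled correctly.
scores = binary_metrics(tp=9, fp=0, fn=1, tn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Precision stays at 1.0 because the classifier never labelled a non-native text as native; the single error lowers only recall and, through it, F1.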

Thus, the AI classifier coped with the binary classification of English texts by their authors’ native language much better than qualified specialists did. This allows us to draw several interesting conclusions:

1) the difference we were looking for exists. Using artificial intelligence, we showed that English texts written by native speakers are very likely to differ, in some systematic features, from texts written by native Russian speakers who have C1-C2 English according to the CEFR;

2) artificial intelligence recognizes these differences with much higher accuracy than human experts.

The results of the study seem to provide food for thought.

I should say right away that I do not want this work to encourage “native-speakerism”, since I am against discriminating against teachers on the basis of their native language. Very often, knowing Russian is a great advantage for an English teacher. For example, I specialize in teaching British pronunciation, and after numerous internships in the UK I am convinced that few Britons could teach British sounds to a Russian student as well as a professional Russian-speaking phonetician, who can rely on the student’s native articulatory base and on personal experience of learning to produce those sounds.

However, it must be admitted that a difference in written texts that could only be reliably detected with the help of AI indicates a certain “grey zone” between the C2 level and native English. Studying this grey zone would, firstly, improve our understanding of how written texts are produced in English and, secondly, help both teachers and learners to develop writing skills more effectively.

One last thought arises from the results of the experiment: if expert teachers could not tell a student from a native speaker but AI could, does this not open the door to a world where human proficiency in at least some language skills will be judged by a non-human?


1. LDG Martha Adriana Maza Calvino (2011). Is it possible to achieve native-like competence in second language acquisition? Tlatemoani. Revista Academica Investigacion, 9 pp.

2. Silvina Montrul and Roumyana Slabakova (2001). Is native-like competence possible in L2 acquisition? Proceedings of the 25th BUCLD, 13 pp.

3. Menglin Xia, Ekaterina Kochmar and Ted Briscoe (2016). Text Readability Assessment for Second Language Learners. Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications.

4. Cambridge Dictionary Online
