The Binoculars method promises highly accurate detection of text from large language models

ChatGPT writes as well as a human, but can the “machine” in a text be detected? Although it would be more profitable for some companies to pretend that the output of language models is indistinguishable from human writing, research in this direction is actively underway. The authors of the paper “Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text” (arXiv:2401.12070) claim that their method has a low false-positive rate (0.01%), correctly detects text from language models over 90% of the time, and works for several families of modern models.


People and worms

The range of behaviors of large language models is wide, which makes building a tool for detecting LLM-generated text much harder. Some even consider it impossible.

The practical value of such a tool is difficult to overestimate: it is sorely needed for reducing harm from artificial intelligence. Its applications are varied, but with the proliferation of transformer-based LLMs, all of them are hard to achieve.

For example, fighting fake news, spam and other generated content is difficult. A crowd of bots in the comments is not distant science fiction: people are complaining about it already. Under a tweet with news about Finland closing all of its borders, a pile of approving but clearly LLM-written comments suddenly appears.

And little can be done about it. One paper (arXiv:1905.12616) showed that people distinguish a machine-written fake news article from a real, human-written one with only 73% accuracy. That article dates back to 2019, and language models have only improved since then.

It is also obvious that an LLM detector is sorely needed for catching plagiarism and fraud in academia, which also means tracing a text back to its original source. But even here, detecting traces of even not the most technically advanced language models is difficult, if not impossible.

As shown in 2022 (arXiv:2201.07406), a student armed with the open GPT-J (6 billion parameters) can easily deceive MOSS, the system for detecting plagiarism in code. Although GPT-J is not trained for such tasks, it writes code for an introductory programming course that raises no suspicion from MOSS. The code GPT-J produces is varied in structure and lacks telltale features that would easily betray the use of an LLM.

An overzealous detector is also bad. In one experiment (arXiv:2304.02819), commercial GPT detectors were shown to erroneously flag essays by TOEFL test takers as LLM output. The reason lies in the data on which these tools were trained and evaluated: no one thought about non-native speakers.

It seems that no one can detect AI-generated text. Even OpenAI, the leader in transformer-based text bots, shut down its own tool for detecting LLM text. The company did so in July 2023, quietly, without an announcement and without commenting on the failure.

In September the media discovered that OpenAI had admitted its own helplessness in this matter. In the FAQ section for educators in one of its articles, the company explicitly concedes: AI detectors do not work. As OpenAI writes, its own research has shown such tools to be unreliable. In the same paragraph, the company warns that it does not vouch for the accuracy of third-party AI detectors.

However, OpenAI now earns its income and investment from ChatGPT and DALL-E, not from pondering the dangers of artificial intelligence. Why scare venture capital with talk about the harm of language models to the global economy and society? Recent shakeups at the company and the general rocking of the ship hint that the theorists have been pushed deep into the background, making room for managers and entrepreneurs.

The rest of the scientific community has not stopped working. How accurately can one determine whether a text was written by a human or an LLM? Several papers (arXiv:2309.08913, arXiv:2002.03438, arXiv:2303.11156) attempt to estimate the limits of accuracy and generally agree that the output of sufficiently general models will be indistinguishable from human writing. Others (arXiv:2304.04736), however, argue that even near-perfect LLMs can be detected given enough samples.

The most straightforward detection methods suggest watermarking the generated text (arXiv:2301.10226) or storing it on the service side in a database for later lookup by semantic similarity (arXiv:2303.13408). Of course, the steady development of open LLMs, which run locally on the user’s hardware and can be fine-tuned in any way, puts an end to such recommendations.

Methods for analyzing already generated text can be divided into two groups: detectors trained on samples of human and machine text, and zero-shot statistical methods that require no training and instead rely on signals, such as perplexity, computed by language models themselves. Binoculars belongs to the second group.

Binoculars

In the paper “Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text” (arXiv:2401.12070), a method is proposed for detecting LLM texts with high accuracy. The researchers report that Binoculars outperforms both other published methods and commercial solutions.

As the name suggests, binoculars are an optical instrument made of two telescopes. In much the same way, the Binoculars text-evaluation tool looks at text through the lenses of two language models.

Perplexity is one of the most common metrics in working with LLMs. In simple terms, perplexity shows how unusual and surprising data looks to the model. It would seem the problem is solved: detecting AI texts via perplexity should be easy, since human texts are freer and more original and thus have higher perplexity.
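
In code, log-perplexity boils down to the mean per-token cross-entropy of a causal language model. Here is a minimal sketch using Hugging Face transformers; the model choice and the helper name `log_perplexity` are illustrative, not taken from the paper’s code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def log_perplexity(text: str, model, tokenizer) -> float:
    """log PPL(s): mean negative log-likelihood per token of s."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels == input_ids the model shifts them internally and
    # returns the mean cross-entropy over predicted tokens as `loss`.
    return model(input_ids=ids, labels=ids).loss.item()

# Small stand-in model for the sketch; any causal LM works the same way.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
print(log_perplexity("To be, or not to be, that is the question.", model, tok))
```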

However, as the authors of the Binoculars method show, perplexity alone is not enough. To illustrate, here is a text about a capybara astrophysicist, which the paper calls the capybara problem:

Dr. Capy Cosmos, a capybara unlike any other, astounded the scientific community with its groundbreaking research in astrophysics. With his keen sense of observation and unparalleled ability to interpret cosmic data, he uncovered new insights into the mysteries of black holes and the origins of the universe. As he peered through telescopes with his large, round eyes, fellow researchers often remarked that it seemed as if the stars themselves whispered their secrets directly to him. Dr. Cosmos not only became a beacon of inspiration to aspiring scientists but also proved that intellect and innovation can be found in the most unexpected of creatures.


For all the expressive phlegm of its muzzle, the South American capybara cannot do science. It is also unlikely that even GPT-4 could write such a text on its own, without outside help.

Indeed, this text is the result of asking ChatGPT: “Can you write a few sentences about a capybara that is an astrophysicist?”

Yet when analyzed with the Falcon LLM, the text’s perplexity is high, 2.20, well above the average for both machine and human text. The DetectGPT tool assigns the text a score of 0.14, below its 0.17 cutoff, so the text is deemed human-written. GPTZero errs as well, estimating the probability of AI authorship at 49.71%.

The catch is that in the real world only the response gets analyzed, not the prompt together with it. Without the prompt, “astrophysicist” and “capybara” appearing in one sentence looks highly unexpected, and perplexity shoots up.

To solve the capybara problem, the authors of Binoculars introduce, alongside perplexity, cross-perplexity: a measure of how unusual one model’s next-token predictions look to a different model.

The cross-perplexity formula measures the average per-token cross-entropy between the outputs of models M1 and M2. The · sign denotes the dot product. (Naturally, both models must use the same tokenizer.)

$\begin{equation} \log \text{X-PPL}_{M1,M2}(s) = -\frac{1}{L} \sum_{i=1}^{L} M1(s)_i \cdot \log(M2(s)_i) \end{equation}$

Here $M1(s)$ and $M2(s)$ are shorthand for $M1(T(s))$ and $M2(T(s))$, where $s$ is the character sequence being tokenized, $T$ is the tokenizer, and $L$ is the number of tokens in $s$.
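
A hedged sketch of this computation in Python (again with Hugging Face transformers; the helper name is ours, and the code assumes both models share one tokenizer, as the formula requires):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def log_cross_perplexity(text: str, m1, m2, tokenizer) -> float:
    """log X-PPL_{M1,M2}(s): average cross-entropy between M1's and M2's
    next-token distributions over the tokens of s."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Next-token distributions at every position (drop the last logit,
    # which predicts a token beyond the end of the string).
    p1 = F.softmax(m1(ids).logits[0, :-1], dim=-1)          # M1(s)_i
    log_p2 = F.log_softmax(m2(ids).logits[0, :-1], dim=-1)  # log M2(s)_i
    # Dot product at each position, averaged over the predicted tokens,
    # with a minus sign, as in the formula above.
    return -(p1 * log_p2).sum(dim=-1).mean().item()
```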

The Binoculars method is built on a two-model mechanism. Instead of analyzing raw perplexity, it measures how unexpected the tokens of a string appear relative to the baseline perplexity of a language model shown the same string.

A text may by its nature have high perplexity regardless of who continues it, human or machine. But when a human does the continuing, the perplexity of the next tokens will be higher still.

Normalizing the observed perplexity by the perplexity of a machine processing the same text yields a metric that is almost independent of the prompt. To this end, the score $B$ is computed: the ratio of log-perplexity to log cross-perplexity.

$\begin{equation} B_{M1,M2}(s) = \frac{\log \text{PPL}_{M1}(s)}{\log \text{X-PPL}_{M1,M2} (s)}\end{equation}$

The numerator measures how surprising the string looks to model $M1$. The denominator measures how surprising $M2$’s token predictions are to $M1$. If the text was written by a human, it will deviate from $M1$’s expectations more than $M2$’s predictions deviate from $M1$’s expectations.

For the capybara-astrophysicist text, the score $B$ came out at 0.73, below the 0.901 cutoff. The Binoculars method confidently identifies the text as machine-generated.

The researchers recommend taking similar LLMs as $M1$ and $M2$. For their experiments and benchmarks they used Falcon-7B ($M1$) and Falcon-7B-instruct ($M2$).
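
Combining the two helpers sketched above gives the full score. The 0.901 cutoff is the one quoted for the capybara example; the rest of this snippet is our assumption rather than the authors’ reference implementation (which is linked at the end of the article):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# M1 and M2 as recommended: two similar models from the same family.
m1 = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
m2 = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")
tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")  # shared tokenizer

THRESHOLD = 0.901  # scores below the cutoff are treated as machine text

def binoculars_score(text: str) -> float:
    # B = log PPL_M1(s) / log X-PPL_{M1,M2}(s), as in the formulas above.
    return (log_perplexity(text, m1, tok)
            / log_cross_perplexity(text, m1, m2, tok))

def looks_machine_generated(text: str) -> bool:
    return binoculars_score(text) < THRESHOLD
```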

Benchmarks

In their paper, the authors of Binoculars argue that false positives, cases where human text is classified as machine text, matter most. They therefore focus on the true-positive rate (TPR) at a low false-positive rate (FPR), fixing the FPR threshold at 0.01%.
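
For reference, here is one way such a metric can be computed from detector scores with scikit-learn. This is a sketch with illustrative names; it assumes scores are oriented so that higher means “more likely machine” (for Binoculars one would negate $B$, since lower values there indicate machine text):

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true: np.ndarray, y_score: np.ndarray,
               target_fpr: float = 1e-4) -> float:
    """Highest true-positive rate reachable while the false-positive
    rate stays at or below target_fpr (0.01% by default)."""
    # y_true: 1 for machine text, 0 for human text.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    feasible = fpr <= target_fpr
    return float(tpr[feasible].max()) if feasible.any() else 0.0
```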

The researchers drew their benchmarks from three sources. A significant contribution came from the ready-made datasets OpenOrca and M4 (arXiv:2305.14902).

Another source is the data devised by the authors of Ghostbuster (arXiv:2305.15047):

  • The /r/WritingPrompts subreddit. The community might have been made for such research: some participants propose an idea for a story, and others write a short story in the comments. The subreddit, however, is more than 13 years old and arose long before the boom of transformer-based LLMs. To avoid data contamination (benchmark data leaking into a training dataset), only data from October 2022 onward, after the release of ChatGPT, was taken. Based on these prompts, gpt-3.5-turbo wrote its own stories.
  • News articles from the Reuters 50-50 dataset (DOI:10.24432/C5DS42), used in a 2006 paper (DOI:10.1007/11861461_10). The dataset contains 5,000 articles by 50 journalists. ChatGPT was asked to compose headlines for the articles and then, separately and without the original in its context window, to write new articles from those headlines.
  • High-school and college-level essays in a variety of disciplines, taken from the homework-help service IvyPanda. For each essay, ChatGPT derived a prompt from the original text and then wrote a new essay from that prompt.

The authors of Ghostbuster published their datasets in a GitHub repository, so the Binoculars evaluation simply took the same data.



Detection quality as a function of document size for (left to right) news articles, /r/WritingPrompts data and IvyPanda essays. The more text there is, the more accurately the detectors respond. Binoculars outperforms Ghostbuster at small token counts

In addition, the Binoculars authors compiled datasets of their own. They took human-written texts from CCNews (DOI:10.18452/1447), PubMed (DOI:10.1609/aimag.v29i3.2157) and CNN (arXiv:1506.03340), then had LLaMA-2-7B and Falcon-7B write alternative versions of the articles. The first 50 tokens of each sample were fed to the LLM as a prompt to obtain 512 response tokens; the human-written prompt was then removed from the text.
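
A sketch of that construction step: the 50-token prompt and 512 response tokens follow the description above, while the sampling settings and helper name are our assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # LLaMA-2-7B, one of the two models used
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)

def machine_version(human_article: str) -> str:
    # The first 50 tokens of the human text serve as the prompt.
    prompt_ids = tok(human_article, return_tensors="pt").input_ids[:, :50]
    out = lm.generate(prompt_ids, max_new_tokens=512, do_sample=True)
    # Remove the human-written prompt, keeping only the generated part.
    return tok.decode(out[0, prompt_ids.shape[1]:], skip_special_tokens=True)
```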

On these three datasets, Binoculars was compared against GPTZero, Ghostbuster and DetectGPT. The ChatGPT-based datasets are more familiar territory for GPTZero and Ghostbuster, since those detectors are built to detect ChatGPT. Likewise, DetectGPT, by its nature, is predisposed to perform better on LLaMA.



Comparison of Binoculars with competitors on texts generated by LLaMA-13B. As you can see, Ghostbuster is only able to detect ChatGPT

Indeed, Binoculars surpasses Ghostbuster outside the latter’s “home” domains. The Binoculars authors note that this scenario is the most realistic, as it includes data absent from Ghostbuster’s training datasets.



Detection of ChatGPT-generated text across the domains of the M4 dataset. Horizontal axis: recall, the share of positive cases that were detected. Vertical axis: precision, the share of positive verdicts that were correct

The authors also tested texts by people for whom English is not a native language. A dataset of essays from the EssayForum website, containing both the original essays and grammar-corrected versions, had already been collected and published. Unlike some detectors, Binoculars maintains high accuracy on both the essays with errors and the corrected versions.



Binoculars treats both versions of essays as human text

The share of Binoculars false positives stays low even for languages poorly represented in Common Crawl, the corpus on which LLMs are often pretrained. However, Binoculars frequently mistakes machine text in these languages for human text. Presumably, if the current version of Binoculars were built not on two Falcon-family models but on something more powerful, performance in these languages would improve.

However, as the researchers admit, memory limitations of their GPUs prevented them from running larger models (30 billion parameters and up).



Binoculars shows high precision for Bulgarian and Urdu but low recall for all four languages, including Russian

There also remains the question of what to do with memorization in language models. Any such system will label the text of the US Constitution as LLM-written, although it was in fact composed by the delegates of the Philadelphia Convention, not by some language model in 2024. On the other hand, for anti-plagiarism systems this is exactly the desired behavior, and on the whole it is mostly a matter of how the tool’s interface words its verdict.

Conversely, a sequence of completely random tokens will be recognized as human text even though a machine produced it: obviously, no LLM would generate random tokens.

None of the experiments involved program code. The Binoculars authors also caution that they did not study ways of evading such detectors.


A preprint of the paper “Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text” is available on arXiv.org (arXiv:2401.12070). The project code is published under the 3-clause BSD license in the GitHub repository github.com/ahans30/Binoculars. A web demo of the detector runs on Hugging Face at huggingface.co/spaces/tomg-group-umd/Binoculars.
