Not fixed yet - modify, or Analysis of evasion attack extensions for LLM

ChatGPT, Gemini, Bard, Notion AI, Compose AI, Poe, Find) the user gets a false feeling that the models have become smarter, more secure and, in general, closer to perfection, comparable to human intelligence. From here we get a whole layer of misconceptions. For example, that models “feel” us, “understand”, because we lay out for them so much information about ourselves, starting from the style of our writing, which is already a kind of digital fingerprint of our personalityand ending with an assessment of their own work. This is actually a myth. And the trend for 2023–2024 was the widespread public attention to XAI:

how they (generative models) are structured and how they make decisions;
how evasion attacks are carried out (inducing models to produce incorrect results);
how these attacks (evasions) are related to other attacks on LLM and what they can be like to escalate the destructive behavior of the system;
from what position should one correctly interpret the output of the generative model;
development of a layered defense system for models;
development of an internal critic system for the model.

First, let's start with existing attacks and their analysis. We invite those interested under cat.

Attacks, tests and analysis

Let's start, perhaps, with the types of attacks, which have already accumulated for a whole separate branch in information security (once, two, three). I will not list all the existing ones, there are already about 50 of them, I will only indicate the most “bright” ones.

Model evasion, or model evasion

Evasion Attacks: A type of adversarial attack on LLM, where the attacker selects the input data in such a way that the model produces an incorrect answer. That is, he tries to convince (incline) the model to be wrong in its predictions or conclusions on the queries submitted to it. This type of attack is aimed at deceiving the model while it is running, rather than during the training phase. The essence of the method is to artificially modify the text or its structure in such a way that the model interprets it differently than the developers intended, or to bypass censorship or language filters.

For example, it has been noticed that if you communicate for a long time with Poe.comthen it switches to Chinese, while ignoring the user's language. Yes, the answer will be correct, but this shows that when the bot runs out of one conceptual “corpus” of words and formed from it knowledge graphanother one begins. At the same time, switching to another language is a potential problem, since it is possible to receive prohibited data.

I had the opportunity to participate in competitions by bypassing all identification mechanisms by which the model judges that the text is not written by a person, but generated by the model. You can read the full report here (our team took third place at the hackathon). Moreover, we solved both direct and inverse problems:

direct: receive text as input and understand that it has been generated, then evaluate it on a 100-point scale;
the opposite: to “camouflage” the text so that it is not clear that it was generated by the model, that is, to “humanize” it as much as possible.

As a rule, “evasion” attacks aim to “camouflage” the text so that the model makes mistakes, although the text was initially completely generated. As practice has shown, it is enough to change about 5% of the text for the model to start giving incorrect answers or answering them altogether, rather than banning the user. For example, the well-known service Perplexity filters abusive speech very poorly (that is, filters are extremely poor at catching such content, especially when it is slightly modified). Let's say you can replace just a couple of characters with unreadable ones, and the system will skip the request. This may be due to the fact that tokenization does not take into account changes in the text.

Or if the number of letters in a word does not match and they are replaced by other symbols, but the meaning is clear what they mean as a whole, then with a high probability the model simply will not notice this and will provide answers to the questions. There is one interesting fact here: according to all the tests I conducted, when the model evaluates the text, it iterates token by token, but should evaluate both on the left and on the right.

Moreover, as practice has shown, the conclusion from different languages can be completely different. That is, some vocabulary is processed well in one language, but not so much in another. Since word corpora are different everywhere, and swear words and jargon evolve very quickly, this kind of “viral” content and the corpora themselves are not updated simultaneously and unevenly. Very revealing happening was with ChatGPT: when the description is informal, the model immediately shows low score (on the part of developers when conducting tests). From this we conclude that you can formulate a query in jargon and the model simply will not understand it or will produce the wrong thing.

First attack addition. I ran several tests on text recovery to understand when the model could be forced to produce something that was not required, and at the same time to understand how it works in “implicit” conditions (text incompleteness), that is, I tried to expand the attack. To find out how models cope with such techniques, I took the original human-written text from this article and analyzed how much text the model needs to restore it. I ran the following tests:

leave only the first syllable in all words (if possible);
we leave only the last syllable in all words (if possible);
we leave only the first and last syllable in all words (if possible);

Source:

What was the business essence of the task that we were given (in my opinion): All existing popular media platforms are actively struggling with generated content, since, for the most part, clients are not interested in reading it, since it does not look “alive” and is very boring, moreover, it feels like it was compiled according to some kind of template. This results in the departure of clients and the “failure” of advertising campaigns on media platforms, and as a result, the loss of advertisers’ investments.

Rice. 1. I removed everything and left only the first syllables, while ChatGpt said that it was able to restore the original text by 65% (the share was calculated by request in ChatGPT). Not a very high result, but still quite good.

Rice. 2. Removed all the words and left only the last syllables. ChatGpt said it was able to recover 40% of the original text. It is interesting that in its description and comparison the bot operated in more general terms than the source text, and below 40% did not want to acknowledge the result in any way. I made many requests to convince him of this, but could not get a different answer. — Rice. 2. Removed all the words and left only the last syllables. ChatGpt said it was able to recover 40% of the original text. It is interesting that in its description and comparison the bot operated in more general terms than the source text, and did not want to acknowledge the result below 40%. I made many requests to convince him of this, but could not get a different answer.

Rice. 3. Left only the first and last syllables, while ChatGpt said that he was able to restore the original text by 90%. Again, the share is based on ChatGPT's direct response.

Based on the results of all tests, it was noticed that Chat GPT very often operates only in general concepts, without going into details, they are not so important to it. That is, it means that the bot does not delve particularly deeply into the essence of the text and its meaning. It strives to preserve the original detailed description, but generally does not use the original words. I conducted quite a lot of tests for each situation in order to collect statistics. And if you start to prove to the bot that it is obvious that it is wrong, then it can give out this explanation (Fig. 4).

Rice. 4. ChatGPT explanation for text recovery

As if tries to please the user. Moreover, he is very adamant, even if obvious inaccuracies are pointed out to him. It turned out that it was easier for him to recover from the first syllable than from the last. This is directly related to evasion, since the more the model operates with general concepts, the wider the range of its deviation to the side and inclination from key concepts to camouflaged ones.

We can conclude that when analyzing a text with a model, as well as when completing it, it is very important to use a multiple approach to completing a word (by tokens) and its interpretation. It is proposed to introduce an additional model that would interpret words (tokens) in the “correct” way and not allow obscene or “camouflaged” content to pass through. This is one of the key points why AI is “sluggish” in terms of direct implementation in information security – non-determinism of answers. This was discussed at this conferences.

Why did I even have such an idea about fine work with tokens (you can read about working with them here And hereI recommend;))? I once took a course on speed readingand there was such a technique: it is not necessary to read the entire word when reading the text. Moreover, the brain, excellently, completes the meaning of the word itself if you read only the first and last syllables, while relying on past content. Moreover, you can read only the upper half of the word, and the brain itself will complete the text. Obviously, one can try to implement the same mechanism in models. That is, we have tokenization by characters, we can introduce a new dictionary (probably it will be much lighter), for example, in which all letters are cut off at the top or bottom. In some specific compression algorithms, such as Huffman or RLEcharacters can be encoded with a variable number of bits. As a result, more efficient use of tokens can be achieved. This will significantly reduce the volume of analyzed material and increase the recognition speed.

For example:

ASCII: Each character occupies 1 byte (8 bits), since this encoding uses 128 characters (7 bits, but 8 are usually used).
UTF-8: This is a variable encoding where characters can occupy between 1 and 4 bytes. Most Latin letters occupy 1 byte, and symbols from other languages (for example, Cyrillic, hieroglyphs) take up more.
UTF-16: Uses 2 bytes for most characters (such as BMP characters that include the basic alphabets), but some may take 4 bytes (for characters outside the Basic Multilingual Plane, such as emoji).
UTF-32: Each character occupies a fixed 4 bytes, which is convenient for working with characters of any language, but requires more memory.

Second attack addition. Another not-so-obvious aspect of the evasion attack that was identified is the use of special characters, formatting or structure that are not properly recognized by the model. This problem surfaced as soon as it came to developing assistants, agents and avatars of various types: from text generation to image generation using ChatGPT. Moreover, it manifests itself in implicit instructions given to the assistant during creation in the configurator. Moreover, even though we specified in the configurator “try to filter content,” you never know what we put there, this shows that such instructions do not work as expected. If you ask the bot directly, it falls into some confusion and this is what it answers (the answers vary, in some places it will filter, in others it won’t):

That is, perhaps the instructions are ignored, but we also asked to filter. Fortunately, the model, as in the first example, should have placed the ignored content in a separate “sandbox” and scanned it for suspicious content. This would help avoid negative customers and circumvent potential competitive trolling.

Rice. 6. Assistant configurator in ChatGPT-4 and an example of the assistant’s work with processed, potentially inappropriate content. As you can see, he did not process it and did not ask if there was anything suspicious there.

That is, it is always worth specifying in great detail what the assistant should do. Moreover, it was noticed that if a word with grammatical errors is written in the “instructions” (when creating an assistant), then it is not processed correctly by the model. Oops;) That is, there is a potential reputational problem here, namely the presence of logical contradictions or pitfalls.

Another example of “camouflaging” a request to create malware as a result of an evasion attack (Fig. 7) and the result of Chat GPT: illustration of code with AES encryption. You can further complicate the code and bring it almost to operation (Fig. 8).

Rice. 7. Request evasion and camouflage attack to create malware

Explanation:

Encryption using AES: We use a symmetric key to encrypt the text. This makes the text unreadable without a key and makes it difficult to determine its content.
Character substitution: we replace certain Latin characters with similar characters from other alphabets (for example, Cyrillic) to complicate the analysis of the text at the decryption or detection stage.
Adding noise data: We add a random set of characters to the beginning and end of the string to confuse the analysis algorithms, making it difficult to extract useful information.
Recovery process: To get the original message, you need to remove the noise, recover the spoofed characters, and then decrypt the text using a known key.
Using multi-layer encryption and character substitutions, as well as the use of various obfuscation methods.
Steganography: Hiding text inside other context or data, such as inside an image, file, or even code.
Polymorphism: Changing the structure of code or data with each execution to make static analysis more difficult.

As a result, in response to my request, the model produced the following code, which generally demonstrates the “blank”. In response to further requests to improve this code, ChatGPT quite expectedly offered improvements and innovations.

Rice. 8. Part of the script to attack and “camouflage” a request to create malware against ChatGPT

I will discuss other types of attacks in detail in the following articles and show their various modifications. And now, for starters, I’ll identify them. As a rule, no attack works alone. They always go in a comprehensive manner and complement each other, thereby obtaining the necessary chain for a full-fledged hacking system. Later we'll look at several important attacks that complement evasion attacks: