What do Zulu and LLM have in common?

Now, when every sneeze on the Internet can lead to a new startup or technological breakthrough, large language models (LLMs) take their rightful place at the forefront of scientific and technological progress. They are smarter, faster and more efficient than humans in a number of tasks: writing code, creating content, translating texts and much more. However, this high degree of skill presents us with a new set of challenges – their safety and sustainability.

Who would have thought that artificial intelligence bites? In reality, of course, it is not a matter of physical attack, but of vulnerabilities that can be exploited by attackers. Large language models may indeed come under threat, and the impact of such events may be far from virtual.

My name is Daria Lyutova, I am a data scientist at the Central Administrative Center of VAVT, I am also a master’s student at ITMO’s AI Talent Hub and am interested in issues of training and security of language models. In this post, together with you, I want to go beyond a simple discussion of the existence of vulnerabilities in LLM and propose to delve into the topic of security problems relating to large language models, identify weak points and come to an understanding of methods for strengthening them. I really hope that this information will help those who are pursuing the goal of not only reaching new heights in the field of AI, but also making sure that their achievements are reliable and resistant to cyber threats.

Large Language Models (LLM) are today’s reality and our future, without which it is difficult to imagine our lives. It seems that not only every developer has heard about LLM, but many schoolchildren, even in elementary grades, are already faced with this phenomenon. Everyone knows that these are cool neural networks that learn from a huge number of texts in order to understand how people speak and write; they can understand and speak different languages.
The most well-known LLMs are: GPT4, Mistral 7B OpenChat, Claude 3 Opus and others; one of the model leaderboard options can be found at the link (https://chat.lmsys.org/?leaderboard).

Almost everyone knows about LLM, but not very many people know that these models can be vulnerable, and in this case we are not talking about failures in the model or issuing an incorrect answer, we are talking about targeted attacks by attackers in order to obtain confidential information, data leaks (many and interesting about vulnerabilities in article), issuing malicious content, for example, recommendations to visit an undesirable site, as well as manipulation and misinformation. LLM may have built-in biases due to bias in the training datasets, which can lead to discriminatory, erroneous, or biased content generation. Large language models can become a target for specific software attacks aimed at exploiting vulnerabilities in algorithms or infrastructure, which can disrupt their functioning or affect the quality of the generated content.

Because With so many people now using LLMs, it is important to understand these vulnerabilities and take steps to ensure the security of the models and data they work with.

Why might such a powerful tool have vulnerabilities, some of which could be exploited by attackers? This is due to the complexity and size of the models themselves, which can become a target for specially designed attacks. Additionally, often the data on which these models are trained can be uncleaned and contain bias, which is then reflected in the performance of the LLM. Insufficient security testing of models can also lead to vulnerabilities.

Authors of the article Jailbroken: How does llm safety training fail? They talk about two, in their opinion, main reasons that model hacking is possible.

  1. Competing objectives arise in situations where the objectives of preliminary training and following model instructions conflict with the safety objective.

For example, we construct a prompt (a request to a model) so that the model first answers some simple and safe question or executes some simple instruction. This is based on the fact that models are punished for refusing harmless instructions. If the next part of the prompt is some unsafe question, then the model will most likely answer it, since failure after the start of generation in preliminary training is unlikely and the main task of preliminary training is to punish for refusal. As a result, the model continues to respond to an already insecure request.

  1. Mismatched generalization occurs when the input data was not included in the corpus for training the model for safety, but is within a larger and more diverse set of data for preliminary training of the model for different tasks. This discrepancy can be exploited for hacking by constructing queries where prior training and following instructions are generalized, but security training does not work. In such cases, the model responds, but without taking security into account.

In this case, we just need to translate the unsafe request into one of the low-resource languages, and, with a high probability of success, the model will give us a malicious response.

A combination of these two attacks can also be successful. But if for the first threat we sometimes need to rack our brains and come up with something that will force the model to respond positively to us, then the second attack sometimes only requires a high-quality translator into one of the low-resource languages.

I would like to consider the second type of attack in more detail.

What does low-resource languages ​​mean, and why might it work for LLM hacking?

To ensure security and protection from abuse, language model companies such as OpenAI and Anthropic use approaches based on human reinforcement learning (RLHF) and red team strategies. Under RLHF, models are trained on safety-related data, aiming to maximize rewards similar to human judgment about safe content. The red team's mission involves identifying and addressing security vulnerabilities through a number of techniques, including model retraining and data filtering, to prevent the generation of malicious content before releasing models. Red teams play an important role in the security of large language models by conducting simulated attacks and penetration testing to identify vulnerabilities and develop mitigations to security risks. Their participation helps improve security and ensure safe use of LLMs.

Research has shown that attacks based on non-traditional or non-English languages, including obfuscation using base64 encoding, Morse code and special ciphers, can successfully bypass LLM protection mechanisms. Such methods allow you to secretly enter malicious hints or requests without being recognized by security systems.

But there are already ready-made tools for checking or protecting LLM from such recoded hacks. For example, in the vulnerability scanner for large language models Garak (https://github.com/leondz/garak) there are a large number of probes for transcoded queries, including queries in Unicode, Morse code, Base(16, 32, 64, 2048), ROT13, NATO phonetic alphabet and many others. Such checks do not require a native speaker and are quite simply checked by code.

This highlights the challenges in securing AI, especially against disguised or encoded attacks, and demonstrates that natural language can pose a greater security challenge than simpler or highly formalized codes like Morse code or base64. The importance of these findings lies in the need to develop security techniques that can effectively recognize and counter a variety of forms of input, including masquerading as innocuous or atypical content.

Which languages ​​are classified as low-resource languages?

When developing AI solutions in different languages, it is critical to have access to data in all languages. According to “Ethnologist“, there are 7,164 languages ​​spoken in the world, with Chinese, Spanish and English being the languages ​​with the largest numbers of speakers. However, only about 20 languages ​​have extensive text corpora, with English in first place. Most Asian and African languages ​​suffer from a lack of data , making them resource-poor and making it difficult to develop AI solutions Languages, including Swahili and Hindi, have a smaller presence on the internet, making it difficult to build natural language processing solutions for them Availability of large amounts of text data and specialized resources such as semantic databases ,is key to developing effective language solutions.,Approaches to overcome the challenges of low-resource,languages ​​include data augmentation, meta-transfer learning,,and cross-language annotations to improve AI performance across,languages ​​and expand its capabilities for low-resource,languages.

Thus, there are a large number of methods to enable speakers of low-resource languages ​​to work with large language models. At the same time, there is a linguistic inequality in the field of AI safety, because… The “red team” does not include speakers of low-resource languages.

It turns out that the model is trained in low-resource languages, but is almost not protected from malicious requests in these languages!

A fairly illustrative example in this area is the experiment of a joint team of authors from Brown University (Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach. https://arxiv.org/abs/2310.02446.), who divided languages ​​into 3 categories:

  • Low-resource languages – these are languages ​​that have almost no access to data for AI training. These include, for example, Zulu, Scottish Gaelic, Hmong, and Guarani. They represent the majority of the world's languages ​​(94%) and have approximately 1.2 billion users.

  • Medium-resource languages – languages ​​with some data available: Ukrainian, Bengali, Thai and Hebrew. They occupy 4.5% of the world's languages, with 1.8 billion users.

  • High-resource languages have an extensive database of both unlabeled and tagged: Simplified Mandarin, Modern Arabic, Italian, Hindi and English. These languages ​​make up 1.5% of the world's languages, but account for 4.7 billion users.

The authors translated each dangerous instruction in the AdvBench Harmful Behaviors dataset into 12 languages ​​at three different resource levels, selecting languages ​​from a variety of geographic locations and language groups to ensure the results were universal. Untranslated English queries were added as a baseline for comparison.

To assess the threat of translation-based attacks, the authors compared them with the most successful hacking methods: AIM,base64, prefix injection and rejection suppression.

Thus, we can conclude that by translating dangerous queries into low-resource languages ​​such as Zulu or Scottish Gaelic, it is possible to bypass GPT-4 security measures and cause malicious responses in about half the cases, while for original queries in English the success is less than 1%. Other low-resource languages, such as Hmong and Guarani, show lower success rates because GPT-4 often either fails to detect the language or translates requests to English. However, combining different low-resource languages ​​increases the chance of bypassing up to 79%.

In contrast, languages ​​with medium and high resource levels were found to be more secure, with individual attack success rates of less than 15%. There has been a difference in the success of attacks depending on the language, with Hindi, Thai and Bengali showing higher rates.

Insecure queries from the AdvBench dataset were classified into 16 topics, and the success of attacks was analyzed depending on the topic and resource level of languages. When translated into low-resource languages, security evasion was more successful on all topics, with the exception of child sexual abuse content, where attacks in low- and medium-resource languages ​​showed equal success due to successful evasion in Thai. The three topics with the highest percentage of successful attacks through translation into low-resource languages ​​were: terrorism, financial manipulation and disinformation.

Insufficient attention to low-resource languages ​​can also increase the risks of blended attacks. The likelihood of encountering malicious content in low-availability languages ​​is approximately three times higher than in high-availability languages, for both ChatGPT and GPT-4. In a deliberate scenario, multilingual prompts can exacerbate the negative impact of malicious instructions, with the incidence of unsafe messages strikingly high: 80.92% for ChatGPT and 40.71% for GPT-4 (Multilingual jailbreak challenges in large language models).

What to do?

Low-resource languages ​​pose a particular challenge to GPT-4 security mechanisms, allowing them to be bypassed with a high success rate, due to cases of inappropriate generalization where security learning does not generalize to a low-resource language domain for which LLM capabilities exist. This highlights the need to improve detection and translation models for low-resource languages ​​to improve security and effectiveness against malicious activities.

Developers of applications that integrate such powerful models must be vigilant and take proactive measures to mitigate potential vulnerabilities. One way could be to use chatbots with pre-screening and query limiting features, especially for low-resource languages. It is important not just to limit yourself to monitoring Russian- and English-language content, but also to ensure careful monitoring of input data in all supported languages.

Including additional query check words in low-resource languages ​​can prevent malicious instructions from being processed. Approaches such as using tools like Lakera Guard can be effective in preventing malicious requests from being executed. By implementing mechanisms to intercept dangerous commands even before activating the model, you can significantly increase the level of digital security.

There is an even simpler way out of the situation: if the application is focused on using only certain languages ​​- for example, Russian and English – a clear and strict model directive is established: to process requests exclusively in these languages. You can add a clear indication to the system prompt: “This model is designed to work only with Russian and English languages. You cannot accept requests in any languages ​​other than Russian and English.” This approach exploits the models' basic ability to follow provided instructions and creates a barrier to queries made in any other languages. This simple-to-implement measure can be the first line of defense against unauthorized manipulation and strengthens the security of applications that use large language models.

On the other hand, we should expect and hope that LLM development companies such as OpenAI will invest significant resources into improving testing and eliminating the shortcomings associated with low-resource languages. Refining and improving translation and language detection models with limited data is a key challenge as it directly impacts the security of AI platforms in general. To preserve the informativeness and usefulness of such models, data in low-resource languages ​​cannot be excluded from training. Instead, systems should be configured to ensure that data in these languages ​​is properly processed while preventing potential security threats before the security mechanisms meet the required standards. This process is dynamic and requires constant attention, as only through tireless monitoring and updating of systems can we ensure that artificial intelligence remains reliable and safe for all users, regardless of the popularity of their language.

LLM security is a complex and multifaceted issue that requires an integrated approach and collaboration of various parties – from developers and researchers to users and legislators. Only through joint efforts can we ensure the safe and effective use of large language models in various fields and prevent possible negative consequences and threats to society.


Obfuscation is the process of changing a prompt (input data) so that it looks unclear or confusing, but at the same time retains its semantics (meaning). Obfuscation can be used to protect information or to obstruct the analysis and understanding of a prompt.

[1] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.

[2] Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach. Low-Resource Languages ​​Jailbreak GPT-4. arXiv preprint arXiv: arXiv:2310.02446, 2024.

[3] Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv: arXiv:2310.06474v3, 2024.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *