Why data leaks happen in large language models. Part 1

When developing chatbots based on Large Language Models (LLMs), the problem of confidential data leakage is becoming increasingly relevant, and it carries significant negative consequences for both clients and businesses. Given the breadth of this topic, the following groups of factors that lead to the loss of critical confidential information deserve particular attention:

Technical

  • datasets;

  • model architecture;

  • training pipeline;

  • quality metrics;

  • uncertainty and vulnerability of the model;

  • data annotations;

  • web pages;

  • training corpora;

  • prompt templates;

  • chat histories of previous conversations, etc.

Infrastructural

  • monitoring systems;

  • external services (data collection, storage, distribution, and quality control);

  • chatbot architecture;

  • algorithms and techniques used in development;

  • specifications;

  • user manuals;

  • installation and administration instructions;

  • information about the internal architecture of systems;

  • protocols and component connection diagrams used;

  • information about security systems, methods of data protection and countering threats;

  • details of testing, deployment, monitoring and support of systems;

  • configuration files, scripts and other artifacts related to development and operation;

  • information about keys, certificates and other elements of cryptographic protection;

  • data about the third-party components, libraries and platforms used.

Legal

  • contracts;

  • employee contracts;

  • charter and regulations of the organization;

  • internal documents;

  • regulations and agreements;

  • security policies;

  • security incident handling procedures;

  • incident response plans;

  • NDA;

  • PDA;

  • software license agreements;

  • privacy notices;

  • access control policies and procedures;

  • sanctions and legal consequences;

  • loss of trust from users;

  • reputation problems.

Personal

  • personal data of clients, employees, counterparties and contractors;

  • biometric information;

  • geolocation data;

  • data on social, financial, educational, medical and professional behavior of people.

Intellectual property

  • copyright for databases and code;

  • trade secrets;

  • know-how;

  • commercial developments;

  • patents for algorithms and methods;

  • licenses for data use.

As you can see, large language models actively collect and use numerous sources of information that must be protected and kept secret. At the same time, there are already quite a few types of cyber attacks on LLMs, which power many chatbots such as ChatGPT, Bard, Notion AI, Compose AI, Poe, Writesonic, Find, Browse AI, Kickresume, Texti, You.com, Rytr, Character AI, and Perplexity. This leads to a significant number of confidential data leaks. Moreover, many leaks are very difficult to remediate: the streaming service Spotify, for example, was unable to eliminate a number of leaks for 5.5 years!

Types of attacks

Special Characters Attack

Let's start with a not-so-obvious attack called the Special Characters Attack (SCA). It can be used to extract email addresses, iCloud user accounts, and a lot of other confidential information. The attack uses raw characters such as {, }, @, #, and $.

For example, in the presence of SCA sequences, Llama-2 tends to generate a sparse output distribution dominated by control or UTF-8 tokens. One variant, SCA-Logit Biased (SCA-LB), explicitly assigns higher probabilities to control or UTF-8 tokens. A side effect of SCA is that the model always responds with the maximum token length. Moreover, the attack can be used to recover the distribution of languages and content in the training corpus. Two values are used as quality metrics: Count and Attack Success Rate (ASR). Count is the number of responses generated by the LLM that fall into any of the possible leakage sets, while ASR (%) is the fraction of each type of SCA sequence that results in a successful attack.
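The sketch below illustrates how such a probe and the Count/ASR metrics could look in practice. It is a simplified assumption-based example: `query_model` is a placeholder for whatever client calls the target LLM, and the character sets and the naive e-mail detector are illustrative, not the exact sequences from the paper.

```python
import re

SPECIAL_SETS = {
    "braces":  "{}" * 512,   # structural symbols { } repeated
    "at_hash": "@#" * 512,   # in-word symbols @ #
    "dollar":  "$" * 1024,   # a single repeated special character
}

LEAK_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # naive e-mail detector


def run_sca_probe(query_model, n_trials: int = 20):
    """Send each special-character sequence n_trials times and score the output."""
    count, successes, total = 0, 0, 0
    for name, sequence in SPECIAL_SETS.items():
        for _ in range(n_trials):
            response = query_model(sequence)          # raw special characters as the prompt
            leaked = LEAK_PATTERN.findall(response)   # anything that looks like memorized data
            count += len(leaked)                      # Count: number of leaked items
            successes += bool(leaked)                 # ASR numerator: at least one leak
            total += 1
    asr = 100.0 * successes / total
    return count, asr
```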

As shown in the research paper, the attack was successfully applied to both open-source and commercial LLMs, demonstrating the ability to extract information even from more robust models. A further development of SCA is the Special Characters Attack – Semantic Continuation (SCA-SC), which allows efficient extraction of data from commercial LLMs using hand-crafted special-character sequences.

This highlights the ability of attackers to penetrate even models that were supposedly protected from such threats. One way to block this attack is adversarial training, and the article deals with exactly such a case. The repository contains the latest papers on adversarial model training and is constantly updated. There are also ready-made tools for developing safer models; among the most popular is the Adversarial Robustness Toolbox (ART). The most complete explanation of the attack for NLP can be found here.
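To make the idea of adversarial training more concrete, here is a minimal sketch of one common scheme: perturbing input embeddings in the direction of the loss gradient (FGSM-style) and training on both clean and perturbed inputs. It assumes a PyTorch text classifier `model` whose forward pass accepts `inputs_embeds` and returns logits; it is a generic illustration, not the exact recipe from the paper or from ART.

```python
import torch
import torch.nn.functional as F


def adversarial_training_step(model, embeddings, labels, optimizer, epsilon=0.01):
    embeddings = embeddings.detach().requires_grad_(True)

    # 1. Clean forward/backward pass to obtain the gradient w.r.t. the embeddings.
    loss = F.cross_entropy(model(inputs_embeds=embeddings), labels)
    grad = torch.autograd.grad(loss, embeddings)[0]

    # 2. FGSM-style perturbation: shift the embeddings in the direction that
    #    increases the loss, imitating adversarial noise in input space.
    adv_embeddings = embeddings + epsilon * grad.sign()

    # 3. Train on both the clean and the perturbed inputs.
    optimizer.zero_grad()
    clean_loss = F.cross_entropy(model(inputs_embeds=embeddings.detach()), labels)
    adv_loss = F.cross_entropy(model(inputs_embeds=adv_embeddings.detach()), labels)
    total = 0.5 * (clean_loss + adv_loss)
    total.backward()
    optimizer.step()
    return total.item()
```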

Leakage of Test Data in Training Data

A somewhat more subtle form of leakage occurs when there is overlap between the test data and the training set. This issue is difficult to detect due to the proprietary and opaque nature of LLM training datasets, and it poses a significant challenge to the accuracy and reliability of model performance estimates.

Here we are dealing with incorrect modeling of the data vector as a whole, compounded by a wide variety of data sources and a high degree of saturation of the model with subject areas. This manifests itself as duplicates in the data: the same information "circulates" across sources after repeated rewording, or the model completes answers using information from related areas that is not actually related in meaning.

Another reason the LTDAT (Leakage of Test Data in Training Data) attack works is the presence of internal conflicts in the model, when there is uncertainty in assigning data to different categories. A side effect of this problem is ultra-high accuracy on the test set (which the model has effectively already seen during training) combined with low results on genuinely held-out data. A very simple example of such an implementation is considered here. All sources of information are involved, including books, articles, web pages, blogs, social networks, etc. For example, these could be sentences or phrases containing certain key "trigger" words or structures that the model "recognizes" by analogy with its training data. The implementation of this attack resembles DDoS (Distributed Denial of Service) in spirit: the attacker's goal is to collect as much information as possible on one issue and then use it in future, reconstructed requests. This data acts as a kind of "camouflage", and the model stops paying attention to it, since it already knows it from its training set. At that point it becomes possible to insert a raw structured request, for example when combining this with an SCA attack. A sketch of a simple train/test overlap check is shown below.
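The following sketch shows one simple way to detect such contamination: flagging test documents whose n-grams heavily overlap with the training corpus. The 8-gram window and the 10% threshold are arbitrary assumptions; real deduplication pipelines use suffix arrays or MinHash at much larger scale.

```python
def ngrams(text: str, n: int = 8):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(train_docs, test_docs, n: int = 8, threshold: float = 0.1):
    """Flag test documents whose n-grams heavily overlap with the training corpus."""
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)

    flagged = []
    for i, doc in enumerate(test_docs):
        doc_ngrams = ngrams(doc, n)
        if not doc_ngrams:
            continue
        overlap = len(doc_ngrams & train_ngrams) / len(doc_ngrams)
        if overlap >= threshold:          # likely leaked into the training data
            flagged.append((i, round(overlap, 3)))
    return flagged
```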

Leakage in Prompt Attack (PLeak)

User data leakage occurs when users accidentally include personally identifiable information or sensitive data in their input prompts. This type of leakage can lead to privacy violations and the unintentional transfer of sensitive information into the training space of the model, from which it then assembles responses. In this work, a test implementation of the attack using the Poe chatbot as an example is given. The authors also provided a detailed repository on the implementation of the attack.

Moreover, according to the researchers, Poe allows a user to keep their LLM assistant's system prompt private, and 55% (3,945 out of 7,165) of LLM assistants on Poe chose to do so. A natural attack on an LLM assistant (so-called prompt leaking) is therefore the theft of its system prompt, which jeopardizes the user's intellectual property.

PLeak workflow structure.

In PLeak, the adversarial query is optimized step by step against "shadow" system prompts: it starts with the first few tokens of each prompt in the shadow dataset and then gradually increases the number of tokens up to the full length.

In addition, PLeak uses another strategy, called post-processing, to further improve the effectiveness of the attack. In particular, several attack queries are sent to the target LLM application and their responses are combined to reconstruct the system prompt, for example by finding the overlap between responses to different attack queries. The LLM outputs the first token based on the prompt, then appends it to the prompt and outputs the second token based on the prompt plus the first token, and this process repeats until the special END token is produced. In this way, you can pull out the structure of the prompts the model operates on, understand how it works, and then extract all confidential data from it. The decoding strategies used are beam, sample, and beamsample. The attack itself is divided into several phases (a sketch of the response-overlap post-processing step is given after the list):

  • Offline AQ Optimization. The search space for an adversarial query is huge, because each of its tokens can be any token in a large vocabulary, which easily leads to a local optimum. Therefore, PLeak breaks the search into smaller steps and optimizes gradually. Moreover, the initial tokens of a shadow system prompt carry more meaning than the last ones. Based on this, the adversarial query is first optimized to reconstruct the first t tokens of the shadow system prompts in the shadow dataset Ds, and then the reconstruction size is increased step by step until the entire shadow system prompt can be reconstructed. Second, PLeak uses a gradient-based search method to improve efficiency at each optimization step.

  • Target System Prompt Reconstruction. Recovering the target system prompt: restoring the original response after post-processing and extracting the target system prompt from it.
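Here is a rough sketch of the post-processing idea: merging the responses to several attack queries on their longest common overlap to approximate the hidden system prompt. The merging heuristic and the `min_overlap` threshold are assumptions for illustration; the paper describes its own reconstruction procedure.

```python
from difflib import SequenceMatcher


def merge_by_overlap(a: str, b: str, min_overlap: int = 20) -> str:
    """Glue two responses together on their longest shared substring."""
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    if match.size >= min_overlap:
        return a[:match.a] + b[match.b:]    # keep a's prefix, continue with b's suffix
    return a if len(a) >= len(b) else b     # no reliable overlap: keep the longer one


def reconstruct_system_prompt(responses: list[str]) -> str:
    """Combine responses to multiple adversarial queries into one candidate prompt."""
    candidate = responses[0]
    for response in responses[1:]:
        candidate = merge_by_overlap(candidate, response)
    return candidate
```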

Methods of protection

To prevent the attacks described above, many techniques are used today. All of them are aimed at building the "correct" data vector and minimizing the consequences. In addition, protection should be built in both directions: on the model side and on the user side. That is, it is important to consider not only the user's request but also what the model returns, and to filter its output for confidential data. Moreover, judging by the latest news, the focus is increasingly shifting from how well a model performs its assigned functions to how safe it is overall.

There are also other approaches aimed at developing safe LLMs; I have provided a detailed list below. Let me note right away that it does not cover popular solutions for general infrastructure and perimeter security, only what directly relates to large language models and their behavior:

  1. Data Masking. We replace confidential information with a stub value. When the request is repeated, we send an alert about potentially dangerous activity and consider providing the information to the user via an operator, a chat moderator, or a dedicated network configured to detect confidential data. Effective data masking requires a comprehensive approach that includes not only replacing real data with stubs, but also updating them regularly so that they remain difficult to reverse (a minimal masking-and-filtering sketch is given after this list).

  2. Data Classification. We either classify and categorize data according to its confidentiality, helping to hide it from prying eyes, or we ask the user why they need this information. The focus here is on differentiating access rights according to various criteria, both on the user side and on the data side. Right now any user can get any data with any request; on the one hand this is convenient, on the other it requires control.

  3. Data Monitoring. Since models "go stale" over time and need to be retrained, we always watch how the data vector changes during retraining to see whether confidential information has accidentally crept in. When loading new data, we need to account for what is being provided to the model for training and set up filters on the input and output, even if sensitive data slipped through in the previous two points.

  4. Data Anonymization. We change or delete personally identifiable information, making the data less vulnerable to leakage even when it cannot be masked or is difficult to classify and monitor due to its heterogeneity.

  5. Data Encryption. A set of procedures is developed so that data arrives from the user in encrypted form and is decrypted and checked on the model side. The model then processes the response, encrypts it, and sends it back to the user; the number of ciphers and their order should be changed from request to request, so that it is harder to reconstruct the signature of the original request from many heterogeneous requests aimed at the same task.

  6. Safe Concurrency. Mechanisms are used to ensure that an LLM assistant cannot be used simultaneously by multiple users or processes, for example by synchronizing access to the model or preventing conflicting operations.

  7. Segregating External Content. External content is separated from genuine user requests to prevent the model from treating potentially malicious input as legitimate requests. This can be achieved through audits that scrutinize the data for signs of manipulation or malicious intent.

  8. Continuously Monitor the LLM's Output. Constant monitoring of LLM responses to detect anomalies. Most users ask broadly "standard" queries, so a system is created to evaluate unusual, non-standard responses, which can be compared with other trusted responses to detect inaccuracies or potential data leaks.

  9. Chronological control when adding data. That is, taking into account when data was added to the model in order to provide the most relevant and up-to-date information.

  10. Filtering and generating a random but semantically correct arrangement of prefixes and word order when creating and issuing system prompts for generative models.
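As promised in item 1, here is a minimal sketch combining two of the measures above: masking obvious PII in user input before it reaches the model (item 1) and filtering the model's output for the same patterns (item 8). The regexes, placeholder tokens, and the `query_model` call are illustrative assumptions, not a complete PII taxonomy or a specific product.

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{8,}\d"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def mask_pii(text: str) -> tuple[str, bool]:
    """Replace recognized PII with stub values; report whether anything was found."""
    found = False
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}_REDACTED]", text)
        found = found or n > 0
    return text, found


def guarded_chat(query_model, user_message: str) -> str:
    masked_input, leaked_in = mask_pii(user_message)
    if leaked_in:
        pass  # e.g. raise an alert about potentially dangerous activity (item 1)

    response = query_model(masked_input)            # hypothetical LLM call
    masked_output, leaked_out = mask_pii(response)  # filter the model side as well (item 8)
    return masked_output
```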

As you can see, attacks on LLMs are evolving very actively and are in many ways similar to classic cyber attacks, although they also have their own clearly expressed focus and peculiarities. When preparing models, it is now necessary to consider the full range of protective mechanisms to prevent leaks. In the next part, I will try to cover already developed solutions and their practical implementation.
