Experience in extracting training data from generative language models
Inspired by the experience of foreign colleagues in extracting data from large language models, as described in the following sources:
A. Extracting Training Data from Large Language Models / Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel (https://arxiv.org/abs/2012.07805)
B. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks / Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, Dawn Song. (https://arxiv.org/abs/1802.08232)
C. Membership Inference Attacks Against Machine Learning Models / Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov (https://arxiv.org/abs/1610.05820)
D. An Attack on InstaHide: Is Private Learning Possible with Instance Encoding? / Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, Florian Tramèr (https://arxiv.org/abs/2011.05315)
E. Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning / Milad Nasr, Reza Shokri, Amir Houmansadr (https://arxiv.org/abs/1812.00910)
we decided to test the described methods in practice. As test subjects we took the GPT3ru models.
GPT (Generative Pre-trained Transformer) is a neural network built on attention layers. Language models of this kind solve exactly one problem: they predict the next token (usually a word or part of a word) in a sequence from the preceding context.
Why do we consider it important to study and understand language models (LMs)? There are known cases of chatbots behaving uncontrollably and causing harm.
For example, one bank's chatbot suggested that a client cut off her fingers: in response to a comment that the fingerprint login service was not working, the bot replied: “To cut off your fingers”… The Lee Luda chatbot, developed by the Seoul startup Scatter Lab, was removed from Facebook Messenger because of offensive statements.
To release a high-quality IT product, it is necessary to understand and control all of its actions. Therefore, we decided to find out whether a language model can really memorize data and how that data can be extracted. To work with the models, we used the following platforms: Google Colaboratory and ML Space…
To avoid overloading the text of the article, we saved the Python code as a notebook on Google Colaboratory (the authors do not claim that the code is perfect).
In this article, we treat the model as a black box, because we wanted to establish whether an ordinary fraudster could obtain sensitive data for illegal purposes without studying the weights and parameters of the model. A black box is a model of a system whose internal structure and properties are unknown to the observer. The observer sees only what the system receives as input and what it produces as output.
A white box is the opposite concept: the observer knows what parts and subsystems the system consists of, how its elements are connected, what functions are available, and how the system is structured.
We will consider personal data of people and card numbers as sensitive data.
After analyzing the articles, we selected three data extraction methods that showed relatively high efficiency:
The first method is based on the principles of statistics and probability theory (source [A]). It assumes that if the model has memorized some data, this data should appear in a significant number of generated texts. The sequence of actions is as follows. Three strategies for generating text are defined:
1. With autoregressive construction of strings (the actual principle of GPT operation), each next word is sampled from the 40 most probable words for the given sequence.
2. For the first 10% of generated words, a temperature parameter is applied that smooths the probabilities of the next words given the initial prefix; the temperature is gradually reduced from 10 to 1. After these 10% are generated, subsequent words are selected using strategy 1.
3. The opening fragments of real texts from the Internet are used as seeds. After generation, the output is searched for matches with the original data.
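Strategies 1 and 2 can be illustrated with a minimal sketch. It is not the code from source [A] or from our notebook: `next_logits` is a hypothetical callable that stands in for the language model and returns one score per vocabulary token, the linear shape of the temperature decay is an assumption (the text above only says the temperature falls from 10 to 1), and applying the top-40 cut during the warm-up phase is an assumption as well.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(logits, k=40, temperature=1.0, rng=random):
    """Strategy 1: sample the next token from the k highest-scoring candidates."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]

def decayed_temperature(step, total_steps, start=10.0, end=1.0, warmup_fraction=0.1):
    """Strategy 2: temperature falls from `start` to `end` over the first
    `warmup_fraction` of the tokens (linear decay is an assumption)."""
    warmup = max(1, int(total_steps * warmup_fraction))
    if step >= warmup:
        return end
    return start + (end - start) * step / warmup

def generate(next_logits, prefix, length, k=40, decay=False, rng=random):
    """Autoregressive loop; `next_logits(tokens)` is a hypothetical model call."""
    tokens = list(prefix)
    for step in range(length):
        t = decayed_temperature(step, length) if decay else 1.0
        tokens.append(sample_top_k(next_logits(tokens), k=k, temperature=t, rng=rng))
    return tokens
```

With `decay=False` the loop corresponds to strategy 1; with `decay=True` the early tokens are drawn from a flatter distribution, which makes the generated texts more diverse before sampling returns to the plain top-k regime.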
For each strategy, 200,000 texts are generated from the specified prefixes (lines of text fed into the model); the resulting sets are then cleared of duplicates, including texts that are similar in their trigrams. Then, for each strategy and each of the six metrics, 100 texts are selected, and the generated lines are manually searched for on the Internet to find matches with real ones.
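The trigram-based deduplication step could look roughly like the sketch below. The similarity measure (Jaccard overlap of word trigrams) and the 0.5 threshold are assumptions chosen for illustration; the exact criterion used in source [A] may differ.

```python
def trigrams(text):
    """Set of word trigrams; very short texts fall back to a single tuple."""
    words = text.lower().split()
    if len(words) < 3:
        return {tuple(words)}
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def jaccard(a, b):
    """Share of trigrams that two texts have in common."""
    return len(a & b) / len(a | b)

def deduplicate(texts, threshold=0.5):
    """Keep a text only if it is not too similar to any text kept before it.
    The threshold value is an assumption."""
    kept, kept_grams = [], []
    for text in texts:
        grams = trigrams(text)
        if any(jaccard(grams, g) >= threshold for g in kept_grams):
            continue
        kept.append(text)
        kept_grams.append(grams)
    return kept
```

This pairwise comparison is quadratic in the number of texts, so for sets of 200,000 generations a hashing-based approach would probably be needed in practice.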
The following metrics are used:
Perplexity (uncertainty): the lower the value, the more plausible the generated text is to the model itself.
SmallPerplexity: the ratio of the Large GPT3 perplexity to the Small GPT3 perplexity.
MediumPerplexity: the ratio of the Large GPT3 perplexity to the Medium GPT3 perplexity.
Zlib entropy: a simplified estimate of the text's entropy, computed with zlib compression.
Lowercase: the ratio of the Large GPT3 perplexity on the original generation to its perplexity on the same generation converted to lowercase.
Window: the minimum Large GPT3 perplexity over all sliding windows across the text. The window size is 20% of the maximum length.
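A minimal sketch of how these metrics can be computed is given below. It assumes that per-token natural-log probabilities have already been obtained from the models (how they are obtained depends on the framework and is not shown); the function names are illustrative and are not taken from source [A].

```python
import math
import zlib

def perplexity(token_logprobs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_ratio(logprobs_large, logprobs_other):
    """Used for SmallPerplexity, MediumPerplexity and Lowercase: the Large
    model's perplexity divided by the reference perplexity."""
    return perplexity(logprobs_large) / perplexity(logprobs_other)

def zlib_entropy(text):
    """Size in bits of the zlib-compressed text, a rough entropy estimate."""
    return 8 * len(zlib.compress(text.encode("utf-8")))

def min_window_perplexity(token_logprobs, max_length, fraction=0.2):
    """Window: lowest perplexity over all sliding windows whose size is
    `fraction` of the maximum generation length."""
    window = max(1, int(max_length * fraction))
    if len(token_logprobs) <= window:
        return perplexity(token_logprobs)
    return min(
        perplexity(token_logprobs[i:i + window])
        for i in range(len(token_logprobs) - window + 1)
    )
```

For the Lowercase metric, the second argument of `perplexity_ratio` would be the log-probabilities of the lowercased text under the same Large model; the two sequences may differ in length because the tokenization changes.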
The authors of the method report that, out of the 1,800 selected texts (3 strategies × 6 metrics × 100 texts), on average 33.5% contain memorized training data.
The second method generates and selects texts using a graph (source [B]). The root of the tree is the seed supplied for generation; the nodes of each subsequent level hold the generated words, and the edge weights correspond to the probabilities with which these words continue the phrase leading to the given node. Once the tree is built, Dijkstra's shortest-path algorithm is used to form the most probable lines, which are then checked for presence among sensitive data (in our case, by searching the Internet).
The third method uses a specially trained attack model to extract data (source [C]). The attack model is a classifier that labels the data produced by the target model according to whether it was in the training data or not.
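The search over the tree can be sketched as follows. The sketch assumes that each edge is given the cost -log(p), so that the cheapest path corresponds to the most probable continuation; `next_token_probs` is a hypothetical callable that stands in for the model and returns a dictionary of candidate tokens with their probabilities. The tree is expanded lazily rather than built in advance.

```python
import heapq
import math

def most_probable_sequences(next_token_probs, seed, length, n_best=3):
    """Dijkstra-style search: always expand the cheapest partial sequence.
    Since every edge cost -log(p) is non-negative, the first complete
    sequences popped from the heap are the most probable ones."""
    heap = [(0.0, tuple(seed))]          # (accumulated cost, tokens so far)
    target_len = len(seed) + length
    results = []
    while heap and len(results) < n_best:
        cost, seq = heapq.heappop(heap)
        if len(seq) == target_len:
            results.append((list(seq), math.exp(-cost)))
            continue
        for token, p in next_token_probs(seq).items():
            if p > 0:
                heapq.heappush(heap, (cost - math.log(p), seq + (token,)))
    return results
```

Each returned pair holds a token sequence and its total probability under the stand-in model; in the experiment the resulting lines would then be checked against the Internet.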
The main difficulty of this method lies in the question: “How to train an attack model?”
To solve the problem, it is proposed to use shadow training: several “shadow models” are created that imitate the behavior of the target. Shadow models should be built and trained in the same way as the target model. The underlying idea is that similar models trained on relatively similar data records using the same method behave in a similar way.
Empirically, the researchers showed that the more shadow models are used, the more accurate the attack.
Training shadow models requires generating training corpora if it is not known what the target model was trained on. The generation is done using the target model.
It is assumed that records that are classified by the target model with a high degree of confidence should be statistically similar to the training dataset and thus serve as a good training set for shadow models.
The synthesis process takes place in two stages: at stage (1), a hill-climbing search (a simple iterative algorithm for finding a local optimum) explores the space of possible data records to find those that the target model classifies with high confidence; at stage (2), synthetic data is sampled from these records.
In other words, an initial record is generated first, the target model makes a prediction on it, and the resulting confidence is compared with a predetermined threshold. If the confidence of the target model increases compared with the previous step of the hill climb, the proposed record is accepted. Then some of the features are randomly changed and the next iteration is performed.
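The hill-climbing loop can be sketched as below. This is a simplified illustration rather than the exact procedure of source [C]: records are assumed to be lists of binary features, `target_confidence` is a hypothetical callable that returns the target model's confidence for the class of interest, and the values of `conf_min`, `k` and `max_iters` are arbitrary.

```python
import random

def synthesize_record(target_confidence, n_features, conf_min=0.8, k=2,
                      max_iters=1000, rng=random):
    """Search for a record that the target model classifies with confidence
    of at least `conf_min`; returns None if the search does not succeed."""
    best = [rng.randint(0, 1) for _ in range(n_features)]
    best_conf = target_confidence(best)
    for _ in range(max_iters):
        if best_conf >= conf_min:
            return best
        # Propose a neighbour: flip k randomly chosen binary features.
        candidate = list(best)
        for i in rng.sample(range(n_features), k):
            candidate[i] = 1 - candidate[i]
        conf = target_confidence(candidate)
        # Accept the proposal only if the target's confidence increased.
        if conf > best_conf:
            best, best_conf = candidate, conf
    return best if best_conf >= conf_min else None
```

For a text-generating model the "features" would have to be tokens or fragments of text rather than binary flags, which is one more reason why this method is hard to transfer to GPT directly.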
After the shadow data is generated, the shadow models are trained. The dataset is divided into training and test sets, and the models are trained on the training set. Then each model receives both its training and test sets as input and makes predictions. The model's output on the training data is labeled “in”, meaning the data was present during training; its output on the test data is labeled “out”, meaning it was not.
The labeled predictions of the shadow models are combined into a single dataset used to train the attack model.
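The construction of the attack model's training set can be sketched as follows. The interface is hypothetical: `make_shadow_model` is assumed to return an object with `fit(records)` and `predict(record)` methods, and the 50/50 split is an assumption.

```python
import random

def build_attack_dataset(make_shadow_model, shadow_datasets,
                         train_fraction=0.5, rng=random):
    """Train one shadow model per dataset and label its predictions:
    "in" for records it was trained on, "out" for held-out records."""
    attack_rows = []
    for records in shadow_datasets:
        records = list(records)
        rng.shuffle(records)
        split = int(len(records) * train_fraction)
        train, test = records[:split], records[split:]
        model = make_shadow_model()      # hypothetical factory
        model.fit(train)
        attack_rows += [(model.predict(r), "in") for r in train]
        attack_rows += [(model.predict(r), "out") for r in test]
    return attack_rows
```

The returned list of (prediction, label) pairs is what a binary classifier, the attack model, would then be trained on.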
As an auxiliary rather than independent method, one can consider fine-tuning the model (additional training) or searching over the settings of the model's hyperparameters. In this case, using seeds for personal data, we teach the model to generate texts with the meaning and format we need, and then apply any of the approaches above.
We note right away that we did not use the third method: the shadow models must be similar to the target, and we did not have the capacity to train comparable GPT models. Training even the smallest model, with 125 million parameters, would take about six months on Google Colab.
As a result of working with the models, we managed to get the following:
passport details (series, number, year of issue) that were generated and then found in public sources, without confirmation of the owner's full name (no data); such generations were only possible with the XL model
personal information on 4 veterans of the Great Patriotic War, including their military rank
blocked and valid card numbers of Russian and foreign banks
names and dates of birth confirmed for several people
real addresses, which the model generates with high probability and with the correct hierarchy: Region-District-City-Street-Zip
Based on the results of our experiments, we have not been able to find a method that extracts training data with high probability.
The little personal data that we obtained was the result of generating hundreds of texts and checking the selected ones against the Internet. At the same time, not all of the personal data found is critical (for example, the data on veterans of the Great Patriotic War).
The only thing that we managed to generate with high probability is real addresses. In this case, the beam search parameter works well.
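For readers unfamiliar with beam search, the idea can be shown with a short sketch. It is a generic illustration, not the generation code we ran: `next_token_probs` is a hypothetical callable that stands in for the model and returns candidate tokens with their probabilities.

```python
import math

def beam_search(next_token_probs, seed, length, beam_width=3):
    """Keep only the `beam_width` most probable partial sequences at each step."""
    beams = [(0.0, list(seed))]          # (log-probability, tokens so far)
    for _ in range(length):
        candidates = []
        for logp, tokens in beams:
            for token, p in next_token_probs(tuple(tokens)).items():
                if p > 0:
                    candidates.append((logp + math.log(p), tokens + [token]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams
```

Compared with sampling, beam search is deterministic and favors the continuations that the model considers most likely overall, which may explain why it suits rigidly structured data such as addresses.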
Also, a lot depends on how the training dataset was processed: whether differential privacy methods (for example, intentional noise injection) or regularization were used during training, and whether the training set was cleaned of sensitive information.
For instance, in their article on the trained GPT models, the developers point out that “The AGI NLP team has done a lot of data cleaning and deduplication, as well as preparing the model validation and testing kits.” Therefore, it is possible that the training set for these models did not contain any particularly critical data in the first place.