We continue the topic of information security and publish the translation articles Coussement Bruno.
Add noise to existing data, add noise only to data manipulation results, or generate synthetic data? Let’s trust our intuition?
Companies are growing and their cybersecurity regulations are becoming stricter, senior architects are embracing trends … All of this leads to the need (or obligation) to reduce the risks of privacy and information leakage only intensifies for data subjects.
In this case, methods of anonymizing or tokenizing data are widely used, although they also allow the possibility of disclosing private information (see this articleto understand why this is happening).
Generating synthetic data
Synthetic data have a fundamental difference. The goal is to create a data generator that shows the same global statistics as the original data. Distinguishing the original from the result should be difficult for a model or person.
After training the model on this table, I generated 5000 rows and plotted a column bar chart Elevation source and generated set. It seems that both lines visually coincide.
To test the relationship between pairs of bars, a paired graph of all continuous bars is shown. The shape that the blue-green dots form (generated) should visually match the shape of the red dots (original). And so it happened, cool!
If we now look at the mutual information (also known as unsigned correlation) between columns, then the columns that are correlated with each other should also be correlated in the generated set. Conversely, non-correlated columns in the original set should not be correlated in the generated set. A value close to 0 means no correlation, and a value close to 1 means perfect correlation. Great, it is!
Mutual information between columns original set:
Mutual information between columns generated set:
As a final test, I wanted to train a nonlinear dimensionality reduction method (UMAP) on the source set and project the origin points into 2D space. I enter the generated set into the same projector. The orange crosses (generated) should be in the blue point clouds of the original dataset. And there is! Excellent!
OK, experimenting with data is fun!
For more serious cases, there are 2 main approaches:
- Rule-based Stochastic Data Generation: the user specifies the distributions of the sample and the specific rules for the sample.
– column A: must contain female names,
– column B: must contain the name of the European country,
– Column C: must be an evenly sampled integer from 1 to 100 if the country in column B is France, otherwise a constant.
- Deep generative models: can be used to study the statistical distribution from which the real data were supposedly taken. If you have an approximation to this distribution, you can randomly sample a synthetic dataset of arbitrary size from it. This is what all the cool guys are doing now.
Today you can already write a proof-of-concept using synthetic data to solve one of the following common problems faced by IT organizations:
- No payload in development environment
Let’s say you are working on a data product (it can be anything) where the data you are interested in is in a production environment with a very strict access policy. Unfortunately, you only have access to the development environment without interesting data.
- God Mode – Access Rights for Engineers and Data Scientists
Let’s say you’re a data scientist and suddenly an information security officer has limited your much-needed privileges to access production data. How can you continue to do a good job in such a tough, limited environment?
- Transfer of sensitive data to an untrusted external partner
You are part of Company X. Organization Y would like to showcase their latest cool data product (it could be anything).
They ask you to extract data to show the product to you.
How does synthetic data relate to differential privacy?
The main property of synthetic data generation is that, regardless of post-processing or adding third-party information, no one will ever be able to know whether an object is contained in the original set, and also will not be able to get the properties of this object. This property is part of a broader concept called “differential privacy” (DP).
Global and local differential privacy
DP is divided into 2 types.
Often, only the result of a specific task is of interest (for example, training a model based on undisclosed data of patients from different hospitals, calculating the average number of people who have ever committed a crime, etc.), then attention should be paid to global differential privacy.
In this case, an untrusted user will never see confidential data. Instead, he or she tells a trusted curator (with global differential privacy mechanisms) who has access to sensitive data what operations to perform.
In contrast, if data is to be transferred to an untrusted party, the principles of local differential confidentiality come into play. Traditionally, this is accomplished by adding noise to every row in a table or database. The amount of added noise depends on:
- the required level of confidentiality (the famous epsilon in the DP literature),
- the size of the dataset (a larger dataset requires less noise to achieve the same level of confidentiality),
- column data type (quantitative, categorical, ordinal).
In theory, for an equal level of confidentiality, the global DP mechanism (adding noise to the result) will provide more accurate results than the local mechanism (line-level noise).
Thus, synthetic data generation methods can be thought of as a form of local DP.
For more information on these topics, I advise you to consult the following sources:
Let’s now look at a more specific example. You want to share a spreadsheet containing personal information with an untrusted party.
Right now, you can either add noise to existing data lines (local DP), set up and use a robust system (global DP), or generate synthetic data based on the original.
Noise should be added to existing data lines if
- you do not know what operation will be performed on the data after publication,
- you need to periodically share an update to the original data (= have this workflow as part of a stable batch process),
- you and the data owners trust the person / team / organization to add noise to the original data.
Here I recommend starting with the tools OpenDP…
The most famous case of differential confidentiality is in the United States Census data (see. databricks.com/session_na20/using-apache-spark-and-differential-privacy-for-protecting-the-privacy-of-the-2020-census-respondents).
This data is recalculated and updated every three years. It is mostly numerical data that is aggregated and published at multiple levels (county, state, national level).
Install and use a trusted system if
- the system you specified supports the tasks and operations that will be performed on it,
- basic data is stored in different places and cannot leave them (for example, in different hospitals),
- you and the data owners actually trust the current system and the person / team / organization that is setting it up.
As a user of sensitive data, you will get more accurate results than the first approach.
Many frameworks do not currently have all the necessary features to deploy this beast in a secure, scalable, and auditable way. There is still a lot of engineering work required here.
But as their adoption grows, DP can be a good alternative for large organizations and businesses.
I advise you to start here with OpenMined…
It is possible to generate synthetic data if
- the original table is relatively small (<1 million rows, <100 columns),
- enough ad-hoc generation (periodic generation is not required),
- you and the data owners trust the person / team / organization that will generate synthetic data for you.
As with the little experiment described above, the results are promising. It also does not require excellent knowledge of DP systems. You can start today, if need be, let it train all night and, so to speak, have the split synthetic set ready for tomorrow morning.
The biggest drawback is that these complex models can become expensive to train and maintain if the amount of data increases. Each table also requires its own complete model training (portable training will not work here). You won’t be able to scale to hundreds of tables even with a significant computational budget.
Otherwise, you’re out of luck.
Since data privacy is more important now than ever, we have excellent methods for generating synthetic data or for adding noise to existing data. However, they all still have their limitations. Apart from a few niche cases, a scalable and flexible enterprise-grade tool has not yet been created that would allow data containing personal information to be transferred to untrusted parties.
Data owners still need to trust established methods or systems, which requires a lot of trust from them. This is the biggest problem!
Until then, if you want to give it a try (proof of concept, just test it out), open any of the links above.