Confidential-DPproof: a new machine learning protocol from Brave

Summary. Machine learning models trained on participant data without any privacy guarantees can leak sensitive information. For example, when applying machine learning to advertising, we ideally want to learn general patterns (“show science-based ads to users who visit science sites”), but the parameters of the model may encode specific facts about the interests of an individual user (“Ivan visited https://brave.com/research/ on March 7, 2024”). Unfortunately, some organizations may not adhere to their stated principle of training models with user data confidentiality in mind. This can happen either intentionally or unintentionally (for example, due to hard-to-detect bugs).

Our verifiable private training framework, called Confidential-DPproof (Confidential Proof of Differentially Private Training), supports verifiably private training without revealing any information about the data used or the model itself.

Confidential-DPproof is available as open source and can be used by organizations to provide verifiable privacy protections for customer data in machine learning-based products.

Why is privacy of user data important when training models?

Machine learning models trained without a privacy-preserving algorithm may memorize sensitive information from their training set. Publishing the parameters, or even just the outputs, of such models can compromise the privacy of the users who contributed to the training process. For example:

  1. Membership inference attacks, which use access to a model's outputs to infer whether specific data points were present in (or absent from) its training set (see the sketch after this list).

  2. Data reconstruction attacks, where original data from the training set is recovered by analyzing the model's parameters.
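To make the first threat concrete, here is a minimal sketch of a loss-threshold membership inference attack in Python. The `model`, `loss_fn`, and candidate examples are hypothetical placeholders, not code from our work; the idea is simply that training-set members tend to incur unusually low loss.

```python
def is_likely_member(model, loss_fn, x, y, threshold):
    """Flag (x, y) as a probable training-set member if the model's loss on it
    falls below the chosen threshold (members tend to have unusually low loss)."""
    loss = loss_fn(y, model.predict(x))
    return loss < threshold

def infer_membership(model, loss_fn, candidates, threshold):
    """Return the candidate examples that the attack flags as training-set members."""
    return [(x, y) for (x, y) in candidates
            if is_likely_member(model, loss_fn, x, y, threshold)]
```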

How can we protect user privacy?

The differential privacy framework is the gold standard for formalizing privacy guarantees. When training machine learning models, a differentially private learning algorithm gives participants a mathematically precise assurance that the data contributed by any individual participant has only a limited impact on the final model. In other words, the presence or absence of a particular participant cannot produce a significantly different model.
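For reference, the textbook (ε, δ)-differential privacy definition behind these guarantees can be written as follows (standard notation, not taken from our paper): a randomized training algorithm M must satisfy, for any two datasets D and D' that differ in one participant's data and any set of outcomes S,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
```

Smaller ε and δ mean the two output distributions are harder to tell apart, i.e. a stronger privacy guarantee.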

Differentially Private Stochastic Gradient Descent (DP-SGD) is the canonical approach to training machine learning models with differential privacy guarantees. It achieves differential privacy by: 1) clipping per-example gradients to a fixed norm, limiting the algorithm's sensitivity to any individual participant's data (assuming each participant contributes a single data point to training); and 2) adding calibrated noise to the gradients before they are used to update the model, making individual updates indistinguishable. Training machine learning models this way significantly reduces the risk of attacks targeting the training data.
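The two steps above can be illustrated with a minimal NumPy sketch of a single DP-SGD update; `grad_fn`, the hyperparameter values, and the overall structure are illustrative assumptions rather than our implementation.

```python
import numpy as np

def dp_sgd_step(params, batch, grad_fn, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD update: clip each per-example gradient, then add calibrated Gaussian noise."""
    clipped = []
    for example in batch:
        g = grad_fn(params, example)                                # per-example gradient
        g = g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))   # clip to norm <= clip_norm
        clipped.append(g)
    # Noise is scaled to the clipping norm, which bounds each participant's influence.
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_grad = (np.sum(clipped, axis=0) + noise) / len(batch)
    return params - lr * noisy_grad
```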

Why is verifiable privacy important?

Implementations of such learning algorithms can be complex, and hard-to-detect errors can easily arise from unusual or intricate modifications to the underlying learning algorithm. As a result, organizations may not actually deliver the differential privacy guarantees they claim when training models, whether intentionally or accidentally. For example, see the discussion of subtle issues in Apple's deployment of a similar approach here.

It is therefore important that companies and other organizations using DP-SGD can demonstrate that they are indeed training models with the claimed data privacy protections.

Confidential-DPproof framework

We have developed the Confidential-DPproof framework, which allows organizations to directly and proactively prove to any stakeholder (users, customers, regulators, or governments), using a zero-knowledge proof protocol, that their machine learning models were indeed trained with data confidentiality preserved.

How does Confidential-DPproof work?

We propose a zero-knowledge proof protocol that proactively generates a privacy certificate during the training process while keeping all relevant information, including training data and model parameters, private. A zero-knowledge proof allows a party to prove claims about its data without disclosing the data itself. Thus, the prover's dataset and model are never revealed to third parties.
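As a generic illustration of the zero-knowledge idea (not the protocol used in Confidential-DPproof), here is a toy Schnorr-style proof in Python: the prover convinces a verifier that it knows a secret exponent x with y = g^x mod p, without revealing x.

```python
import secrets

# Toy group parameters: g generates a subgroup of prime order q in Z_p^*.
p, q, g = 23, 11, 4

x = 7                          # prover's secret
y = pow(g, x, p)               # public value the claim refers to

r = secrets.randbelow(q)       # 1. prover commits to a random nonce
t = pow(g, r, p)
c = secrets.randbelow(q)       # 2. verifier sends a random challenge
s = (r + c * x) % q            # 3. prover responds without revealing x or r

# 4. verifier accepts iff g^s == t * y^c (mod p); the transcript leaks nothing about x.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
```

Confidential-DPproof applies the same principle at a much larger scale: the claim being proven is, roughly, that every training step followed the agreed DP-SGD algorithm.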

Participants. Let's consider a setting with two parties, the prover and the auditor:

  • A prover is an organization that wants to train a model on a sensitive dataset for machine learning purposes.

  • An auditor is a third party (such as a user or a regulator) whose goal is to verify the privacy guarantees of the prover's model.

Desired characteristics:

  1. Completeness. The auditor must be able to certify the claimed confidentiality guarantee and provide publicly available evidence of that guarantee.

  2. Soundness. The auditor must be able to detect malicious provers who, intentionally or accidentally, do not adhere to their claims about training machine learning models with confidentiality guarantees.

  3. Zero knowledge. The training data and the model parameters obtained during training must remain confidential, since organizations are unwilling or unable to share their models and data with third parties for data confidentiality and intellectual property reasons.

The Confidential-DPproof framework works as follows (a hypothetical sketch of this exchange appears after the list):

  1. The prover publicly announces the confidentiality guarantee it plans to provide.

  2. The prover and the auditor agree on the Differentially Private Stochastic Gradient Descent (DP-SGD) learning algorithm and the specific values of its hyperparameters.

  3. The prover and the auditor launch our new zero-knowledge protocol.

  4. The prover receives a certificate of the stated guarantee of confidentiality.
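The exchange above might be organized along the following lines; every name here (announce_guarantee, train_with_zero_knowledge, issue_certificate) and every parameter value is a hypothetical placeholder rather than the actual Confidential-DPproof API.

```python
def run_confidential_dpproof(prover, auditor):
    # 1. The prover publicly announces the target privacy guarantee.
    claimed = prover.announce_guarantee(epsilon=1.0, delta=1e-5)

    # 2. Both parties agree on DP-SGD and its hyperparameters.
    hyperparams = auditor.agree_on_hyperparameters(
        algorithm="DP-SGD", clip_norm=1.0, noise_multiplier=1.1, steps=1000)

    # 3. Training runs inside the zero-knowledge protocol: the auditor checks each
    #    clipping and noising step without seeing the data or the model parameters.
    proof = prover.train_with_zero_knowledge(hyperparams, verifier=auditor)

    # 4. If every step verifies, the auditor issues a certificate for the claimed guarantee.
    if auditor.verify(proof, claimed):
        return auditor.issue_certificate(claimed)
    raise ValueError("verification failed: the claimed guarantee is not certified")
```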

What is the performance of Confidential-DPproof?

Because we designed Confidential-DPproof to enable organizations to prove the privacy guarantees of their models to auditors while keeping their data and model confidential, we address the following questions:

  1. What trade-offs between utility and certified privacy guarantees can Confidential-DPproof achieve, considering i) the organizations' interests (in terms of utility); ii) the auditor's interests (in terms of confidentiality); and iii) the constraints that zero-knowledge verification imposes on model complexity?

  2. What is the cost (in terms of execution time) of verifying the training process while maintaining confidentiality?

Conclusion: Confidential-DPproof runs in practical execution time!

You can check out the full version of our article to find out more.
