At Brave, privacy is not a feature but a requirement around which the whole project is built. Our browser demonstrates this: we block trackers, prevent browser fingerprinting, and offer users our own privacy-first ad service.
In this post, we look at how we hide users’ IP addresses from both ourselves and our partner CDN, along with other privacy aspects of Brave Today, the browser’s recommendation feed.
Brave sends our servers as little information as possible, and only when it is needed. But that is not enough: to truly do the right thing, we set ourselves the goal of making it fundamentally impossible to harm user privacy.
How do we avoid revealing the IP addresses of browser users? As a rule, we strip them from requests to our backends at the content delivery network (CDN) level, so that an address cannot end up in the logs even by accident. But for one of our new services we want to go a step further: to make sure that no change to the CDN configuration could expose clients’ IP addresses, even if we wanted it to.
Let’s go through what we are doing and why, step by step.
New browser features require clients’ IP addresses to be protected better than before. One such feature is Brave Today, the recommendation feed on the browser’s new tab page.
The news itself can simply be delivered through a regular content delivery network: it is the same for all users, and the selection of articles and the machine learning happen locally in the browser. The images that accompany the news are trickier. If images are downloaded while the user browses the feed, anyone who can observe these network requests (for example, a CDN) can find out which articles were shown in the feed. From that, it is possible to infer information about the machine learning model used on the client and, indirectly, about which pages were previously visited to train that model.
Traditionally, content for latency-sensitive services is delivered and cached using a CDN. Obviously, the CDN operator sees both the content of users’ requests and their IP addresses. In our case (see Privacy), these two data streams should be separated. We decided to add a load balancer in front of the CDN. The result was the following:
In this model, the load balancer vendor does not see the content of requests or responses. All it has access to is encrypted traffic on port 443, which only the content delivery network can decrypt. The CDN vendor, in turn, sees everything inside requests and responses, but instead of the user’s IP address it receives one of the load balancer’s addresses. To make sure that no other IP addresses can connect to the CDN, it accepts connections only from the load balancer’s IP addresses. Similarly, the S3 bucket from which the CDN fetches data serves it only in response to requests made with the key that the CDN holds.
It is also important to use different vendors for the load balancer and for the CDN, which is what we do. This minimizes the theoretical risk of the two colluding to de-anonymize clients.
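The restriction on who may connect to the CDN can be pictured as a simple allowlist check. The following is only a minimal sketch: the function name `cdn_accepts` and the address ranges (taken from the blocks reserved for documentation) are assumptions made for illustration, and a real CDN would enforce this in its own firewall or origin configuration rather than in application code.

```python
import ipaddress

# Hypothetical load balancer egress ranges (documentation-only address blocks).
BALANCER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def cdn_accepts(peer_ip: str) -> bool:
    """Return True only if the connecting peer is one of the balancer's addresses."""
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in net for net in BALANCER_RANGES)
```

With these assumed ranges, `cdn_accepts("192.0.2.17")` is true, while a direct connection from any other address, such as `203.0.113.5`, is refused.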
We need to go deeper
Obviously, a TCP load balancer cannot decrypt the traffic passing through it. But it can observe the sizes of HTTP requests and responses, which opens the door to traffic analysis. We have no reason to suspect that our vendor partners would try to spy on users, but there is no need to rely on trust when a technical solution exists: we add padding to requests to make their sizes as similar as possible.
For example, the service requests images:
These requests can be changed like this:
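As a rough sketch of the idea, and not the scheme Brave actually uses, a client could add a padding header so that every request has the same length on the wire. The header name `X-Padding`, the 512-byte target, and the plain HTTP/1.1 text form are all assumptions chosen for illustration.

```python
PADDED_REQUEST_SIZE = 512      # hypothetical fixed request size, in bytes
PADDING_HEADER = "X-Padding"   # hypothetical header name


def build_padded_request(host: str, path: str) -> bytes:
    """Build a GET request whose total length is always PADDED_REQUEST_SIZE bytes."""
    base = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n"
    # Bytes used by everything except the padding value itself.
    overhead = len(base.encode("ascii")) + len(f"{PADDING_HEADER}: \r\n\r\n")
    pad_len = PADDED_REQUEST_SIZE - overhead
    if pad_len < 0:
        raise ValueError("request is larger than the padded size")
    request = f"{base}{PADDING_HEADER}: {'x' * pad_len}\r\n\r\n"
    return request.encode("ascii")
```

Two requests for paths of different lengths then produce byte strings of identical size, so an observer who sees only ciphertext lengths learns nothing from the request size alone.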
For HTTP responses, padding every file to a single size would be costly, but you can define several size classes and pad each file up to the most suitable one. Every application that uses the CDN has to do this for its own requests. For this scheme to work correctly, compression must be disabled on the CDN side.
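Padding responses to a small set of sizes might look roughly like the sketch below. The bucket sizes and the 4-byte length prefix used to recover the original file are assumptions for illustration; the actual sizes and framing would depend on the application and its content.

```python
import bisect
import struct

# Hypothetical size classes in bytes, sorted in ascending order.
SIZE_BUCKETS = [16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024]


def pad_to_bucket(payload: bytes) -> bytes:
    """Prefix the payload with its length, then zero-pad to the smallest fitting bucket."""
    framed = struct.pack(">I", len(payload)) + payload
    idx = bisect.bisect_left(SIZE_BUCKETS, len(framed))
    if idx == len(SIZE_BUCKETS):
        raise ValueError("payload does not fit in the largest bucket")
    return framed + b"\x00" * (SIZE_BUCKETS[idx] - len(framed))


def unpad(padded: bytes) -> bytes:
    """Recover the original payload using the 4-byte big-endian length prefix."""
    (length,) = struct.unpack(">I", padded[:4])
    return padded[4:4 + length]
```

A 20,000-byte image would be stored as a 65,536-byte file here, indistinguishable by size from any other file in the same bucket. This also shows why CDN compression has to stay off: compressing the zero padding would make the sizes differ again.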
Even though the user’s IP address never reaches the CDN, some HTTP headers can be used for fingerprinting, depending on how unique their values are. Therefore, every application that uses our private content delivery network must strip such headers from its requests. These are:
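Stripping headers on the client before a request leaves the application could be sketched as follows. The header names in `STRIPPED_HEADERS` are only an assumption, picked as examples of headers whose values commonly vary between users; they are not presented as the list Brave uses.

```python
# Hypothetical set of headers to remove, lowercased for case-insensitive matching.
STRIPPED_HEADERS = {"accept-language", "cookie", "referer", "user-agent"}


def strip_identifying_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the headers without the ones listed in STRIPPED_HEADERS."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in STRIPPED_HEADERS
    }
```

HTTP header names are case-insensitive, which is why the comparison is done on the lowercased name.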
What’s in the browser?
Everything said so far concerns the vendors of our infrastructure. But what about Brave itself, given that we have access to our partners’ dashboards?
To identify a user from a set of requests, we would need to either:
Have access to the logs of both systems, or
Add extra information to requests when they are forwarded from the TCP load balancer.
The logs are simple: under our agreement with the load balancer vendor, access to logging is disabled for our account. We also cannot add information to the headers, since the load balancer cannot decrypt the TLS stream. In theory, we could configure the load balancer to use the PROXY protocol, which prepends the client’s IP address to every forwarded connection, but fortunately our CDN provider does not support it. In case that changes in the future, our contract contains a special clause that stipulates this restriction.
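To see why the PROXY protocol would undermine the design, it helps to look at what it sends. In version 1 of the protocol, the load balancer writes a single cleartext line at the start of the TCP connection, before any TLS bytes, so the receiver learns the client address even though the balancer never decrypts anything. The helper below is only an illustration of that line format; the function name and the addresses are hypothetical.

```python
def proxy_v1_header(client_ip: str, client_port: int,
                    dest_ip: str, dest_port: int) -> bytes:
    """Build a PROXY protocol v1 line for a TCP-over-IPv4 connection."""
    line = f"PROXY TCP4 {client_ip} {dest_ip} {client_port} {dest_port}\r\n"
    return line.encode("ascii")
```

For a client at `203.0.113.7` the receiving side would read `PROXY TCP4 203.0.113.7 192.0.2.10 51234 443` ahead of the TLS handshake, which is exactly the information the architecture is meant to withhold from the CDN.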
Trust but verify
Careful design of such a system is undeniably important, but words and diagrams mean nothing without a way to verify our claims. To earn our users’ trust, we try to show our work as transparently as possible.
First, anyone can see how data is processed on the client side, since Brave is an open source browser. Second, it is easy to check which IP address the browser connects to by analyzing the traffic, for example with mitmproxy, or simply by looking at what the pcdn.brave.com host resolves to. Finally, to verify that requests are forwarded by the first vendor and to find the point where TLS is terminated, you can compare the response headers from https://pcdn.brave.com/ with those of sites served directly by the first vendor, for example https://haveibeenpwned.com/…
In the unlikely event that you find a bug or a violation of the privacy model, please report it immediately through our bug bounty program.
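The last two checks can be scripted with the Python standard library alone. The sketch below is one possible way to do it, not an official verification tool: the function names are hypothetical, and which headers differ between the two hosts has to be interpreted by the reader.

```python
import socket
import urllib.error
import urllib.request


def resolve(host: str) -> set[str]:
    """Return the set of IP addresses that the host name resolves to."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def response_header_names(url: str) -> set[str]:
    """Fetch the URL with a HEAD request and return its lowercased header names."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            headers = response.headers
    except urllib.error.HTTPError as err:
        headers = err.headers  # error responses still carry headers
    return {name.lower() for name in headers.keys()}


def compare_header_names(a: set[str], b: set[str]) -> dict[str, list[str]]:
    """Split header names into those seen only in a, only in b, or in both."""
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "shared": sorted(a & b),
    }
```

Calling `resolve("pcdn.brave.com")` shows which addresses the browser would connect to, and passing the results of `response_header_names` for the two URLs mentioned above into `compare_header_names` highlights which headers come from which layer.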