My name is Andrey, and I am a non-functional tester at Tenzor. Our department tests the client-side performance of Saby web applications. One of our areas of work is testing the product for memory leaks. Since most of our applications are Single Page Applications, memory leaks can be a big problem for our users. Another vulnerable part of the system is the HTML build server, which prepares the HTML page before sending it to the client. Like an SPA, it runs for a long time without restarting, so even a small constant leak can grow into large memory consumption. Before a release goes to production, we run automated tests for memory leaks. While fixing the leaks they found, we realized that developers often spend more time localizing a memory leak than fixing it.
Why it takes so long to locate a memory leak
There are many articles on the web about how to look for leaks properly and analyze the resulting memory dumps. However, their examples are greatly simplified, and you know from the very beginning what the leak is and where to look. In reality, memory dumps contain many “necessary” objects in addition to the leaked ones. Faced with several tens of thousands of objects of the same type, you sometimes don’t know where to start the analysis. For example, Fig. 1 below shows a small leak that occurs when opening a contact card on contacts.google.com. The leak is less than 100 KB, but more than 10,000 objects have already been created. On a more complex site there may be even more objects. That is why we decided to make memory leaks easier to debug by automating the search for the objects that are leaking.
Fig. 2 shows a diagram of our classic memory leak test. The action under test can be either a user action in the browser or the execution of some code on the server side (Node.js). After 4 repetitions of the tested action, we have 4 memory dumps (HeapSnapshot, or just a snapshot) that need to be analyzed.
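The flow of the test described above can be sketched roughly as follows. This is only an illustration: `collect_snapshots`, `fake_action`, and `fake_take_snapshot` are hypothetical names, and the stand-in functions merely record calls instead of driving a real browser or Node.js process. It also assumes that one snapshot is taken after each repetition.

```python
from typing import Callable, List


def collect_snapshots(action: Callable[[], None],
                      take_snapshot: Callable[[int], str],
                      repetitions: int = 4) -> List[str]:
    """Repeat the tested action and take one heap snapshot after each repetition."""
    paths = []
    for index in range(1, repetitions + 1):
        action()                            # user action in the browser or server-side code
        paths.append(take_snapshot(index))  # path to the saved snapshot file
    return paths


# Stand-ins for demonstration only (assumptions, not a real driver).
calls: List[str] = []


def fake_action() -> None:
    calls.append("action")


def fake_take_snapshot(index: int) -> str:
    calls.append("snapshot")
    return f"run_{index}.heapsnapshot"


snapshot_paths = collect_snapshots(fake_action, fake_take_snapshot)
```

In a real test the two callables would wrap whatever tooling performs the action and saves the heap snapshot.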
How we look for leaked objects in memory dumps
Such objects have several characteristic features:
1. They must be objects of the same type (Object, Array, (string), etc.).
2. They must leak on every repetition of the action. If one of the repetitions does not produce such an object, it can no longer be called a leak in the usual sense.
3. The parents of these objects must be of the same type (that is, their Retainers chains should look similar). This is because the leak comes from the same place in the code, so the context in which the objects appeared must be the same or similar.
Let’s look at an example. Our sample code (Fig. 3) creates a leak: on each iteration a new Object is added to an Array, with a different Object in one of its properties. The highlighted piece of code is repeated 3 times, and a snapshot is taken after each repetition. After taking the last one, we proceed to analyze them by selecting the “Objects allocated between snapshots” mode (Fig. 4).
Between dumps 1 and 2, as well as between dumps 2 and 3, an object of type Object is created. Their Retainers chains are similar (Object – Array – system / Context – …), but the identifiers of some objects differ.
Since all the signs of leaked objects are present, it makes sense to start the analysis with the selected objects, Object @175399 or @175441.
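To make the idea of “similar Retainers chains with different identifiers” concrete, here is a small sketch. The chains and all ids in it are invented for illustration and are not taken from Fig. 4; `chain_signature` is a hypothetical helper.

```python
from typing import List, Tuple

# A chain is listed from the object itself up through its retainers:
# (type name, object id). All ids below are invented.
Chain = List[Tuple[str, int]]


def chain_signature(chain: Chain) -> Tuple[str, ...]:
    """Keep only the type names, so chains can be compared regardless of ids."""
    return tuple(type_name for type_name, _ in chain)


chain_a: Chain = [("Object", 1001), ("Array", 500), ("system / Context", 42)]
chain_b: Chain = [("Object", 1002), ("Array", 500), ("system / Context", 42)]

same_shape = chain_signature(chain_a) == chain_signature(chain_b)  # types match
same_ids = chain_a == chain_b                                      # ids differ
```

Comparing by type only is what allows objects created on different repetitions to be recognized as coming from the same place in the code.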
Search algorithm for leaked objects
So, after our classic test (Fig. 2), we have 4 snapshots. Here is what we need to do to find the leaked objects in them:
1. Parse the snapshots. We need to get a complete tree with its connections. We used to parse the JSON with a self-written script, but then we came across a parser in the Chromium sources and now use it successfully (link). In addition, it cleans the snapshot of various system objects, which simplifies the analysis.
2. Filter out objects that have been removed by the GC. We need to find the objects that were created between adjacent snapshots and are still present in the last snapshot (which means the object was not deleted during garbage collection). Objects have to be matched by id (in the parser mentioned above this is the object_id parameter, and in DevTools it is the number prefixed with @), because the id of an object stays constant across snapshots. As a result, we get 3 groups of objects: created between snapshots 1 and 2, between 2 and 3, and between 3 and 4 (see the diagram in Fig. 5).
3. Filter out system objects and objects with zero weight. The parser itself removes system objects at the parsing stage. Objects with zero weight may be present in later snapshots, but they are just empty shells. We assume V8 needs them to work faster, but we have not investigated this in detail. They can get in the way of the analysis, because they are not memory leaks.
4. Find all parents. We need to build all “object – parent” pairs. With the parser this is very easy: each object stores links to its parents in the edges_to variable.
5. For each pair, get a string representation of the relationship. The string representation of an object is its type (the class_name parameter), so we join the type of the object with the type of its parent. In the example in Fig. 4, the object with id @175367 has type Object, and its parent (the object with id @136513) has type Array. Thus, we get the pair Object – Array. The same holds for the objects @175409 – @136513. This is what we see in the Retainers chain in DevTools during manual analysis.
6. Search for identical object-parent pairs between groups. This is the third sign of a leak: leaked objects must have a similar chain of parents, since they have a similar context. If such pairs exist, we suspect a leak. We match the strings exactly, that is, in our example we look for the same Object – Array string in the other groups obtained after step 3 (Fig. 6).
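The filtering, pairing, and matching steps above can be sketched as follows. This is a simplified model, not the API of the Chromium parser: the `Node` class and all function names are hypothetical, parents are stored as a plain list of ids instead of `edges_to`, parents are looked up in the last snapshot (an assumption), and parsing and system-object removal are taken as already done.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    object_id: int
    class_name: str
    self_size: int
    parent_ids: List[int] = field(default_factory=list)  # simplified stand-in for edges_to


Snapshot = Dict[int, Node]  # object_id -> node


def new_survivors(older: Snapshot, newer: Snapshot, last: Snapshot) -> Set[int]:
    """Ids created between two adjacent snapshots that are still alive in the last one."""
    return {oid for oid in newer if oid not in older and oid in last}


def pair_strings(ids: Set[int], last: Snapshot) -> Dict[str, Set[int]]:
    """Map each "type - parent type" string to the ids of the objects that produce it."""
    pairs: Dict[str, Set[int]] = {}
    for oid in ids:
        node = last[oid]
        if node.self_size == 0:  # skip zero-weight "empty shells"
            continue
        for parent_id in node.parent_ids:
            parent = last.get(parent_id)
            if parent is not None:
                key = f"{node.class_name} - {parent.class_name}"
                pairs.setdefault(key, set()).add(oid)
    return pairs


def find_suspects(snapshots: List[Snapshot]) -> Dict[str, List[int]]:
    """Return pairs present in every group, with the object ids from the first group."""
    if len(snapshots) < 2:
        raise ValueError("at least two snapshots are required")
    last = snapshots[-1]
    groups = [pair_strings(new_survivors(a, b, last), last)
              for a, b in zip(snapshots, snapshots[1:])]
    common = set(groups[0])
    for group in groups[1:]:
        common &= set(group)
    return {key: sorted(groups[0][key]) for key in common}


def snap(*nodes: Node) -> Snapshot:
    return {n.object_id: n for n in nodes}


# Invented data: one Array that retains a new Object after every repetition.
array = Node(10, "Array", 64)
leak1, leak2, leak3 = (Node(i, "Object", 32, [10]) for i in (21, 22, 23))
temp = Node(30, "(string)", 16, [10])  # collected before the last snapshot
shell = Node(40, "Object", 0, [10])    # zero weight, ignored
once = Node(50, "Map", 48, [10])       # appears in one group only

suspects = find_suspects([
    snap(array),
    snap(array, leak1, temp),
    snap(array, leak1, leak2, shell, once),
    snap(array, leak1, leak2, leak3, shell, once),
])
print(suspects)
```

With this data only the Object - Array pair appears in all three groups, so it is the only suspect. The Map object is retained too, but it is not repeated on every run.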
At this stage, we print to the build output (Fig. 7) all the pairs found from the first snapshot, with the ids of the objects, as well as the path to the last (fourth) snapshot. This is deliberate: these objects have already survived several garbage collections, so the only references still holding them are the ones that cause the leak.
Next, we hand the case over for manual analysis: the developer only needs to open the snapshot and find out why these objects are kept in memory. From there, the analysis is no different from other methods of analyzing memory leaks in JS.
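A report of this kind could be produced along the following lines. The layout is an assumption and does not reproduce the actual output in Fig. 7. `format_report`, the sample pair, and the file name are hypothetical.

```python
from typing import Dict, List


def format_report(suspects: Dict[str, List[int]], last_snapshot_path: str) -> str:
    """List every suspected pair with its object ids and point to the last snapshot."""
    lines = ["Suspected leaks (object - parent):"]
    for pair in sorted(suspects):
        ids = ", ".join(f"@{oid}" for oid in suspects[pair])
        lines.append(f"  {pair}: {ids}")
    lines.append(f"Open for manual analysis: {last_snapshot_path}")
    return "\n".join(lines)


report = format_report({"Object - Array": [21]}, "run_4.heapsnapshot")
print(report)
```

Printing the ids in the `@id` form used by DevTools lets the developer find the same objects in the snapshot directly.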
Disadvantages of the algorithm that we found
The presented algorithm has several disadvantages:
The test is relatively long (5–10 minutes on average). The slowest part is taking and retrieving the snapshot. If the action you are testing runs many times faster than taking a snapshot, it is better to test for leaks in 2 runs:
in the first one, without taking a snapshot, look at the leak only by the size of the heap;
in the second, if the first run showed a leak, run an in-depth analysis with snapshots.
The algorithm sometimes produces false positives. The more snapshots there are, the fewer false positives occur, but the longer the test takes. For us, 4 snapshots turned out to be the optimal number (no more than 5% of all test runs give a false positive). Rare false positives are eliminated by restarting the test. The reason is that with fewer snapshots it is more likely that each snapshot will contain similar object-parent pairs that are not actually leaking. In addition, some objects that are not garbage, but are kept in memory to speed up V8 and will be deleted later, may not have been deleted yet.
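The two-run approach from the first disadvantage above could look like this. It is a sketch under assumptions: the growth criterion (the heap grows on every repetition by more than a tolerance) is our illustration rather than the rule used in the real tests, and `heap_is_growing`, `run_leak_test`, and `tolerance_bytes` are hypothetical names.

```python
from typing import Callable, List, Optional


def heap_is_growing(heap_sizes: List[int], tolerance_bytes: int = 0) -> bool:
    """True if every measurement exceeds the previous one by more than the tolerance."""
    return all(later - earlier > tolerance_bytes
               for earlier, later in zip(heap_sizes, heap_sizes[1:]))


def run_leak_test(measure_heap_sizes: Callable[[], List[int]],
                  run_snapshot_analysis: Callable[[], str]) -> Optional[str]:
    """First run: cheap heap-size check. Second run: snapshots, only if needed."""
    sizes = measure_heap_sizes()
    if not heap_is_growing(sizes):
        return None                 # no growth, skip the slow snapshot run
    return run_snapshot_analysis()  # in-depth analysis with snapshots


# Invented measurements in bytes, one per repetition of the action.
leaky = run_leak_test(lambda: [1000, 1100, 1200, 1300], lambda: "deep analysis ran")
stable = run_leak_test(lambda: [1000, 1005, 1000, 1002], lambda: "deep analysis ran")
```

The slow snapshot run is paid for only when the cheap check already points to a leak.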
What we gained from using the described algorithm
It greatly simplified the analysis of memory leak bugs for developers. Previously, for almost every bug, the developer had to recall the basics of analyzing memory dumps, then deploy a local version of the product and reproduce the bug. All this could take several days. Now, in most cases, we can immediately tell the developer which object in which module remains in memory. All that is left for the developer is to figure out why the object is being retained. Quite often the reasons are trivial, and the fix does not take much time.
We can determine more accurately whether there is a leak. Chrome very often creates system objects and keeps them in memory. When the analysis is based on heap size alone, this can introduce a large error. With the implemented algorithm, we filter out such objects and can tell whether non-system objects are leaking.