Topleaked for analyzing memory leaks

What do most programmers do when they find out that their program is leaking memory? Nothing; let the user buy more RAM. Joking aside, I would venture that they take a reliable, time-tested tool such as valgrind or libasan, run the program under it, and read the report. The report usually says that objects created on such-and-such a line of such-and-such a file were never freed. But why were they not freed? That is not written anywhere.
This post focuses on the topleaked leak finder, the underlying statistical analysis concept, and how it can be applied.

I have already written about topleaked on Habr, but I will repeat the main idea in general terms. If some objects are never freed, they accumulate in memory. That means the memory contains many homogeneous, similar byte sequences. If leaked objects outnumber the ones actually in use, the most frequent sequences are parts of leaked objects. C++ objects typically contain a pointer to their class's vtbl, so we can figure out which type of objects we forget to free. Of course, the top contains a lot of garbage and commonly occurring values, and valgrind will tell us what leaked and where much better. But topleaked was never meant to compete with technologies refined over the years. It was conceived as a tool for a problem that nothing else can solve: analyzing irreproducible leaks. If the problem cannot be reproduced in a test environment, dynamic analysis is useless. If the error occurs only in production, and intermittently at that, the most we can get is logs and a memory dump. That dump can be analyzed with topleaked.

A simplified use case from the last post

Let’s take a simple C++ program with a memory leak that terminates itself with a memory dump by calling abort():

#include <cstdlib>    // abort
#include <iostream>   // std::cout
#include <unistd.h>   // getpid

class A {
    size_t val = 12345678910;
    virtual ~A(){}
};

int main() {
    for (size_t i =0; i < 1000000; i++) {
        new A();
    }
    std::cout << getpid() << std::endl;
    abort();
}

Let's run topleaked

./topleaked leak.core

The default output format is a human-readable top, one value per line.

0x0000000000000000 : 1050347
0x0000000000000021 : 1000003
0x00000002dfdc1c3e : 1000000
0x0000558087922d90 : 1000000
0x0000000000000002 : 198
0x0000000000000001 : 180
0x00007f4247c6a000 : 164
0x0000000000000008 : 160
0x00007f4247c5c438 : 153
0xffffffffffffffff : 141

On its own this is not very useful, except that we can see the number 0x2dfdc1c3e, which is 12345678910, occurring a million times. That might be enough, but I want more. To see the class names of the leaked objects, you can hand the result to gdb by piping topleaked's standard output into a gdb session that has the dump file open. The -ogdb option changes the output format to one that gdb understands.

$ ./topleaked -n10 -ogdb /home/core/leak.1002.core | gdb leak /home/core/leak.1002.core
...<lots of gdb startup output>
#0  0x00007f424784e6f4 in __GI___nanosleep (requested_time=requested_time@entry=0x7ffcfffedb50, remaining=remaining@entry=0x7ffcfffedb50) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28      ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) $1 = 1050347
(gdb) 0x0:      Cannot access memory at address 0x0
(gdb) No symbol matches 0x0000000000000000.
(gdb) $2 = 1000003
(gdb) 0x21:     Cannot access memory at address 0x21
(gdb) No symbol matches 0x0000000000000021.
(gdb) $3 = 1000000
(gdb) 0x2dfdc1c3e:      Cannot access memory at address 0x2dfdc1c3e
(gdb) No symbol matches 0x00000002dfdc1c3e.
(gdb) $4 = 1000000
(gdb) 0x558087922d90 <_ZTV1A+16>:       0x87721bfa
(gdb) vtable for A + 16 in section .data.rel.ro of /home/g.smorkalov/dlang/topleaked/leak
(gdb) $5 = 198
(gdb) 0x2:      Cannot access memory at address 0x2
(gdb) No symbol matches 0x0000000000000002.
(gdb) $6 = 180
(gdb) 0x1:      Cannot access memory at address 0x1
(gdb) No symbol matches 0x0000000000000001.
(gdb) $7 = 164
(gdb) 0x7f4247c6a000:   0x47ae6000
(gdb) No symbol matches 0x00007f4247c6a000.
(gdb) $8 = 160
(gdb) 0x8:      Cannot access memory at address 0x8
(gdb) No symbol matches 0x0000000000000008.
(gdb) $9 = 153
(gdb) 0x7f4247c5c438 <_ZTVN10__cxxabiv120__si_class_type_infoE+16>:     0x47b79660
(gdb) vtable for __cxxabiv1::__si_class_type_info + 16 in section .data.rel.ro of /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) $10 = 141
(gdb) 0xffffffffffffffff:       Cannot access memory at address 0xffffffffffffffff
(gdb) No symbol matches 0xffffffffffffffff.
(gdb) quit

This is not very easy to read, but it is possible. Lines like $4 = 1000000 give the position in the top and the number of occurrences found. Below each one are the results of running x and info symbol on the value. Here we can see that the vtable for A occurs a million times, which corresponds to a million leaked objects of class A.

I have already written about all of this. As commenters rightly pointed out, the idea is ancient, at least 15 years old. But the story does not end there; it is only beginning.

We know what leaks, but why?

At the start of the post an important question was posed: why does the memory leak? Debugging utilities typically tell you where an object or array was created, but not where it should have been deleted. Topleaked is no exception. Only a programmer can understand why a particular piece of memory was not freed, by finding the faulty scenario. But what if we go beyond searching for types? If we can walk through all the objects that we consider leaked, we can look for what they have in common. Let me give a real example, the story for which the new feature was written.

Down to business and ... problems

There is a service. It is heavily loaded; hundreds of thousands of users pass through it. Leaking even a trifle on every request or connection would be fatal: the process would blow up within minutes. The service was debugged and ran for 3 months without a process restart. Not a record uptime, but still. And after 3 months we found out that it had been leaking a little the whole time. That seems like a trifle: once it exceeds its normal consumption by 2-3 times, restart it and forget about it. But file descriptors were leaking along with the memory. Since the service is purely a network service, these descriptors are unclosed sockets, which means we have a logic problem. The service is written almost entirely in C++, with a small sprinkling of Perl. This has little effect on the rest of the story, but it is easier to start from specifics. You could just as well make the same mistake in C, D, Rust, Go, or Node.js, and you could search for it in the same way, except that JS would pose some problems.

Reading the code gave us nothing. Every scenario for using network connections that I could think of at the time ended with the last reference to the object being dropped, which, thanks to a smart pointer, runs a destructor that is certain to call close. Analysis and monitoring showed that roughly every hundredth socket was not closed. Sessions are long (game sessions of game clients), so it took months to hit the per-process limit on open fds (512000 in our case). We also could not find any trace of these unclosed clients in the logs. At first glance, everything that was opened was closed. There was nowhere else to look, so I went digging in a memory dump of the process, taken shortly before the limit on open connections was reached.

Collecting statistics

The first topleaked run reported the obvious: client connection objects are leaking. Thank you, Captain Obvious; we already knew that from the unclosed sockets. What we care about is what is special about these connections, because most connections die properly when they are supposed to. And here the idea was born: what if we walk through all these objects in the dump and look at their state? In this case the class had a state enum member, which holds the logical state of the client. Roughly speaking: not connected, connected, websocket handshake passed, authorization passed. If you know which state the leaked objects are in, it is easier to search.

There is a catch here. Topleaked does not understand dump formats. It simply opens the file as a binary stream, cuts it into 8-byte chunks, and builds the top of the most frequent 8-byte sequences. There is no deep idea behind this; it was just the easiest way to write the first version, and nothing is more permanent than a temporary solution. But because of the lack of structure, it is impossible to tell where the values we need are located. All we have is the value of the pointer to the vtbl of the class we are interested in. We also know that this pointer, like all members, lies inside the object. That is, you can search the dump for the vtbl pointer of interest, and state will be at some offset from the position found in the file. This offset is fixed, since it depends only on the class layout. All that remains is to find this offset.
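To make the counting pass concrete, here is a minimal C++ sketch of the idea, not topleaked's actual code (which is written in D). The function name mostFrequentWords is hypothetical, and the sketch assumes the whole dump has already been read into memory and that only 8-byte-aligned positions are examined.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: count every aligned 8-byte word in a raw buffer and
// return the `top` most frequent values, most frequent first.
std::vector<std::pair<std::uint64_t, std::size_t>>
mostFrequentWords(const std::vector<unsigned char>& dump, std::size_t top) {
    std::unordered_map<std::uint64_t, std::size_t> counts;
    // A trailing chunk shorter than 8 bytes is ignored.
    for (std::size_t pos = 0; pos + sizeof(std::uint64_t) <= dump.size();
         pos += sizeof(std::uint64_t)) {
        std::uint64_t word;
        std::memcpy(&word, dump.data() + pos, sizeof word);  // safe for any alignment
        ++counts[word];
    }

    std::vector<std::pair<std::uint64_t, std::size_t>> result(counts.begin(), counts.end());
    top = std::min(top, result.size());
    // Only the first `top` entries need to be ordered by descending count.
    std::partial_sort(result.begin(), result.begin() + top, result.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    result.resize(top);
    return result;
}
```

Nothing in this pass knows about objects or types. It only produces a histogram, which is exactly why the layout of the class has to be supplied from outside.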

In the case of C++ there is a problem: the language has no standard ABI or any fixed rules for laying out members in objects. For POD or trivial types there are clear rules inherited from C, but the location of the pointer to the virtual table, like the very existence of the virtual table, is not standardized. Fortunately, in practice everything is simple. If you do not get too clever with multiple inheritance and you consider the final class in the hierarchy, then with gcc on Linux the vtbl pointer turns out to be the first member of the object. So the offsetof of state is our offset. In a simpler example, it looks like this:

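#include <cstddef>    // offsetof, size_t
#include <cstdlib>    // abort
#include <iostream>

using namespace std;

struct Base {
    virtual void foo() = 0;
};

struct Der : Base {
    size_t a = 15;
    void foo() override {

    }
};
int main()
{
    // "Leak" 10000 objects of the class we will search for in the dump.
    for (size_t i = 0; i < 10000; ++i) {
        new Der;
    }
    auto d = new Der;  // one more object that is never freed
    // offsetof on a polymorphic type is only conditionally supported;
    // gcc accepts it with a warning.
    cout << offsetof(Der, a) << endl;
    abort();
    return 0;
}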

Here we print offsetof(Der, a), "leak" 10000 objects in the loop (plus one more, d), and crash. First, let's run topleaked in normal mode:

topleaked  my_core.core
0x0000000000000000 : 50124
0x000000000000000f : 10005
0x0000000000000021 : 10004
0x000055697c45cd78 : 10002
0x0000000000000002 : 195
0x0000000000000001 : 182
0x00007fe9cbd6c000 : 167
0x0000000000000008 : 161
0x00007fe9cbd5e438 : 154
0x0000000000001000 : 112

0x000055697c45cd78 is the pointer to the vtbl of the Der class, and offsetof equals 8. So you need to find this pointer, step 8 bytes forward, and read the value. For the search we will use a separate topleaked mode of operation: search. The -f flag specifies what to look for in the dump, --memberOffset is the offset of the field of interest relative to the position found by -f, and --memberType is the type of the field. The supported types are uint8, uint16, uint32 and uint64.

topleaked my_core.core -f0x55697c45cd78 --memberOffset=8 --memberType=uint64

We get:

0x000000000000000f : 10001
0x000055697ccaa080 : 1

We see 10001 values of 0x0f, which we wrote ourselves (the 10000 objects from the loop plus d), as well as a little noise.
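The search mode can be pictured with a small C++ sketch. This is an illustration of the technique under my own assumptions, not the tool's D implementation: the name countMemberValues is hypothetical, the dump is assumed to be fully loaded into memory, and only 8-byte-aligned positions are compared with the pattern.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Hypothetical helper: for every aligned 8-byte word equal to `pattern`
// (the vtbl pointer), read a Member located `memberOffset` bytes after the
// start of that word and count how often each value occurs.
template <typename Member>
std::map<Member, std::size_t> countMemberValues(const std::vector<unsigned char>& dump,
                                                std::uint64_t pattern,
                                                std::size_t memberOffset) {
    std::map<Member, std::size_t> counts;
    for (std::size_t pos = 0; pos + sizeof pattern <= dump.size(); pos += sizeof pattern) {
        std::uint64_t word;
        std::memcpy(&word, dump.data() + pos, sizeof word);
        if (word != pattern) continue;
        // The member would lie past the end of the dump; later positions would too.
        if (pos + memberOffset + sizeof(Member) > dump.size()) break;
        Member value;
        std::memcpy(&value, dump.data() + pos + memberOffset, sizeof value);
        ++counts[value];
    }
    return counts;
}
```

With the dump above, a call like countMemberValues<std::uint64_t>(dump, 0x55697c45cd78, 8) plays the role of -f, --memberOffset and --memberType. The lone noise entry in the real output is what you would expect from a copy of the vtbl pointer that is not followed by a Der object.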

Happy ending

In the real situation, everything works in much the same way. First, in a test environment, I made sure that the offset was correct and that the search found what I needed, and then I ran it on the real dump. The output was surprising at first, and then pleasing. There were several thousand authorized clients, which matched the number of users online when the dump was taken. But most importantly, there were hundreds of thousands of objects that were not merely unauthorized but in the very first state. This state means that the client has connected to the server over TCP but has not sent a single byte: no websocket upgrade, nothing unexpected either. They connected and stayed silent. This is the easiest place to debug, because there is so little of our code involved that there is hardly anywhere to go wrong. It all turned out to be simple: the author of the code (I confess, it was me) did not understand the guarantees TCP gives. If you do not enable additional options and do not try to do anything with the socket, there is no way to learn that the peer has disconnected. By default there are no built-in pings or inactivity timeouts. There is only an extension that everyone supports but that is disabled by default: TCP Keep-Alive. More details can be found at https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

The most unpleasant part is that we actually knew about this. We had added similar logic to the game protocol, a check for inactive sockets, and it works. But it only turns on after the websocket connection is established. Therefore only clients that lost their connection before sending the first packet were affected by the leak.
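An application-level inactivity check that covers the whole lifetime of a connection could be sketched like this. The IdleTracker class and all of its method names are hypothetical and are not taken from the service's code; the point is only that tracking starts at accept time, not after the handshake.

```cpp
#include <chrono>
#include <unordered_map>
#include <vector>

// Hypothetical helper: remembers when each connection was last active,
// starting at accept time instead of after the websocket handshake.
class IdleTracker {
public:
    using Clock = std::chrono::steady_clock;

    explicit IdleTracker(Clock::duration timeout) : timeout_(timeout) {}

    // Call on accept() and again whenever bytes arrive from the client.
    void touch(int fd, Clock::time_point now) { lastSeen_[fd] = now; }

    // Call when a connection is closed normally.
    void forget(int fd) { lastSeen_.erase(fd); }

    // Removes and returns the descriptors that have been silent for longer
    // than the timeout; the caller is expected to close them.
    std::vector<int> takeExpired(Clock::time_point now) {
        std::vector<int> expired;
        for (auto it = lastSeen_.begin(); it != lastSeen_.end();) {
            if (now - it->second > timeout_) {
                expired.push_back(it->first);
                it = lastSeen_.erase(it);
            } else {
                ++it;
            }
        }
        return expired;
    }

private:
    Clock::duration timeout_;
    std::unordered_map<int, Clock::time_point> lastSeen_;
};
```

Passing the current time in as a parameter keeps the class easy to test; in a server, takeExpired would be called periodically from the event loop with Clock::now().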

A little more about D

I cannot help noting how easy it was to add the functionality described above. If you look at the commit, you will see that one statement like this was added for each supported data type (uint8/16/32/64):

readFile(name, offset, limit)
    .findMember!uint64_t(pattern, memberOffset)
    .findMostFrequent(size).printResult(format);

findMember is a new function that implements the search with an offset, and findMostFrequent is the same function that builds the top of the most frequent values. Thanks to templates and range-based algorithms, nothing in it had to change, even though it originally worked on arrays and is now handed a rather peculiar iterator that searches and jumps through the file.

Installation instructions

There are no binary builds, so one way or another you will need to build the project from source. This requires a D compiler. There are three options: dmd, the reference compiler; ldc, based on LLVM; and gdc, included in gcc since version 9. So you may not need to install anything if a recent gcc is available. If you do install a compiler, I recommend ldc, as it optimizes better. All three can be found on the official website.
The dub package manager ships with the compiler. With it, topleaked is installed with one command:

dub fetch topleaked

After that, use this command to run it:

dub run topleaked -brelease-nobounds --  [...]

To avoid repeating dub run and the -brelease-nobounds build option every time, you can download the sources from GitHub and build the executable:

dub build -brelease-nobounds

The topleaked executable will appear in the root of the project folder.

Link to github

PS Thanks to Crazy Panda for the opportunity to make and use such things in work, as well as for the motivation to write posts. Otherwise, the text would have been gathering dust for another year on the hard disk, as it was with the last post about topleaked.
