How do profilers work in Ruby and Python?

7 min


The translation of the article was prepared on the eve of the start of the advanced course “Python Developer”


Hello! As an aperitif to a Ruby profiler, I wanted to talk about how existing Ruby and Python profilers work. It will also help answer the question that many people ask me: “How to write a profiler?”

In this article, we will focus on processor profilers (not, say, memory / heap profilers). I will cover some basic approaches to writing a profiler, provide code examples and show many examples of popular profilers in Ruby and Python, and also show you how they work under the hood.

Probably, there may be errors in the article (in preparation for writing it, I partially looked through the code of 14 libraries for profiling, and I was not familiar with many of them until this moment), so please let me know if you find them …

2 types of profilers

There are two main kinds of processor profilers – sampling- and tracing-profilers.

Tracing profilers record every function call in your program, ultimately providing a report. Sampling profilers take a statistical approach, they write the stack every few milliseconds, generating a report based on this data.

The main reason to use a sampling profiler instead of a tracing profiler is because it is lightweight. You take 20 or 200 shots per second – it doesn’t take much time. Such profilers will be very effective if you have a serious performance problem (80% of the time is spent calling one slow function), since 200 snapshots per second will be enough to identify the problem function!

Profilers

Next, I will give a general summary of the profilers discussed in this article (from here). I will explain the terms used in the article (setitimer, rb_add_event_hook, ptrace) a little bit later. Interestingly, all profilers are implemented using a small set of basic features.

Python profilers

“Gdb hacks” not quite a Python profiler – it links to a website telling how to implement a hacker profiler as a shell script wrapper around gdb… This is exactly about Python, since new versions gbd actually deployed the Python stack for you. Something like pyflame for the poor.

Ruby profilers

Almost all of these profilers live inside your process.

Before we get into the details of these profilers, there is one very important thing – all of these profilers except pyflameare run inside your Python / Ruby process. If you are inside a Python / Ruby program, you usually have easy access to the stack. For example, here’s a simple Python program that prints the stack contents of each running thread:

import sys
import traceback

def bar():
    foo()

def foo():
    for _, frame in sys._current_frames().items():
        for line in traceback.extract_stack(frame):
            print line

bar()

Here is the console output. You can see that it has function names from the stack, line numbers, file names – whatever you might need if you are profiling.

('test2.py', 12, '', 'bar()')
('test2.py', 5, 'bar', 'foo()')
('test2.py', 9, 'foo', 'for line in traceback.extract_stack(frame):')

It’s even easier in Ruby: you can use puts callerto get a stack.
Most of these profilers are performance extensions of C, so they differ slightly, but such extensions for Ruby / Python programs also have easy access to the call stack.

How tracing profilers work

I have listed all tracing-Ruby and Python profiles in the tables above: rblineprof, ruby-prof, line_profiler and cProfile… They all work in a similar way. They record every function call and are C extensions to reduce overhead.

How do they work? In both Ruby and Python, you can specify a callback that is triggered when various interpreter events occur (for example, “a function call” or “a line of code execution”). When the callback is called, it writes the stack for later analysis.

It’s helpful to see exactly where these callbacks are in the code, so I’ll link to the relevant lines of code on github.

In Python, you can set up a callback with PyEval_SetTrace or PyEval_SetProfile… This is described in the documentation section Profiling and Tracing in Python. It says: “PyEval_SetTrace looks like PyEval_SetProfileexcept that the trace function receives line number events. “

Code:

  • line_profiler sets up its callback using PyEval_SetTrace: cm. line_profiler.pyx line 157
  • cProfile sets up its callback using PyEval_SetProfile: cm. _lsprof.c line 693(cProfile is implemented using lsprof)

In Ruby, you can set up a callback with rb_add_event_hook… I couldn’t find any documentation about it, but this is how it looks.

rb_add_event_hook(prof_event_hook,
      RUBY_EVENT_CALL | RUBY_EVENT_RETURN |
      RUBY_EVENT_C_CALL | RUBY_EVENT_C_RETURN |
      RUBY_EVENT_LINE, self);

Signature prof_event_hook:

static void
prof_event_hook(rb_event_flag_t event, VALUE data, VALUE self, ID mid, VALUE klass)

Something like PyEval_SetTrace in Python, but in a more flexible form – you can choose which events you want to be notified about (for example, “only function calls”).

Code:

  • ruby-prof causes rb_add_event_hook here: ruby-prof.c line 329
  • rblineprof causes rb_add_event_hook here: rblineprof.c line 649

Disadvantages of tracing profilers

The main drawback of tracing profilers implemented in this way is that they add a fixed amount of code for each function / line call that is executed. It can make you make the wrong decisions! For example, if you have two implementations of something – one with a lot of function calls and one without, which usually take the same amount of time, then the first, with a lot of function calls, will appear slower when profiling.

To show it clearly, I have created a small file called test.py with the following content and compared the execution time python -mcProfile test.py and python test.pypython. test.py completed in about 0.6 s, and python -mcProfile test.py – in about 1 s. So for this particular example cProfile added an additional ~ 60% overhead.

In the documentation cProfile said:
Python’s interpreted nature adds so much run-time overhead that deterministic profiling tends to add little processing overhead in regular applications

This seems like a pretty reasonable statement – the previous example (which makes 3.5 million function calls and nothing else) is obviously not a regular Python program, and almost any other program will have less overhead.
I have not checked the overhead ruby-prof (Ruby tracing profiler), but his README says the following:

Most programs will run about half as slow, while highly recursive programs (such as the Fibonacci series test) will run three times slower

How sampling profilers generally work: setitimer

Time to talk about the second kind of profiler: sampling profilers!
Most Ruby and Python sampling profilers are implemented using the system call setitimer… What it is?

Let’s say you want to take a snapshot of the program stack 50 times per second. This can be done as follows:

  • Ask the Linux kernel to send you a signal every 20 milliseconds (using the system call setitimer);
  • Register a signal handler for a stack snapshot when a signal is received;
  • When profiling is complete, ask Linux to stop signaling you and provide the result!

If you want to see a practical use case setitimer to implement a sampling profiler, I think that stacksampler.py – the best example is a useful, working profiler, and it is about 100 lines in Python. And this is very cool!

Reason for which stacksampler.py takes only 100 lines in Python, is that when you register a Python function as a signal handler, the function is pushed onto the current stack of your program. Therefore the signal handler stacksampler.py registration is very simple:

def _sample(self, signum, frame):
   stack = []
   while frame is not None:
       stack.append(self._format_frame(frame))
       frame = frame.f_back

   stack = ';'.join(reversed(stack))
   self._stack_counts[stack] += 1

It simply pops a stack from a frame and increments the number of times that a particular stack has been viewed. So simple! So cool!

Let’s take a look at all of our other profilers that use setitimer, and find out where in the code they call setitimer:

  • stackprof (Ruby): in stackprof.c line 118
  • perftools.rb (Ruby): in this patch that seems to apply when gem compiled (?)
  • stacksampler (Python): stacksampler.py line 51
  • statprof (Python): statprof.py line 239
  • vmprof (Python): vmprof_unix.c line 294

The important thing about setitimer – you need to decide how to count the time. Do you want 20ms real time? 20ms user cpu time? 20ms user + system cpu time? If you look closely at the links above, you will notice that these profilers actually use differently setitimer – sometimes behavior is customizable and sometimes not. Man page setitimer short and worth reading for all possible configurations.

@mgedmin in twitter pointed out one interesting drawback of using setitimer… This issue and this one issue reveal more details.

One INTERESTING drawback of profilers based on setitimer – what timers trigger signals! Signals sometimes interrupt system calls! System calls sometimes take a few milliseconds! If you take snapshots too often, you can make the program execute system calls indefinitely!

Sampling profilers that don’t use setitimer

There are several sampling profilers that do not use setitimer:

  • pyinstrument uses PyEval_SetProfile (so it’s kind of a tracing profiler), but it doesn’t always collect stack snapshots when invoking the trace callback. Here is the code that picks the snapshot time stack trace… Read more about this solution in this blog… (basically: setitimer allows you to profile only the main thread in Python)
  • pyflame profiles Python code outside of a process using a system call ptrace… He uses a loop where he takes pictures, sleeps for a certain amount of time and does the same thing again. Here’s a call to wait.
  • python-flamegraph takes a similar approach where it starts a new thread in your Python process and gets the stack traces, sleeps and loops all over again. Here’s a call to wait

All 3 of these profilers take real-time snapshots.

Pyflame blog posts

I’ve spent almost all of my time in this post on profilers other than pyflamebut it actually interests me the most because it profiles your Python program from a separate process, which is why I want my Ruby profiler to work the same way.

Of course, everything is a little more complicated than I described. I won’t go into details, but Evan Klitzke has written many good articles about this on his blog:

More information can be found at eklitzke.org… These are all very interesting things, which I am going to read more carefully – perhaps ptrace turns out to be better than process_vm_readv to implement a Ruby profiler! Have process_vm_readv less overhead since it doesn’t stop the process, but it can also give you a bad shot as it doesn’t stop the process :). In my experiments, getting conflicting shots was not a big problem, but I think that here I will conduct a series of experiments.

That’s all for today!

There are many important subtleties that I did not go into in this post – for example, I basically said that vmprof and stacksampler – are similar (they are not – vmprof supports string profiling and profiling of Python functions written in C, which I believe makes the profiler more complex). But they have some of the same basic principles, and so I think today’s review will be a good starting point.


TDD with and without pytest. Free webinar


Read more:

  • 5 visualizations to improve your Data Story
  • An introduction to Python Functools
  • Why you should start using FastAPI right now


0 Comments

Leave a Reply