Hello! As a warm-up for writing a Ruby profiler, I wanted to talk about how existing Ruby and Python profilers work. It should also help answer a question many people ask me: "How do you write a profiler?"
In this article, we will focus on processor profilers (not, say, memory / heap profilers). I will cover some basic approaches to writing a profiler, provide code examples and show many examples of popular profilers in Ruby and Python, and also show you how they work under the hood.
There are probably errors in this article (while preparing it, I skimmed the code of 14 profiling libraries, many of which I had never looked at before), so please let me know if you find any.
Two kinds of profilers
There are two main kinds of CPU profilers: tracing profilers and sampling profilers.
Tracing profilers record every function call in your program and produce a report at the end. Sampling profilers take a statistical approach: they record the stack every few milliseconds and generate a report from that data.
The main reason to use a sampling profiler instead of a tracing profiler is that it is lightweight. Taking 20 or 200 snapshots per second doesn't cost much time. And sampling profilers are very effective if you have a serious performance problem (say, 80% of the time is spent in one slow function), since 200 snapshots per second is plenty to identify the offending function!
Next is a general summary of the profilers discussed in this article (taken from here). I'll explain the terms it uses (setitimer, rb_add_event_hook, ptrace) a little later. Interestingly, all of these profilers are implemented with a small set of basic capabilities.
"Gdb hacks" is not quite a Python profiler: it links to a website describing how to implement a hacky profiler as a shell-script wrapper around gdb. It is Python-specific because recent versions of gdb will actually unwind the Python stack for you. Something like pyflame for the poor.
Almost all of these profilers live inside your process.
Before we get into the details of these profilers, there is one very important thing: all of these profilers except pyflame run inside your Python / Ruby process. If you are inside a Python / Ruby program, you usually have easy access to its stack. For example, here's a simple Python program that prints the stack of each running thread:
```python
import sys
import traceback

def bar():
    foo()

def foo():
    for _, frame in sys._current_frames().items():
        for line in traceback.extract_stack(frame):
            print line

bar()
```
Here is the console output. You can see function names from the stack, line numbers, file names: everything you might need when profiling.
```
('test2.py', 12, '<module>', 'bar()')
('test2.py', 5, 'bar', 'foo()')
('test2.py', 9, 'foo', 'for line in traceback.extract_stack(frame):')
```
It's even easier in Ruby: you can get a stack with puts caller.
Most of these profilers are C extensions for performance reasons, so they differ slightly, but C extensions to Ruby / Python programs also have easy access to the call stack.
How tracing profilers work
I listed all the tracing profilers for Ruby and Python in the table above: rblineprof, ruby-prof, line_profiler, and cProfile. They all work similarly: they record every function call, and they are written as C extensions to reduce overhead.
How do they work? Both Ruby and Python let you register a callback that fires on various interpreter events (for example, "a function was called" or "a line of code was executed"). When the callback fires, it records the stack for later analysis.
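The pure-Python equivalent of this callback mechanism is sys.setprofile (the C-level PyEval_SetProfile with a Python-level API). A minimal sketch, just to show the shape of the callback; the function names here are mine, not from any real profiler:

```python
import collections
import sys

call_counts = collections.Counter()

def profile_callback(frame, event, arg):
    # The interpreter invokes this on call/return events, which is
    # what a C extension registered via PyEval_SetProfile receives.
    if event == 'call':
        call_counts[frame.f_code.co_name] += 1

def helper():
    pass

def work():
    for _ in range(3):
        helper()

sys.setprofile(profile_callback)
work()
sys.setprofile(None)
```

After this runs, call_counts records one call to work and three calls to helper, which is exactly the "record every function call" behavior of a tracing profiler.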
It’s helpful to see exactly where these callbacks are in the code, so I’ll link to the relevant lines of code on github.
In Python, you can set up a callback with PyEval_SetProfile. This is described in the Profiling and Tracing section of the Python documentation, which says: "PyEval_SetTrace is similar to PyEval_SetProfile, except that the trace function receives line-number events."

- line_profiler sets up its callback using PyEval_SetTrace (it needs the line events);
- cProfile sets up its callback using PyEval_SetProfile, in _lsprof.c line 693 (cProfile is implemented on top of lsprof).
In Ruby, you can set up a callback with rb_add_event_hook. I couldn't find documentation for it, but a call to it looks like this:

```ruby
rb_add_event_hook(prof_event_hook,
                  RUBY_EVENT_CALL | RUBY_EVENT_RETURN |
                  RUBY_EVENT_C_CALL | RUBY_EVENT_C_RETURN |
                  RUBY_EVENT_LINE,
                  self);
```

where the hook has this signature:

```c
static void
prof_event_hook(rb_event_flag_t event, VALUE data, VALUE self,
                ID mid, VALUE klass);
```

This is similar to PyEval_SetTrace in Python, but more flexible: you can choose which events you want to be notified about (for example, "function calls only").
Disadvantages of tracing profilers
The main drawback of tracing profilers implemented this way is that they add a fixed overhead to every function call or line executed. That can lead you to wrong conclusions! For example, if you have two implementations of something, one with many function calls and one without, which normally take the same amount of time, the one with many function calls will look slower under the profiler.
To demonstrate, I created a small file called test.py and compared the run time of python test.py and python -mcProfile test.py. test.py finished in about 0.6 s, and python -mcProfile test.py in about 1 s. So for this particular example, cProfile added roughly 60% overhead.
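The listing of test.py did not survive translation, so here is a hypothetical reconstruction: a tight loop of cheap function calls, with the call count chosen to match the "3.5 million function calls" figure cited below. The exact body is my guess, not the original file:

```python
# test.py (hypothetical reconstruction; the original listing was lost,
# and the loop count is chosen to match the "3.5 million function
# calls" figure mentioned in the article)

def foo():
    pass  # a deliberately trivial function: all the cost is call overhead

def main():
    calls = 0
    for _ in range(3500000):
        foo()
        calls += 1
    return calls

if __name__ == '__main__':
    main()
```

Timing `python test.py` against `python -mcProfile test.py` on a file like this reproduces the kind of comparison described above.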
The cProfile documentation says:

"The interpreted nature of Python adds so much runtime overhead that deterministic profiling tends to add only a small processing overhead in typical applications."
This seems like a pretty reasonable statement: the previous example (which makes 3.5 million function calls and nothing else) is obviously not a typical Python program, and almost any other program will see less overhead.
I haven't measured the overhead of ruby-prof (the Ruby tracing profiler), but its README says the following:
Most programs will run approximately twice as slow, while highly recursive programs (such as the Fibonacci series test) will run three times slower.
How sampling profilers generally work: setitimer
Time to talk about the second kind of profiler: sampling profilers!
Most Ruby and Python sampling profilers are implemented on top of the setitimer system call. What is that?
Let’s say you want to take a snapshot of the program stack 50 times per second. This can be done as follows:
- Ask the Linux kernel to send you a signal every 20 milliseconds (using the setitimer system call);
- Register a signal handler that records a stack snapshot each time the signal arrives;
- When profiling is done, ask Linux to stop sending signals, and output a report!
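The three steps above can be sketched in pure Python with the signal module (a minimal wall-clock sampler for illustration, not any particular profiler's code):

```python
import signal
import time

samples = []

def handler(signum, frame):
    # Step 2: on each signal, record where the program currently is.
    samples.append((frame.f_code.co_filename, frame.f_lineno))

signal.signal(signal.SIGALRM, handler)
# Step 1: ask the kernel for a SIGALRM every 20 ms of wall-clock time.
signal.setitimer(signal.ITIMER_REAL, 0.02, 0.02)

deadline = time.monotonic() + 0.2
while time.monotonic() < deadline:
    pass  # the "work" being profiled

# Step 3: turn the timer off.
signal.setitimer(signal.ITIMER_REAL, 0, 0)
```

After 200 ms of busy work, samples holds roughly ten (file, line) snapshots, which is all a sampling profiler needs to aggregate into a report.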
If you want to see a practical use of setitimer to implement a sampling profiler, I think stacksampler.py is the best example: it's a useful, working profiler in about 100 lines of Python. And that's very cool!
The reason stacksampler.py fits in 100 lines of Python is that when you register a Python function as a signal handler, the function gets called with the current stack frame of your program. So stacksampler.py's signal handler is very simple:
```python
def _sample(self, signum, frame):
    stack = []
    while frame is not None:
        stack.append(self._format_frame(frame))
        frame = frame.f_back

    stack = ';'.join(reversed(stack))
    self._stack_counts[stack] += 1
```
It simply walks the stack up from the frame it was given and increments a counter for that particular stack. So simple! So cool!
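You can see the same "outermost;...;innermost" folded-stack format outside a signal handler by walking sys._getframe() by hand (illustrative only; the names here are mine, not stacksampler's):

```python
import collections
import sys

stack_counts = collections.Counter()

def record():
    frame = sys._getframe(1)  # start from record()'s caller
    stack = []
    while frame is not None:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    # Same semicolon-joined format that flamegraph tooling consumes.
    stack_counts[';'.join(reversed(stack))] += 1

def foo():
    record()

def bar():
    foo()

bar()
```

This records one count for a stack ending in "bar;foo", and a sampling profiler's report is just many such counts aggregated.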
Let's take a look at the other profilers that use setitimer and find where in their code they call it:

- perftools.rb (Ruby): in this patch, which seems to be applied when the gem is compiled (?)
An important thing about setitimer is that you have to decide how to count time. Do you want 20 ms of wall-clock time? 20 ms of user CPU time? 20 ms of user + system CPU time? If you look closely at the links above, you'll notice these profilers actually use setitimer in different ways: sometimes the behavior is configurable and sometimes it isn't. The setitimer man page is short and worth reading for all the possible configurations.
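As a quick reference, the three timer modes and the signal each one fires, as exposed by Python's signal module (the dictionary is mine, assembled for illustration):

```python
import signal

# Each itimer counts a different notion of time and fires a
# different signal when it expires:
TIMER_SIGNALS = {
    signal.ITIMER_REAL: signal.SIGALRM,       # wall-clock time
    signal.ITIMER_VIRTUAL: signal.SIGVTALRM,  # user CPU time only
    signal.ITIMER_PROF: signal.SIGPROF,       # user + system CPU time
}
```

So, for example, a profiler that shouldn't count time the process spends sleeping would use ITIMER_PROF and register its handler for SIGPROF rather than SIGALRM.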
One interesting drawback of setitimer-based profilers: the timers fire signals! Signals sometimes interrupt system calls! System calls sometimes take a few milliseconds! If you sample too often, you can make your program's system calls get interrupted over and over!
Sampling profilers that don’t use setitimer
There are several sampling profilers that do not use setitimer: pyinstrument, pyflame, and python-flamegraph.

pyinstrument uses PyEval_SetProfile (so it's technically a tracing profiler), but it doesn't collect a stack snapshot every time the trace callback fires. Here is the code that picks when to record a stack trace. There's more about this design decision in this blog post (in short: setitimer only lets you profile the main thread in Python).
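That hybrid approach, a tracing hook that only records a sample when enough time has passed, can be sketched in pure Python with sys.setprofile (a simplification of the idea, not pyinstrument's actual code):

```python
import sys
import time

INTERVAL = 0.001   # record at most about one sample per millisecond
samples = []
_last_sample = [0.0]

def profile_callback(frame, event, arg):
    # Called on every call event, but only records a stack occasionally,
    # so the output is a sample rather than a full trace.
    if event != 'call':
        return
    now = time.monotonic()
    if now - _last_sample[0] >= INTERVAL:
        _last_sample[0] = now
        samples.append(frame.f_code.co_name)

def noop():
    pass

sys.setprofile(profile_callback)
for _ in range(200000):
    noop()
sys.setprofile(None)
```

The callback fires 200,000 times, but only a few hundred samples at most get recorded, which keeps the report statistical while still piggybacking on the tracing hook.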
pyflame profiles Python code from outside the process, using the ptrace system call. It runs a loop: take a stack snapshot, sleep for a certain amount of time, repeat. Here's the sleep call.
python-flamegraph takes a similar approach: it starts a new thread inside your Python process, grabs stack traces, sleeps, and loops again. Here's the sleep call.
All three of these profilers take snapshots based on wall-clock time.
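python-flamegraph's snapshot-sleep-repeat loop in a background thread can be sketched like this (a simplified, hypothetical version; the real project also formats its output for flamegraph tooling):

```python
import collections
import sys
import threading
import time

stack_counts = collections.Counter()

def sampler(target_ident, stop_event, interval=0.005):
    # Snapshot the target thread's stack, sleep, repeat.
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_ident)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            stack_counts[';'.join(reversed(stack))] += 1
        time.sleep(interval)

def busy_work(seconds):
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass

stop = threading.Event()
t = threading.Thread(target=sampler,
                     args=(threading.main_thread().ident, stop))
t.start()
busy_work(0.1)
stop.set()
t.join()
```

Because the sampler sleeps between snapshots and samples wall-clock time, it behaves like the setitimer profilers but without signals, at the cost of needing sys._current_frames() to see other threads' stacks.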
Pyflame blog posts
I've spent almost all of this post on profilers other than pyflame, but pyflame actually interests me the most, because it profiles your Python program from a separate process, and that's how I want my Ruby profiler to work.
Of course, everything is a little more complicated than I described. I won’t go into details, but Evan Klitzke has written many good articles about this on his blog:
- Pyflame: Uber Engineering’s Ptracing Profiler for Python – introduction to pyflame;
- Pyflame Dual Interpreter Mode – how it supports both Python 2 and Python 3 at the same time;
- An Unexpected Python ABI Change – about adding support for Python 3.6;
- Dumping Multi-Threaded Python Stacks;
- Pyflame packages;
- About interesting problems with ptrace and system calls in Python;
- Using ptrace for fun and profit, ptrace (continued);
More can be found at eklitzke.org. This is all very interesting material that I'm going to read more carefully: perhaps ptrace will turn out to be better than process_vm_readv for implementing a Ruby profiler! process_vm_readv has less overhead, since it doesn't stop the process, but it can also give you inconsistent snapshots, because it doesn't stop the process :). In my experiments, inconsistent snapshots haven't been a big problem, but I plan to run a series of experiments here.
That’s all for today!
There are many important subtleties I didn't get into in this post. For example, I basically said that stacksampler and vmprof are similar (they are not: vmprof supports line profiling and profiling of Python functions written in C, which I believe makes the profiler more complex). But they share some of the same basic principles, so I think today's overview is a good starting point.