Are Python Function Calls Still Slow? An Analysis of Recent CPython Optimizations

I came across a post on X/Twitter where Pritam found that his LeetCode solution was slower when he used the built-in min function; performance improved when he reimplemented min directly in his own Python code.

It's true that function calls can be expensive, and they're known to be even more expensive in interpreted languages like Python. The standard recommendation is to inline functions manually if calls are part of the bottleneck.

The author of that screenshot was using Python 2, which is ancient at this point. Python 3 has seen a lot of releases over the last 10 years, and the latest versions have focused on improving the language's performance. So do function calls really still have such a big impact on Python's performance?

I got curious and wrote micro-benchmarks to test three cases:

  • Effect of calling a built-in function in a loop

  • Effect of calling a Python function in a loop

  • Effect of inlining a function in a loop

As expected, the results show that the latest releases have significantly improved CPython's performance in all three cases.

In this article, I'm going to discuss the specific changes made to CPython that improve the interpreter's performance. I'll look at why older versions were slow and how the new features fix the situation. Let's dive into the details.

Benchmark 1: Testing the speed of executing simple instructions in a loop

Let's start with the first benchmark, where we perform simple calculations inside a loop, such as calculating the minimum of a list of values. The code is below. It uses a while loop instead of a for loop because the code in the Twitter post used a while loop, and I wanted to keep that.

def benchmark1(heights):
    a = 1
    b = len(heights) - 1
    min_height = heights[0]
    while a < b:
        if heights[a] < min_height:
            min_height = heights[a]
        a += 1

    return min_height
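For reference, this is the kind of harness used to time all three benchmarks (a minimal sketch; the input size and repetition count here are illustrative, not the exact setup behind the numbers below):

import time

heights = list(range(1_000_000))

start = time.perf_counter()
for _ in range(10):
    benchmark1(heights)
print(f"benchmark1: {time.perf_counter() - start:.2f}s")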

Below are the performance numbers for the first benchmark in recent versions of CPython:

The execution time of calculating the minimum of a list of values in a loop without calling any functions.

This benchmark simply measures the speed of performing simple calculations, in this case comparing two integers inside a loop. As we can see, the interpreter has been sped up significantly in the latest releases. Now let's discuss which internal optimizations are behind the speedup.

The emergence of superinstructions

One of the simple optimizations introduced in CPython is superinstructions. These are special bytecode instructions that are created by combining two consecutive instructions of certain types that often appear in pairs in programs. Let's see how this works in the context of this particular benchmark.

The image below shows the bytecode for this benchmark's loop body in Python 3.14.0a0 (left) and Python 3.10 (right). In the loop, the interpreter repeatedly needs to load the values heights[a] and min_height onto the stack before comparing them. To load these values onto the stack, the interpreter uses the LOAD_FAST instruction.

You can easily spot the difference in the bytecode between versions: in 3.10, the bytecode contains two consecutive LOAD_FAST instructions, whereas in 3.14 they are replaced by a single LOAD_FAST_LOAD_FAST instruction.
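You can check this yourself with the dis module. A minimal sketch; the exact output depends on the CPython version you run it under:

import dis

def compare(heights, a, min_height):
    return heights[a] < min_height

# On CPython 3.13+ the disassembly contains a fused LOAD_FAST_LOAD_FAST;
# on 3.10 you see two separate LOAD_FAST instructions instead.
dis.dis(compare)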

This is an example of a superinstruction. It is created by the compiler during an optimization pass, after the program's initial bytecode has been generated. Below is the code for this optimization in CPython, which was introduced in release 3.13.

static int
insert_superinstructions(cfg_builder *g)
{
    for (basicblock *b = g->g_entryblock; b != NULL; b = b->b_next) {

        for (int i = 0; i < b->b_iused; i++) {
            cfg_instr *inst = &b->b_instr[i];
            int nextop = i+1 < b->b_iused ? b->b_instr[i+1].i_opcode : 0;
            switch(inst->i_opcode) {
                case LOAD_FAST:
                    if (nextop == LOAD_FAST) {
                        make_super_instruction(inst, &b->b_instr[i + 1], LOAD_FAST_LOAD_FAST);
                    }
                    break;
                case STORE_FAST:
                    switch (nextop) {
                        case LOAD_FAST:
                            make_super_instruction(inst, &b->b_instr[i + 1], STORE_FAST_LOAD_FAST);
                            break;
                        case STORE_FAST:
                            make_super_instruction(inst, &b->b_instr[i + 1], STORE_FAST_STORE_FAST);
                            break;
                    }
                    break;
            }
        }
    }
    int res = remove_redundant_nops(g);
    assert(no_redundant_nops(g));
    return res;
}

Benefits of Superinstructions

The main benefit of this optimization is that it reduces the amount of work performed by the interpreter. Interpreting each instruction requires fetching the next opcode, decoding it, and jumping to the code that implements that bytecode instruction. The cost of this is small, but if the code is executed frequently, the effect can be significant.

It also helps the processor run the interpretation loop more efficiently: fewer instructions means fewer indirect branches, which in turn leads to better cache locality and a more effective branch predictor, since branch-target entries are freed up for other code.

Moreover, the implementation of LOAD_FAST_LOAD_FAST in the interpreter lets the processor execute it faster, since it consists of several independent operations that can run in parallel.

Implementation of the LOAD_FAST_LOAD_FAST instruction in CPython. Loading two values onto the stack at the same time allows the processor to execute these instructions in parallel thanks to ILP.
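Since the figure may be easier to grasp as code, here is a toy Python version of what the handler does (localsplus and stack are stand-in names; the real handler is C code in the interpreter loop):

def load_fast_load_fast(localsplus, stack, oparg):
    # Both local-variable slot indices are packed into one operand byte.
    value1 = localsplus[oparg >> 4]
    value2 = localsplus[oparg & 15]
    # The two loads are independent, so the CPU can overlap them (ILP),
    # and the interpreter pays the fetch/decode/dispatch cost only once.
    stack.append(value1)
    stack.append(value2)

stack = []
load_fast_load_fast(["spam", "eggs"], stack, (0 << 4) | 1)
print(stack)  # ['spam', 'eggs']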

Bytecode instruction specialization

The second optimization that significantly improves performance on this benchmark is instruction specialization, introduced with the specializing adaptive interpreter (PEP 659) in CPython 3.11 and extended in later releases.

If you look again at the bytecode from the previous section, you will notice that the interpreter has to execute COMPARE_OP and BINARY_OP many times to perform the comparison and increment operations inside the loop.

These instructions are relatively expensive to execute because they involve dynamic dispatch. I looked at what exactly is going on under the hood in the article “How Many Lines of C it Takes to Execute a + b in Python?” Below is a brief summary.

When the interpreter needs to process instructions such as BINARY_OP or COMPARE_OP, it gets their operands from the stack. The interpreter doesn't know the concrete types of these operand objects, whether they're integers, strings, floating-point numbers, or anything else, and therefore doesn't know upfront how to perform the operation. It figures that out by looking up a function pointer inside the operand objects' type, which requires a chain of pointer dereferences:

  • First, the interpreter needs to dereference the operand object.

  • Then it must dereference the ob_type pointer to reach the operand's PyTypeObject, which holds the tables of function pointers.

  • Next, the interpreter dereferences the function-pointer table and finds the right function pointer.

  • Finally, the function pointer itself must be dereferenced to call the function.

The following illustration shows this pointer-chasing process.

The number of abstraction levels involved in executing a binary or comparison operation in the bytecode interpreter.
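In Python terms, the chain for evaluating a < b looks roughly like this (a simplified model; CPython actually does this in C through the type's tp_richcompare slot):

def compare_op_lt(left, right):
    left_type = type(left)      # dereference ob_type
    slot = left_type.__lt__     # find the comparison slot in the type's table
    return slot(left, right)    # call through the "function pointer"

print(compare_op_lt(3, 7))  # True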

These levels of abstraction hurt processor performance because all of these pointer dereferences are dependent memory accesses: the processor must wait for the first access to complete before it can start the next one. This reduces instruction throughput, and if any of these accesses misses the cache, it can cause a delay of hundreds of cycles until the data is fetched from main memory.

But thanks to specialization, the slow BINARY_OP and COMPARE_OP instructions are transformed into specialized ones, for example BINARY_OP_ADD_INT, where the operation is performed directly in the interpreter without chasing any function pointers.
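You can watch specialization happen with dis. A minimal sketch, assuming CPython 3.11 or newer (the adaptive parameter doesn't exist in older versions, and the specialized instruction names vary by release):

import dis

def loop_min(heights):
    a = 1
    min_height = heights[0]
    while a < len(heights):
        if heights[a] < min_height:
            min_height = heights[a]
        a += 1
    return min_height

# Warm the function up so the adaptive interpreter can specialize it,
for _ in range(100):
    loop_min([3, 1, 4, 1, 5])

# then dump the adaptive bytecode; the generic COMPARE_OP typically
# shows up as a specialized form such as COMPARE_OP_INT.
dis.dis(loop_min, adaptive=True)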

Benchmark 2: Testing execution speed using built-in function

We will slightly modify the previous benchmark. Instead of calculating the minimum ourselves, we will use the built-in min function. The benchmark code is given below:

def benchmark2(heights):
    a = 1
    b = len(heights) - 1
    min_height = heights[0]
    while a < b:
        min_height = min(heights[a], min_height)
        a += 1

    return min_height

This benchmark measures the cost associated with calling a built-in function. The numbers below show how CPython's performance has improved in newer versions.

There are two changes that reduce the execution time from 17.33 seconds in Python 3.10 to 6.7 seconds in Python 3.14.0a0. Let's discuss them.

Accelerating loading of global objects

Let's look at the bytecode of this benchmark.

Bytecode of the second benchmark in CPython 3.14.0a0.

When this code is executed, the interpreter needs to load the built-in function min() onto the stack. To do this, it uses the LOAD_GLOBAL instruction.

The LOAD_GLOBAL instruction has to search for the named global object in two dictionaries: the first contains all global variables in the current scope, and the second contains all the built-ins.
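Conceptually, the lookup works like this (a simplified Python model of the C code in the interpreter):

import builtins

def load_global(name, module_globals):
    # Dictionary lookup #1: the module's global namespace.
    if name in module_globals:
        return module_globals[name]
    # Dictionary lookup #2: the built-ins namespace.
    return vars(builtins)[name]

print(load_global("min", globals()))  # <built-in function min>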

Dictionary lookups are fast, but not free. Thanks to instruction specialization, the interpreter optimizes this into a LOAD_GLOBAL_BUILTIN instruction.

This specialized instruction caches the index of the object in the dictionary of built-ins, which lets the interpreter skip the whole lookup and simply return the object at the cached index. The following illustration shows how the interpreter implements LOAD_GLOBAL_BUILTIN.

Implementation of LOAD_GLOBAL_BUILTIN in the CPython bytecode interpreter.

Optimization of built-in min and max functions

The specialization of LOAD_GLOBAL into LOAD_GLOBAL_BUILTIN is not the main reason for the impressive performance of recent Python versions on this benchmark. The real reason is a set of specific optimizations applied to the built-in functions min and max.

The interpreter has two conventions for calling functions: the old one, tp_call, and the newer one, vectorcall.

With tp_call, intermediate tuples and dictionaries are created to pass the function's arguments, which can in turn require creating other intermediate objects (this is described in more detail in PEP 590). With vectorcall, the arguments are passed as part of a vector, eliminating the need to create these intermediate objects.
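Here is a toy contrast between the two conventions (hypothetical helper names; the real protocols are C-level APIs defined by PEP 590):

def call_via_tp_call(func, args, kwargs):
    # tp_call: positional args arrive as a freshly built tuple and keyword
    # args as a freshly built dict, allocated and freed on every call.
    return func(*tuple(args), **dict(kwargs))

def call_via_vectorcall(func, arg_vector, nargs):
    # vectorcall: the callee gets a contiguous slice of an existing
    # argument vector plus a count; no intermediate tuple or dict is
    # built (pure Python can't fully avoid it, so this is only a model).
    return func(*arg_vector[:nargs])

print(call_via_tp_call(min, (3, 1), {}))        # 1
print(call_via_vectorcall(min, [3, 1, 99], 2))  # 1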

Before CPython 3.13, the built-in min and max were called via tp_call, so frequent calls led to the allocation and release of many intermediate objects. The switch to vectorcall improved the performance of these built-ins by up to 200%, and our benchmark shows an improvement of over 150%.

For more context, you may want to check out the PR for this change.

Benchmark 3: Testing execution speed using Python function

Finally, let's discuss the changes that resulted in the performance improvement in the third benchmark, which implements min as a Python function and calls it inside a loop. The code is shown below.

def pymin(a, b):
    if a <= b:
        return a
    return b

def benchmark3(heights):
    a = 1
    b = len(heights) - 1
    min_height = heights[0]
    while a < b:
        min_height = pymin(heights[a], min_height)
        a += 1

    return min_height

Comparison of the results of this benchmark for the latest CPython releases:

According to the data in the table, performance improved significantly from 3.10 to 3.12, and then improved slightly more in 3.14.0a0.

This benchmark essentially measures the cost of executing a Python-to-Python function call (since both the calling and called functions are implemented in Python).

Before 3.11, handling Python-to-Python function calls in the interpreter was complex and expensive, as it involved recursively calling the interpreter itself for each such call. This recursion was eliminated in CPython 3.11, resulting in significant performance gains. Let's look at this in more detail.

Inlining Python-to-Python Function Calls

The interpreter starts executing a Python program from the main function. First, the stack frame of the main function is prepared, after which the interpreter is invoked. The entry point to the interpreter is the function _PyEval_EvalFrameDefault, defined in ceval.c.

What is a stack frame? Each function in your code requires a corresponding frame to execute. It contains the local and global variables of that function, the compiled bytecode, an instruction pointer, and other data that helps the interpreter execute the code. For more information, you can watch a recording of my lecture on the internals of the CPython virtual machine.

The function _PyEval_EvalFrameDefault contains a huge switch statement that handles every bytecode instruction the interpreter supports. It iterates over the instructions of the frame it receives and performs the corresponding operations.

An example of a bytecode processing loop in a simple Python virtual machine. The interpreter goes through each bytecode instruction and processes it based on the opcode. The CPython virtual machine is implemented in C; this is just for illustration.
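A sketch of such a loop in runnable Python, in the spirit of the figure; the instruction set and Frame layout here are invented for illustration:

from dataclasses import dataclass

@dataclass
class Frame:
    bytecode: list  # list of (opcode, oparg) pairs
    locals: list    # local variable slots

def eval_frame(frame):
    stack = []
    pc = 0
    while pc < len(frame.bytecode):
        opcode, oparg = frame.bytecode[pc]
        # Dispatch on the opcode, just like the big switch statement in C.
        if opcode == "LOAD_FAST":
            stack.append(frame.locals[oparg])
        elif opcode == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "RETURN_VALUE":
            return stack.pop()
        pc += 1

# Computes locals[0] + locals[1]: prints 5.
print(eval_frame(Frame(
    bytecode=[("LOAD_FAST", 0), ("LOAD_FAST", 1),
              ("BINARY_ADD", 0), ("RETURN_VALUE", 0)],
    locals=[2, 3],
)))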

When you call another function in your Python code, the compiler emits a CALL bytecode instruction. This is where things get interesting.

In CPython 3.10 and earlier, the call instruction (CALL_FUNCTION back then) created a new stack frame for the called function and then recursively re-entered the interpreter through its entry point, _PyEval_EvalFrameDefault.

This hurt performance in several ways. Recursively invoking the interpreter required saving the current function's registers and creating a new C stack frame. It also increased memory usage, since each recursive interpreter invocation allocated its own local variables on the C stack and performed further allocations on the heap. Finally, it led to poor instruction-cache locality due to the constant jumping in and out of the bytecode evaluation loop.

In release 3.11 this was fixed by eliminating the recursive interpreter call. Now the CALL instruction simply creates a stack frame for the called function and immediately begins executing the new function's bytecode, without ever leaving the evaluation loop.
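In terms of the toy VM sketched earlier, the change looks roughly like this (make_frame and frame_stack are invented names; this is a model, not CPython's code):

def handle_call_pre_311(callee, args):
    frame = make_frame(callee, args)  # hypothetical helper
    # Old behaviour: recursively re-enter the evaluation loop, paying
    # for a fresh C stack frame on every Python-to-Python call.
    return eval_frame(frame)

def handle_call_since_311(frame_stack, callee, args):
    # New behaviour: push the new frame and stay inside the same loop;
    # the next dispatch iteration runs the callee's bytecode directly.
    frame_stack.append(make_frame(callee, args))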

You can read the discussion of this change on the CPython bug tracker.

Specialization of the CALL instruction

While the main performance improvement in this benchmark comes from the call inlining described above, there is another small improvement related to executing function calls in the interpreter: specialization of the CALL instruction.

The CALL instruction is generic: it executes all types of callable objects. When processing it, the interpreter needs to check the type of the callee, for example whether it is a class, an instance method, a Python function, or something else, and dispatch to it correctly based on that.

Specializing this instruction spares the interpreter all of this extra work, which can improve performance in tight loops.
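As with the other specializations, you can peek at this with dis on CPython 3.11 or newer (a sketch; the specialized name varies by release and isn't guaranteed):

import dis

def pymin(a, b):
    return a if a <= b else b

def run():
    return pymin(1, 2)

# Warm up so the adaptive interpreter specializes the call site,
for _ in range(100):
    run()

# then inspect it; the generic CALL may appear as a specialized
# form such as CALL_PY_EXACT_ARGS.
dis.dis(run, adaptive=True)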

Conclusion

We took a detailed look at Python performance, focusing on the cost of built-in function calls, Python function calls, and manual code inlining. Our benchmarks showed significant performance improvements from Python 3.10 to Python 3.14.0a0. Here's a quick overview of the reasons for these improvements:

  • Superinstructions: By combining consecutive bytecode instructions into single superinstructions such as LOAD_FAST_LOAD_FAST, CPython reduces the per-instruction overhead of the evaluation loop. This improvement helps both the interpreter and the processor work more efficiently.

  • Bytecode instruction specialization: New specialized bytecode instructions (e.g. BINARY_OP_ADD_INT) remove the need for slow dynamic dispatch, speeding up common operations.

  • Optimization of built-in functions: The transition from the old tp_call convention to the faster vectorcall significantly improved the performance of the built-in min and max functions.

  • Inlining Python-to-Python function calls: By eliminating the old way of handling Python-to-Python function calls (which involved expensive recursive interpreter invocations), CPython 3.11 and later versions made these calls much faster.

Overall, these changes show consistent improvements in Python's speed and efficiency. However, before applying the lessons from this article, remember to profile your code first to find its slowest parts (Amdahl's law).
