Why are comprehensions faster than loops?

If you write Python, you have probably heard that building a collection with a comprehension is usually faster than building the same collection with a regular for loop. I have been writing Python for a few years and, of course, I had heard about the performance of comprehensions too. It had always been an axiom for me, a truth requiring no verification. However, that is the wrong way to approach science and engineering, so I sat down to figure it out.

Performance

The first thing I decided to do was to check whether the various kinds of comprehensions really do build collections faster than loops with direct calls to the insertion method. To answer this, I needed to collect timing data for both approaches, fit trends where possible, and compare the results visually. Naturally, I started with data collection.

The following methodology was used to collect the running times of both approaches:

  • The following collections were considered: list, dict, set;

  • Collection sizes ranged from 10 elements up to 10^7 elements, with a step of 1000 elements. These values were chosen as a compromise between the number of elements, the total data-collection time, and the required amount of memory;

  • For each collection size in this range, the time to create a collection of that size was measured both with a loop plus the native insertion method and with the corresponding comprehension;

  • The creation time of each method was measured separately using the built-in Python time module;

  • All measurements were stored in a dedicated structure;

  • After all collection sizes in the range above had been processed, the measurements were saved to files in JSON format.
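The methodology above can be sketched roughly as follows. This is a minimal, single-process version for lists only; the function names and structure are mine, not the original harness:

```python
import json
import time


def time_list_loop(n: int) -> float:
    """Time building a list of n elements with a for loop and append."""
    start = time.perf_counter()
    lst = []
    for i in range(n):
        lst.append(i)
    return time.perf_counter() - start


def time_list_comp(n: int) -> float:
    """Time building the same list with a list comprehension."""
    start = time.perf_counter()
    lst = [i for i in range(n)]  # the result itself is discarded
    return time.perf_counter() - start


def collect(sizes) -> dict:
    """Measure both approaches for every size and gather the results."""
    results = {"loop": {}, "comprehension": {}}
    for n in sizes:
        results["loop"][n] = time_list_loop(n)
        results["comprehension"][n] = time_list_comp(n)
    return results


measurements = collect([10, 1_000, 100_000])
# The final step of the methodology: serialize the measurements as JSON.
payload = json.dumps(measurements)
```

The real harness additionally covers dict and set, runs the three collection types in separate processes, and writes each result set to its own file.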

To collect the timings, I implemented several helper functions that can be run simultaneously in three processes to speed up data accumulation. All the code I wrote to prepare the materials can be viewed here. It is worth mentioning that runtime measurements depend on the hardware and on the interpreter version used, so all figures below reflect my particular machine and Python build.

At the next stage, I visualized the results and tried to fit some trends. For lists, a linear dependence of creation time on the number of elements could be fitted by the least-squares method. It was also possible to compute an error corridor that contained a significant share of the experimental points.
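Such a linear fit can be reproduced with ordinary least squares. Here is a small pure-Python sketch (the function name and variables are mine):

```python
def least_squares_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums: variance of x and covariance of x and y (unnormalized).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For points lying exactly on the line y = 2x + 1, the fit recovers slope 2 and intercept 1; on the noisy timing data it yields the trend line drawn on the graph below.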

Dependence of list creation time on the required number of elements

For dictionaries and sets the graphs turned out to be very jagged, so I did not attempt to fit a trend for them. The graphs for sets and dictionaries are structurally similar to each other, which is explained by the similarity of these collections' implementations.

Experimental data on the creation times of dictionaries and sets

The result: creating collections with comprehensions is indeed faster than creating them with loops and native insertion methods.

The reason

So, once the question of which method is faster had been answered unequivocally – the comprehension – I faced the question of why. Why do comprehensions work faster? After all, at first glance a comprehension is just syntactic sugar.

To answer this question I used the standard library module dis. With it you can disassemble Python bytecode and analyze the actual sequence of actions the interpreter performs to execute a given code fragment. Let me note right away that disassembly reveals implementation details, so the output of the library's functions may differ between interpreter versions – sometimes greatly, up to the absence or presence of particular instructions. I am using Python 3.11.1, so if your version does not match mine, the output may not match either. Conceptually, however, nothing should change.

From here on I will disassemble only the code for creating lists. Disassembling the creation of the other collections leads to similar conclusions, so analyzing the bytecode for dictionaries and sets would be redundant. Let's start by disassembling the creation of a list with a for loop and an explicit call to append:

Disassembly code

import dis

loop = """\
lsp = []
for i in range(10):
    lsp.append(i)
"""

dis.dis(loop)

Output

  0           0 RESUME                   0

  1           2 BUILD_LIST               0
              4 STORE_NAME               0 (lsp)

  2           6 PUSH_NULL
              8 LOAD_NAME                1 (range)
             10 LOAD_CONST               0 (10)
             12 PRECALL                  1
             16 CALL                     1
             26 GET_ITER
        >>   28 FOR_ITER                23 (to 76)
             30 STORE_NAME               2 (i)

  3          32 LOAD_NAME                0 (lsp)
             34 LOAD_METHOD              3 (append)
             56 LOAD_NAME                2 (i)
             58 PRECALL                  1
             62 CALL                     1
             72 POP_TOP
             74 JUMP_BACKWARD           24 (to 28)

  2     >>   76 LOAD_CONST               1 (None)
             78 RETURN_VALUE

If you want to study what is happening in detail, you can read about all the commands above in the official documentation. I only want to analyze the sequence of commands executed on each iteration of the loop. This is the block of commands at addresses 30-74:

             30 STORE_NAME               2 (i)

  3          32 LOAD_NAME                0 (lsp)
             34 LOAD_METHOD              3 (append)
             56 LOAD_NAME                2 (i)
             58 PRECALL                  1
             62 CALL                     1
             72 POP_TOP
             74 JUMP_BACKWARD           24 (to 28)

So, on each iteration of the loop we bind the object produced by the iterator to the loop variable i. Then the interpreter loads the list lsp, looks up the append method on it, and pushes first the method append and then the object it is bound to onto the stack. After that, the loop variable is pushed on top. This order is dictated by the way functions are called, or more precisely by how the called object and its arguments must be laid out on the stack. Commands 58-62 prepare and perform the call to append. The call's result lands on top of the stack, so it is popped immediately afterwards. Finally, command 74 is a form of GOTO, i.e., we move on to the next iteration.

So, if we assume each operation costs 1 conventional unit, then one iteration of adding elements to a list with a for loop and append costs 8 conventional units.

Now let's look at list comprehension:

import dis

comp = "[i for i in range(10)]"
dis.dis(comp)

Output

  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (<code object <listcomp> at 0x0000021D9A765210, file "<dis>", line 1>)
              4 MAKE_FUNCTION            0
              6 PUSH_NULL
              8 LOAD_NAME                0 (range)
             10 LOAD_CONST               1 (10)
             12 PRECALL                  1
             16 CALL                     1
             26 GET_ITER
             28 PRECALL                  0
             32 CALL                     0
             42 RETURN_VALUE

Disassembly of <code object <listcomp> at 0x0000021D9A765210, file "<dis>", line 1>:
  1           0 RESUME                   0
              2 BUILD_LIST               0
              4 LOAD_FAST                0 (.0)
        >>    6 FOR_ITER                 4 (to 16)
              8 STORE_FAST               1 (i)
             10 LOAD_FAST                1 (i)
             12 LIST_APPEND              2
             14 JUMP_BACKWARD            5 (to 6)
        >>   16 RETURN_VALUE

Here the code is a little more complicated, because we are looking at the result of disassembling two objects: our code itself and a function called listcomp. What is listcomp? It is the list comprehension itself. That is, when we use list comprehension syntax, we actually call a special function. This leads to an interesting consequence: since listcomp is a function, we pay a constant cost for calling it. So on small amounts of data list comprehensions should lose to the for loop. I did not observe such behavior during data collection, but I did observe that at relatively small sizes (around 10^4) list comprehensions and for loops perform at roughly the same level. This can be verified by analyzing the collected data.
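That the comprehension lives in a separate code object can be checked directly. This sketch is mine; note that CPython 3.12+ inlines comprehensions (PEP 709), so the nested code object disappears there:

```python
import sys
import types

# Compile a list comprehension and inspect the constants of the resulting code.
code = compile("[i for i in range(10)]", "<demo>", "eval")
nested = [c.co_name for c in code.co_consts if isinstance(c, types.CodeType)]

if sys.version_info < (3, 12):
    # CPython <= 3.11: the comprehension is a nested <listcomp> code object
    # that gets wrapped in a function and called.
    print(nested)  # ['<listcomp>']
else:
    # CPython >= 3.12 (PEP 709): the comprehension is inlined, nothing nested.
    print(nested)  # []
```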

The code itself, responsible for creating and filling the list, is located here:

  1           0 RESUME                   0
              2 BUILD_LIST               0
              4 LOAD_FAST                0 (.0)
        >>    6 FOR_ITER                 4 (to 16)
              8 STORE_FAST               1 (i)
             10 LOAD_FAST                1 (i)
             12 LIST_APPEND              2
             14 JUMP_BACKWARD            5 (to 6)
        >>   16 RETURN_VALUE

Here, too, we will only consider the iteration code, i.e. the commands at addresses 8-14. First we bind the object produced by the iterator to the loop variable i, then we push the loop variable onto the stack. Next comes LIST_APPEND, a special form of append used to implement list comprehensions. The command at address 14 jumps to the next iteration.

So, taking the cost of each command as 1 conventional unit, one iteration of a list comprehension costs 4 conventional units. That is, an iteration of a list comprehension costs half as much as the analogous iteration when building a list with for and append.

In fact, I am being rather disingenuous when I say that all operations cost 1 conventional unit. As the graphs show (and as you can check in the code), the slope coefficients are related not by a factor of 2 but by a factor of about 1.5, i.e. slope_loop ≈ 1.5 × slope_comp. Still, the fact remains that comprehensions are faster, and the reason is that they need fewer operations per iteration.
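This is easy to reproduce with the standard timeit module. A quick check I added (not the article's original benchmark harness):

```python
import timeit

N = 100_000


def build_with_loop() -> list:
    """Build a list with a for loop and append."""
    lst = []
    for i in range(N):
        lst.append(i)
    return lst


def build_with_comprehension() -> list:
    """Build the same list with a comprehension."""
    return [i for i in range(N)]


# min() over several repeats filters out scheduler noise: we compare best times.
loop_time = min(timeit.repeat(build_with_loop, number=20, repeat=5))
comp_time = min(timeit.repeat(build_with_comprehension, number=20, repeat=5))

print(f"loop: {loop_time:.4f}s  comprehension: {comp_time:.4f}s  "
      f"ratio: {loop_time / comp_time:.2f}")
```

The exact ratio depends on hardware and interpreter version, but it should come out above 1, in the ballpark of the ~1.5 slope ratio above.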

P.S. I am by no means the ultimate authority and may have made mistakes or inaccuracies. If so, I would be glad to hear your corrections and additions.

P.P.S. I also invite you to my channel, where I write short notes about Python and development.
