Python code generation – speeding up strftime

Hello! In the first and second parts I described the history behind convtools, a Python library that lets you declaratively describe data transformations and then generates Python functions that implement them. This time I will talk about speeding up special cases of datetime.strptime and datetime.strftime, as well as about some interesting things I came across in the datetime module along the way.

strftime: datetime/date -> str

To begin with, let's measure the baseline way of formatting a date or datetime:

from datetime import datetime

dt = datetime(2023, 8, 1)
assert dt.strftime("%b %Y") == "Aug 2023"

# In [2]: %timeit dt.strftime("%b %Y")
# 1.21 µs ± 4.02 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Looking at the code above and at the strftime sources, you can spot the following problems:

  • on every call, strftime parses the date format string from scratch, without reusing any intermediate results from previous calls

  • the date is first converted to a timetuple, which time.strftime can then work with. But since a timetuple contains every component of the date and time, the interpreter does extra work to build it: it reads the hours, minutes and seconds and computes the day of the week and the day of the year, none of which interest us in this particular case.
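To make the second point concrete, here is a minimal sketch of roughly what happens on each call. It is a simplification: CPython also pre-processes a few codes such as %f and %z before delegating, so treat this as an illustration rather than the actual implementation.

```python
import time
from datetime import datetime

dt = datetime(2023, 8, 1)

# Step 1: build a full 9-field struct_time, even though "%b %Y"
# only needs the month and the year.
tt = dt.timetuple()

# Step 2: hand the format string to time.strftime, which parses it
# again on every call.
via_time_module = time.strftime("%b %Y", tt)

# Both routes go through the same C-level formatting, so they agree.
same_result = via_time_module == dt.strftime("%b %Y")
```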

Now let's check how fast the most narrowly specialized function implementing almost the same formatting could be ("almost" because it ignores locale-dependent month names):

from datetime import datetime

MONTH_NAMES = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def ad_hoc_func(dt):
    return f"{MONTH_NAMES[dt.month - 1]} {dt.year:04}"

dt = datetime(2023, 8, 1)
assert ad_hoc_func(dt) == "Aug 2023"

# In [11]: %timeit ad_hoc_func(dt)
# 258 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

That is a speedup of about 4.7x, and it gives us the problem statement for convtools: be able to dynamically generate highly specialized converters for a given date format.
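The general idea of generating such a function from a format string can be sketched in a few lines. This is not how convtools does it internally; make_formatter and _FRAGMENTS are hypothetical names invented for this illustration, and only a handful of format codes are supported.

```python
from datetime import datetime

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical mapping from format codes to f-string fragments.
_FRAGMENTS = {
    "%b": "{MONTH_NAMES[dt.month - 1]}",
    "%Y": "{dt.year:04}",
    "%m": "{dt.month:02}",
    "%d": "{dt.day:02}",
    "%%": "%",
}


def make_formatter(fmt):
    """Parse fmt once and return a function specialized for it."""
    parts = []
    i = 0
    while i < len(fmt):
        code = fmt[i:i + 2]
        if code in _FRAGMENTS:
            parts.append(_FRAGMENTS[code])
            i += 2
        elif fmt[i] == "%":
            raise ValueError(f"unsupported format code: {code!r}")
        else:
            # Literal character; braces must be doubled inside an f-string.
            parts.append(fmt[i].replace("{", "{{").replace("}", "}}"))
            i += 1
    body = "".join(parts)
    # The format string is baked into the source as an f-string literal.
    source = "def formatter(dt):\n    return f" + repr(body) + "\n"
    namespace = {"MONTH_NAMES": MONTH_NAMES}
    exec(source, namespace)
    return namespace["formatter"]


month_year = make_formatter("%b %Y")
iso_date = make_formatter("%Y-%m-%d")
```

The format string is parsed exactly once, in make_formatter; the returned function only does attribute lookups and string formatting.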

Before we examine the result, I will make some remarks:

  • dt.strftime("%Y-%m-%d") is not optimal. It is better to use dt.date().isoformat() for a datetime and dt.isoformat() for a date (speedups of 5.5x and 6.4x respectively); the implementation takes this into account

  • dt.strftime("%Y"): the documentation makes no caveats about how the year is formatted (at least for Python 3.4+), but the CPython bug tracker does (#57514): on Linux with glibc the year is not zero-padded, while on macOS and on Linux with musl it is. dt.strftime("%4Y") fixes this on glibc but breaks on the other platforms; the implementation takes this into account

  • many format codes, such as %a, %b, %c and %p, depend on the system locale (for example, Sunday, Monday, …, Saturday for en_US and Sonntag, Montag, …, Samstag for de_DE). Only some of these codes are implemented; when an unsupported one is encountered, the converter falls back to the built-in strftime.
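The first two remarks are easy to check by hand. The snippet below is a small sketch using only the standard library; it deliberately avoids calling strftime("%Y") on an early year, since that result is the platform-dependent part.

```python
from datetime import date, datetime

dt = datetime(2023, 8, 1, 12, 30)
d = date(2023, 8, 1)

# isoformat() yields the same text as "%Y-%m-%d" without parsing a format.
iso_from_datetime = dt.date().isoformat()
iso_from_date = d.isoformat()

# Years below 1000: isoformat() and an explicit ":04" format spec always
# zero-pad, regardless of the C library underneath.
early = date(99, 1, 1)
early_iso = early.isoformat()
early_year = f"{early.year:04}"
```

With that out of the way, here is the convtools result: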

from convtools import conversion as c

ad_hoc_func = c.format_dt("%b %Y").gen_converter()
assert ad_hoc_func(dt) == "Aug 2023"

# In [32]: %timeit ad_hoc_func(dt)
# 274 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

We lost a little along the way, but this is still a 4.4x speedup over the baseline. To see the code generated under the hood, just pass debug=True (the try-except wrapper dumps the generated code to a temporary location when an error occurs, for the sake of readable tracebacks and normal debugging):

In [34]: c.format_dt("%b %Y").gen_converter(debug=True)
def converter(data_, *, __v=__naive_values__["__v"], __datetime=__naive_values__["__datetime"]):
    try:
        return f"{__v[data_.month - 1]} {data_.year:04}"
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise

Out[34]: <function _convtools.converter(data_, *, __v=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], __datetime=<class 'datetime.datetime'>)>

strptime: str -> datetime

Let's repeat the steps above for datetime.strptime, cutting a few corners. Looking ahead, note that we will not optimize format codes that depend on the locale.

from datetime import datetime

assert datetime.strptime("12/31/2020 12:05:54 PM", "%m/%d/%Y %I:%M:%S %p") == datetime(2020, 12, 31, 12, 5, 54)

# In [37]: %timeit datetime.strptime("Aug 2023", "%b %Y")
# 2.93 µs ± 3.51 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Examining the sources, we find a cache of compiled regular expressions and locks guarding access to it. In general everything is reasonable, but it still cannot compete with highly specialized code.
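As with strftime, it helps to see what a hand-written specialized parser for one fixed format might look like. This is a minimal sketch, not convtools code: parse_us_datetime and _PATTERN are names made up for the example, and the regular expression is looser than the one strptime builds.

```python
import re
from datetime import datetime

# Compiled once at import time; no cache lookup or lock per call.
_PATTERN = re.compile(
    r"(\d{1,2})/(\d{1,2})/(\d{4}) (\d{1,2}):(\d{2}):(\d{2}) (am|pm)",
    re.IGNORECASE,
)


def parse_us_datetime(value):
    """Parse strings shaped like '%m/%d/%Y %I:%M:%S %p' only."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"time data {value!r} does not match the expected format")
    month, day, year, hour, minute, second, ampm = match.groups()
    # 12-hour clock: 12 AM is hour 0 and 12 PM is hour 12.
    hour_24 = int(hour) % 12 + (12 if ampm.lower() == "pm" else 0)
    return datetime(int(year), int(month), int(day), hour_24, int(minute), int(second))


noon = parse_us_datetime("12/31/2020 12:05:54 PM")
midnight = parse_us_datetime("01/02/2021 12:00:00 am")
```

Range checks are left to the datetime constructor, which raises ValueError for values such as month 13. The convtools version for the same format: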

from datetime import datetime
from convtools import conversion as c

ad_hoc_func = c.datetime_parse("%m/%d/%Y %I:%M:%S %p").gen_converter()
assert ad_hoc_func("12/31/2020 12:05:54 PM") == datetime(2020, 12, 31, 12, 5, 54)

# In [44]: %timeit ad_hoc_func("12/31/2020 12:05:54 PM")
# 1.29 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

That is a 2.3x speedup, which is tangible. Let's look at the code generated under the hood:

In [46]: c.datetime_parse("%m/%d/%Y %I:%M:%S %p").gen_converter(debug=True)
def converter(data_, *, __v=__naive_values__["__v"], __datetime=__naive_values__["__datetime"]):
    try:
        match = __v.match(data_)
        if not match:
            raise ValueError("time data %r does not match format %r" % (data_, """%m/%d/%Y %I:%M:%S %p"""))
        if len(data_) != match.end():
            raise ValueError("unconverted data remains: %s" % data_string[match.end() :])
        groups_ = match.groups()
        i_hour = int(groups_[3])
        ampm_h_delay = 12 if groups_[6].lower() == """pm""" else 0
        return __datetime(int(groups_[2]), int(groups_[0]), int(groups_[1]), i_hour % 12 + ampm_h_delay, int(groups_[4]), int(groups_[5]), 0)
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise

Out[46]: <function _convtools.converter(data_, *, __v=re.compile('(1[0-2]|0[1-9]|[1-9])/(3[0-1]|[1-2]\\d|0[1-9]|[1-9]| [1-9])/(\\d{4})\\ (1[0-2]|0[1-9]|[1-9]):([0-5]\\d|\\d):(6[0-1]|[0-5]\\d|\\d)\\ (am|pm)', re.IGNORECASE), __datetime=<class 'datetime.datetime'>)>

Price

Everything has a price; in the case of convtools it is the time spent on code generation and compilation of the converters:

In [47]: %timeit c.format_dt("%b %Y").gen_converter()
54.8 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [48]: %timeit c.datetime_parse("%m/%d/%Y %I:%M:%S %p").gen_converter()
99.7 µs ± 67.3 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Therefore, before using this functionality, check whether your case falls under one of the following:

  1. the date format is static (known when the code is written), so you can call gen_converter once somewhere at module level and then reuse the result

  2. the date format is dynamic, but the generated converter will be used to process, say, at least a thousand dates.
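A rough break-even estimate can be derived from the timings quoted in this post. The sketch below takes those measurements at face value; break_even_calls is a helper written for this illustration, and real numbers will differ by machine and Python version.

```python
import math


def break_even_calls(generation_us, baseline_us, specialized_us):
    """Calls needed before the one-time generation cost is paid back."""
    saved_per_call_us = baseline_us - specialized_us
    return math.ceil(generation_us / saved_per_call_us)


# Timings in microseconds, taken from the measurements above.
format_calls = break_even_calls(54.8, 1.21, 0.274)  # c.format_dt("%b %Y")
parse_calls = break_even_calls(99.7, 2.93, 1.29)    # c.datetime_parse(...)
```

With these inputs the break-even point is 59 calls for formatting and 61 for parsing, so the threshold of about a thousand dates leaves a comfortable margin.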

Conclusion

The functionality above is far from the only thing convtools offers. For more details, please see the links below:

I would be grateful for feedback and ideas in the discussions on GitHub.
