Python code generation – speeding up strftime
Hello! In the first and second parts, I shared the history of creating the Python library convtools (in short: it lets you declaratively describe data transformations, and generates Python functions that implement them). Now I will talk about speeding up special cases of datetime.strptime and datetime.strftime, as well as about some interesting things I ran into in the datetime module along the way.
strftime: datetime/date -> str
To begin with, let's measure the baseline way of formatting a date or a datetime:
from datetime import datetime
dt = datetime(2023, 8, 1)
assert dt.strftime("%b %Y") == "Aug 2023"
# In [2]: %timeit dt.strftime("%b %Y")
# 1.21 µs ± 4.02 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Looking at the code above, and at the sources of strftime, you can spot the following problems:
- on every call, strftime parses the date format string from scratch (without reusing any intermediate results from previous calls)
- the date is first converted to a timetuple, which time.strftime can then work with. But since a timetuple contains all the components of a date and time, the interpreter does extra work to build it, querying hours, minutes, seconds and microseconds, which do not interest us at all in this particular case.
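To make the second point concrete, here is a small stdlib-only illustration of the timetuple route. It is a rough picture of the idea, not the actual CPython code path, and the expected values assume a naive datetime.

```python
import time
from datetime import datetime

dt = datetime(2023, 8, 1)

# timetuple() builds a full struct_time, even though "%b %Y" only needs
# the month and the year.
tt = dt.timetuple()
print(tt)

# The same format string applied to the struct_time gives the same text
# as dt.strftime (both use the current locale).
via_timetuple = time.strftime("%b %Y", tt)
print(via_timetuple == dt.strftime("%b %Y"))
```

The struct_time holds nine fields (year, month, day, hour, minute, second, weekday, day of year, DST flag), all of which are computed before formatting starts.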
Now let's check how fast the most narrowly specialized function implementing almost the same formatting could be (almost, because it ignores the fact that month names differ by locale):
from datetime import datetime
MONTH_NAMES = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
def ad_hoc_func(dt):
    return f"{MONTH_NAMES[dt.month - 1]} {dt.year:04}"
dt = datetime(2023, 8, 1)
assert ad_hoc_func(dt) == "Aug 2023"
# In [11]: %timeit ad_hoc_func(dt)
# 258 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
We got a speedup of about 4.7x, and a problem statement for convtools: be able to dynamically generate highly specialized converters for a given date format.
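A toy version of such generation could look like the sketch below. It is not how convtools implements it: gen_formatter and CODE_TO_FRAGMENT are hypothetical names, only a handful of format codes are covered, and English month names are hardcoded. The point is only to show the technique: parse the format once, build the source of an f-string based function, and compile it.

```python
from datetime import datetime

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical table: format code -> f-string fragment (locale is ignored).
CODE_TO_FRAGMENT = {
    "Y": "{dt.year:04}",
    "m": "{dt.month:02}",
    "d": "{dt.day:02}",
    "H": "{dt.hour:02}",
    "M": "{dt.minute:02}",
    "S": "{dt.second:02}",
    "b": "{MONTH_NAMES[dt.month - 1]}",
    "%": "%",
}


def gen_formatter(fmt):
    """Parse fmt once and compile a function specialized for it."""
    fragments = []
    i = 0
    while i < len(fmt):
        if fmt[i] == "%":
            code = fmt[i + 1 : i + 2]  # empty string if "%" is the last char
            if code not in CODE_TO_FRAGMENT:
                # Unsupported code: fall back to the built-in strftime.
                return lambda dt: dt.strftime(fmt)
            fragments.append(CODE_TO_FRAGMENT[code])
            i += 2
        else:
            # Literal text: double the braces so the f-string keeps them.
            fragments.append(fmt[i].replace("{", "{{").replace("}", "}}"))
            i += 1
    body = "".join(fragments)
    # repr() quotes and escapes the body; the "f" prefix makes it an f-string.
    source = "def formatter(dt):\n    return f" + repr(body) + "\n"
    namespace = {"MONTH_NAMES": MONTH_NAMES}
    exec(source, namespace)
    return namespace["formatter"]


formatter = gen_formatter("%b %Y")
print(formatter(datetime(2023, 8, 1)))
```

The generated function body for "%b %Y" is essentially the same f-string as in ad_hoc_func above, so the per-call cost should be in the same range, while the parsing cost is paid once in gen_formatter.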
Before we look at the result, a few remarks:
- dt.strftime("%Y-%m-%d") is not optimal. It is better to use dt.date().isoformat() for datetime and dt.isoformat() for date (5.5x and 6.4x faster respectively). This is taken into account in the implementation.
- dt.strftime("%Y"): the documentation makes no caveats about how the year is formatted (at least for Python 3.4+), but the CPython bug tracker does (#57514): on Linux with glibc the year is not zero-padded (on macOS and on Linux with musl it is). This can be cured with dt.strftime("%4Y"), but that breaks on the other platforms. This is taken into account in the implementation.
- many format codes, such as %a, %b, %c, %p, depend on the locale set on the system (for example: Sunday, Monday, ..., Saturday for en_US and Sonntag, Montag, ..., Samstag for de_DE). Only some of these codes are implemented; when an unsupported one is encountered, the built-in strftime is used.
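The first two remarks can be checked with the standard library alone. The snippet below is a small illustration, not part of convtools. The output of the last print depends on the platform's C library and the Python version, so it is deliberately not asserted.

```python
from datetime import date, datetime

dt = datetime(2023, 8, 1, 12, 30)
d = date(2023, 8, 1)

# isoformat() gives the same text as strftime("%Y-%m-%d").
iso_from_datetime = dt.date().isoformat()
iso_from_date = d.isoformat()
print(iso_from_datetime == dt.strftime("%Y-%m-%d"))
print(iso_from_date == d.strftime("%Y-%m-%d"))

# For years below 1000, explicit padding in an f-string is
# platform-independent, unlike strftime("%Y").
early = date(999, 1, 2)
padded_year = f"{early.year:04}"
print(padded_year)

# May print '999' or '0999' depending on the platform and Python version.
print(early.strftime("%Y"))
```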
from convtools import conversion as c
ad_hoc_func = c.format_dt("%b %Y").gen_converter()
assert ad_hoc_func(dt) == "Aug 2023"
# In [32]: %timeit ad_hoc_func(dt)
# 274 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
We lost a little along the way, but it is still a 4.4x speedup over the baseline. To see what code was generated under the hood, just pass debug=True (the try/except wrapper dumps the generated code to tmp in case of an error, for the sake of readable tracebacks and normal debugging):
In [34]: c.format_dt("%b %Y").gen_converter(debug=True)
def converter(data_, *, __v=__naive_values__["__v"], __datetime=__naive_values__["__datetime"]):
    try:
        return f"{__v[data_.month - 1]} {data_.year:04}"
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise
Out[34]: <function _convtools.converter(data_, *, __v=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], __datetime=<class 'datetime.datetime'>)>
strptime: str -> datetime
Let's repeat the steps above for datetime.strptime, but cutting corners a bit. Looking ahead, I note that we will not optimize format codes that depend on the locale.
from datetime import datetime
assert datetime.strptime("12/31/2020 12:05:54 PM", "%m/%d/%Y %I:%M:%S %p") == datetime(2020, 12, 31, 12, 5, 54)
# In [37]: %timeit datetime.strptime("Aug 2023", "%b %Y")
# 2.93 µs ± 3.51 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
After examining the sources, we find a cache for compiled regular expressions and locks guarding access to it. In general, everything is fine, but it still cannot compete with highly specialized code.
from datetime import datetime
from convtools import conversion as c
ad_hoc_func = c.datetime_parse("%m/%d/%Y %I:%M:%S %p").gen_converter()
assert ad_hoc_func("12/31/2020 12:05:54 PM") == datetime(2020, 12, 31, 12, 5, 54)
# In [44]: %timeit ad_hoc_func("12/31/2020 12:05:54 PM")
# 1.29 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
That is a 2.3x speedup, which is tangible. Let's look at the code generated under the hood:
In [46]: c.datetime_parse("%m/%d/%Y %I:%M:%S %p").gen_converter(debug=True)
def converter(data_, *, __v=__naive_values__["__v"], __datetime=__naive_values__["__datetime"]):
    try:
        match = __v.match(data_)
        if not match:
            raise ValueError("time data %r does not match format %r" % (data_, """%m/%d/%Y %I:%M:%S %p"""))
        if len(data_) != match.end():
            raise ValueError("unconverted data remains: %s" % data_string[match.end() :])
        groups_ = match.groups()
        i_hour = int(groups_[3])
        ampm_h_delay = 12 if groups_[6].lower() == """pm""" else 0
        return __datetime(int(groups_[2]), int(groups_[0]), int(groups_[1]), i_hour % 12 + ampm_h_delay, int(groups_[4]), int(groups_[5]), 0)
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise
Out[46]: <function _convtools.converter(data_, *, __v=re.compile('(1[0-2]|0[1-9]|[1-9])/(3[0-1]|[1-2]\\d|0[1-9]|[1-9]| [1-9])/(\\d{4})\\ (1[0-2]|0[1-9]|[1-9]):([0-5]\\d|\\d):(6[0-1]|[0-5]\\d|\\d)\\ (am|pm)', re.IGNORECASE), __datetime=<class 'datetime.datetime'>)>
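For readers who want to experiment without the library, the same approach can be written by hand. The sketch below is a stdlib-only approximation of the generated converter above: parse_us_datetime and PATTERN are hypothetical names, the regular expression is slightly simplified, and only the English am/pm markers are recognized.

```python
import re
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"

# Compiled once at import time, so calls pay only for matching.
PATTERN = re.compile(
    r"(1[0-2]|0[1-9]|[1-9])/(3[01]|[12]\d|0[1-9]|[1-9])/(\d{4}) "
    r"(1[0-2]|0[1-9]|[1-9]):([0-5]\d|\d):([0-5]\d|\d) (am|pm)",
    re.IGNORECASE,
)


def parse_us_datetime(value):
    """Parse strings like '12/31/2020 12:05:54 PM' without strptime."""
    match = PATTERN.match(value)
    if not match:
        raise ValueError(f"time data {value!r} does not match format {FMT!r}")
    if match.end() != len(value):
        raise ValueError(f"unconverted data remains: {value[match.end():]}")
    month, day, year, hour, minute, second, ampm = match.groups()
    # 12-hour clock: 12 AM is hour 0, 12 PM is hour 12.
    hour_offset = 12 if ampm.lower() == "pm" else 0
    return datetime(
        int(year), int(month), int(day),
        int(hour) % 12 + hour_offset, int(minute), int(second),
    )


print(parse_us_datetime("12/31/2020 12:05:54 PM"))
```

As in the generated code, all the format-specific decisions (group order, the am/pm handling) are fixed in the function body, so nothing about the format has to be interpreted at call time.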
Price
Everything has its price. In the case of convtools, it is the time spent on code generation and compilation of converters:
In [47]: %timeit c.format_dt("%b %Y").gen_converter()
54.8 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [48]: %timeit c.datetime_parse("%m/%d/%Y %I:%M:%S %p").gen_converter()
99.7 µs ± 67.3 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Therefore, before using this functionality, check whether your case falls under one of the following:
- the date format is static (known at the time of writing the code), so you can call gen_converter once somewhere globally and then reuse the result
- the date format is dynamic, but the generated converter will be used to process, say, at least 1K (a thousand) dates.
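As a rough back-of-the-envelope check, the break-even point can be estimated from the timings quoted in this article. The helper break_even_calls is hypothetical, and the numbers are specific to the machine the measurements were taken on, so treat the result as an order-of-magnitude estimate only.

```python
import math


def break_even_calls(gen_cost_us, baseline_us, specialized_us):
    """Number of calls after which the generation cost is paid back."""
    saving_per_call_us = baseline_us - specialized_us
    return math.ceil(gen_cost_us / saving_per_call_us)


# Timings from the measurements above, in microseconds.
format_calls = break_even_calls(54.8, 1.21, 0.274)
parse_calls = break_even_calls(99.7, 2.93, 1.29)
print(format_calls, parse_calls)
```

With these numbers the generation cost pays off after several dozen calls, so the 1K threshold leaves a comfortable margin.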
Conclusion
The functionality described above is far from the only thing that the convtools library offers. For more details, please see the links below:
I would be grateful for feedback and ideas in the discussions on GitHub.