regexp – big races
“If computers become too powerful, we can organize them into committees. That will finish them off” (c) unknown author
Introduction
Most developers have had to deal with regular expressions one way or another. My first acquaintance was with the implementation of regexp in STL std::regexp
. Most often, regular expressions are used in checking input data, something like checking the correctness of the user-entered URL, IPv4 address, IPv6 address, telephone number, and at the same time, the speed of the regex operation does not greatly affect the response time of the application. But what if you have to test hundreds, thousands, or even tens of thousands of rules, all against constantly changing sets of input data in real time? In this situation, you don't just need a fast algorithm, you need the best one, you need a champion!
Selecting competition participants
The selection criteria for participants are quite simple:
availability of source codes
compatibility according to the rules themselves, at least at a basic level
the code is written in C/C++
Hidden text
Yes, I’m a C++ fan, so I don’t consider regexp in languages other than C/C++, sorry
So, googling Using Yandex we find the following applicants:
Methodology
Now we need to decide how we will check the participants.
There is a pre-compiled list of regular expressions that is the same for all participants.
There is a pre-compiled list of data sets that is the same for all participants.
To align the time measurements, we will iteratively check each (of
NR
) ruleIR
once perID
iterations of data sets (fromND
), i.e. the number of checks for each participant will be:SUM = IR * NR * ID * ND
To reduce the mutual influence of participants on each other, we will collect separate applications for each participant.
To maintain the relevance of the project, the source code of the participant is cloned from his repository, from the master branch or the tolerant main branch.
If the participant is a library, then its assembly is carried out.
The actual code that performs the check consists of a class with two abstract methods – one for preparing rules (including their compilation) and one for directly checking the compiled rules on the data set.
We will run the races on a virtual machine in the cloud
Great race
The code is provided in the repository https://github.com/shvit/regexp_checker
For those who want to run the race on their own computer, you will need to install dependencies. For Ubuntu 20.04 (also suitable for ubuntu-latest
):
sudo apt install build-essential cmake ragel git libboost-all-dev
git clone https://github.com/abseil/abseil-cpp abseil-cpp && cd abseil-cpp && mkdir -p bin && cd bin && cmake -DCMAKE_CXX_STANDARD=17 -DCMAKE_POSITION_INDEPENDENT_CODE=ON .. && cmake --build . && sudo cmake --install .
Now we clone the race onto our computer and run it
git clone https://github.com/shvit/regexp_checker
cd regexp_checker
make check
It’s easier, of course, to use the ready-made result of running it on a virtual machine in GitHub Actions:
As a check, instead of unit tests, all participants produced the same number of matches – according to 2'800'000
.
Checks were carried out with the following parameters: NR=10, IR=20'000, ND=5, ID=10
. Total number of checks per participant SUM=10'000'000
.
Conclusion
With the current composition of participants, the undisputed champion is the project hyperscan
but also pleased with the participant who took second place – boost
.
It should also be noted that projects tiny-regex-c
And ximtech
support a reduced set of regexp elements, which is why the list of rules had to be significantly reduced. Project ximtech
is a development of the project tiny-regex-c
. Also in the project tiny-regex-c
there are bugs and also it does not allow compiling more than one rule due to its architecture.
To use the project google-re2
Libraries will need to be installed Abseil
and also a non-trivial linking of the finished project will be required.
In conclusion, I would like to note that the project has hyperscan
the ability to apply rules on streams, which in some cases can be decisive.
If anyone knows other potential participants in the race, please suggest in the comments.