regexp – big races

“If computers become too powerful, we can organize them into committees. That will finish them off” (c) unknown author

Big races regexp

Big races regexp

Introduction

Most developers have had to deal with regular expressions one way or another. My first acquaintance was with the implementation of regexp in STL std::regexp. Most often, regular expressions are used in checking input data, something like checking the correctness of the user-entered URL, IPv4 address, IPv6 address, telephone number, and at the same time, the speed of the regex operation does not greatly affect the response time of the application. But what if you have to test hundreds, thousands, or even tens of thousands of rules, all against constantly changing sets of input data in real time? In this situation, you don't just need a fast algorithm, you need the best one, you need a champion!

Selecting competition participants

The selection criteria for participants are quite simple:

  • availability of source codes

  • compatibility according to the rules themselves, at least at a basic level

  • the code is written in C/C++

Hidden text

Yes, I’m a C++ fan, so I don’t consider regexp in languages ​​other than C/C++, sorry

So, googling Using Yandex we find the following applicants:

Methodology

Now we need to decide how we will check the participants.

  1. There is a pre-compiled list of regular expressions that is the same for all participants.

  2. There is a pre-compiled list of data sets that is the same for all participants.

  3. To align the time measurements, we will iteratively check each (of NR) rule IR once per ID iterations of data sets (from ND), i.e. the number of checks for each participant will be: SUM = IR * NR * ID * ND

  4. To reduce the mutual influence of participants on each other, we will collect separate applications for each participant.

  5. To maintain the relevance of the project, the source code of the participant is cloned from his repository, from the master branch or the tolerant main branch.

  6. If the participant is a library, then its assembly is carried out.

  7. The actual code that performs the check consists of a class with two abstract methods – one for preparing rules (including their compilation) and one for directly checking the compiled rules on the data set.

  8. We will run the races on a virtual machine in the cloud

Great race

The code is provided in the repository https://github.com/shvit/regexp_checker

For those who want to run the race on their own computer, you will need to install dependencies. For Ubuntu 20.04 (also suitable for ubuntu-latest):

sudo apt install build-essential cmake ragel git libboost-all-dev

git clone https://github.com/abseil/abseil-cpp abseil-cpp && cd abseil-cpp && mkdir -p bin && cd bin && cmake -DCMAKE_CXX_STANDARD=17 -DCMAKE_POSITION_INDEPENDENT_CODE=ON .. && cmake --build . && sudo cmake --install .

Now we clone the race onto our computer and run it

git clone https://github.com/shvit/regexp_checker
cd regexp_checker
make check

It’s easier, of course, to use the ready-made result of running it on a virtual machine in GitHub Actions:

As a check, instead of unit tests, all participants produced the same number of matches – according to 2'800'000.

Checks were carried out with the following parameters: NR=10, IR=20'000, ND=5, ID=10. Total number of checks per participant SUM=10'000'000.

Conclusion

With the current composition of participants, the undisputed champion is the project hyperscanbut also pleased with the participant who took second place – boost.

It should also be noted that projects tiny-regex-c And ximtech support a reduced set of regexp elements, which is why the list of rules had to be significantly reduced. Project ximtech is a development of the project tiny-regex-c. Also in the project tiny-regex-c there are bugs and also it does not allow compiling more than one rule due to its architecture.

To use the project google-re2 Libraries will need to be installed Abseil and also a non-trivial linking of the finished project will be required.

In conclusion, I would like to note that the project has hyperscan the ability to apply rules on streams, which in some cases can be decisive.

If anyone knows other potential participants in the race, please suggest in the comments.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *