How we catch flaky tests in the Platform V Pangolin DBMS: a look at the backend of the solution

Red tests are unpleasant, but there is something worse: tests that are sometimes red and sometimes green. Every product runs into flaky tests, and the more you test, the more painful it becomes to figure out which tests are flaky and which are not.

My name is Alexander Milov, and I am responsible for testing Platform V Pangolin, the main DBMS at Sber: a special PostgreSQL build created for storing and processing data in high-load applications.

We started building Pangolin in 2019. For a long time, flaky tests were analyzed manually, and information about them was passed from tester to tester every week. As the number of tests grew, this became impossible: it's one thing to track 5–10 tests this way, quite another to track 30–50. Today we run 5,000 tests across more than four main deployment configurations of the system, five supported operating systems (eight OS versions in total), and about 160 special quality gates with different launch parameters. And besides regression testing after installing the product, we also test it after an update and after a rollback…

How do you track flaky tests at this scale? There are ready-made tools, both paid and free, but we needed a solution that would let us not only analyze results across all of the parameters above, but also find patterns. It isn't enough to know that 100 tests fail; you want to see that these 100 tests fail only on one OS and only with a specific installation parameter.

At first we fought flaky tests head-on: we simply took each one and tried to fix it. As a rule, everything came down to "Shall we add a delay?" or "What if we add an extra restart?" One way or another, we needed a systematic approach.

How does the tool work?

First, I'll show you briefly what our solution looks like from the front end, and then we'll go over the backend.

On the screen below, we select the current version, the name of the test or suite, and the time interval for which statistics are taken. Red marks failures, green marks successful completions of a test on a specific date. Each column can be sorted, and additional information or columns can be shown or hidden.

Clicking on a test takes us to detailed statistics on its runs. Clicking on one of the red or green squares brings up more detailed information about the run in which the test was launched. Here we can also see the log of the failed test and the line in the source code where it failed. In short: with a couple of clicks we see the failure history of any test over 30 days (you can choose any period, depending on how long the data is stored).

This is only a small part of the service's front end. With run data and cached logs at hand, we can experiment with any statistics we like.

Where does this data come from?

Everything is simple here: the results of each test are written to a database, which is itself managed by our own Pangolin DBMS. In terms of code, everything depends on the framework you use. In our case it is unittest, so it was easy to write the result of each test to a table in the database as soon as the test finishes.
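At the database level this boils down to a single INSERT at the end of each test. Here is a minimal sketch, assuming a hypothetical test_results table (its columns are sketched in more detail a bit further below); the actual schema and hook code in our framework differ.

```sql
-- Hypothetical record written at the end of each test
-- (table and column names are illustrative, not the real schema).
INSERT INTO test_results (
    test_name, suite_name, started_at, duration_sec,
    os_name, install_features, branch_name, jenkins_build, status
) VALUES (
    'test_backup_restore', 'BackupSuite', now(), 42.7,
    'RHEL 8', 'tde,audit', '6.2.0/develop_night', 1234, 'failed'
);
```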

With all the data in one place, we structure it by version and by how exactly the tests were run.

Why is this important? Surely a simple WHERE clause is enough, and everything will just work?

This is where the pitfalls arise.

First: for flaky analysis, we should only look at tests run from the develop branch (the reference branch from which the release ships). Tests run from other branches must not be counted towards flakiness, because those branches may contain errors in the autotests themselves.

Second, it isn't enough to look only at the develop branch of the autotests. We have to keep in mind that the same tests from the develop branch could have been run against a custom developer build of the product.

So, to calculate flakiness, we count only those runs that were launched against the current develop build of the product and from the develop branch of the autotests.

For this we use the daily regression run; all that's needed is to write a marker into the database along with the results.
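The marker can be as simple as the autotest branch name stored with every result; when counting flakiness we then filter on it. A sketch using the same assumed test_results table:

```sql
-- Only nightly runs of the develop autotests count towards flakiness
-- (names are illustrative; the pattern follows the run marker shown later).
SELECT test_name, started_at, status
FROM test_results
WHERE branch_name ~~ '%develop_night%';
```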

In the main table we record the key information about each test run: the name of the test, the start time, the duration, the suite it belongs to (more on this below), the OS, the features the test stand is installed with or updated to, the Jenkins build number and, of course, the test result.
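A rough sketch of what such a table might look like, matching the fields listed above (column names and types here are assumptions, not the real schema):

```sql
CREATE TABLE IF NOT EXISTS test_results (
    id               bigserial PRIMARY KEY,
    test_name        text        NOT NULL,  -- name of the test
    suite_name       text        NOT NULL,  -- suite it belongs to
    started_at       timestamptz NOT NULL,  -- start time
    duration_sec     numeric     NOT NULL,  -- execution time
    os_name          text        NOT NULL,  -- OS of the test stand
    install_features text,                  -- features the stand is installed/updated with
    branch_name      text        NOT NULL,  -- autotest branch, used as the run marker
    jenkins_build    integer,               -- build number in Jenkins
    status           text        NOT NULL   -- passed / failed / skipped
);
```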

In the earlier screenshot you saw two columns: Done and Done 2w. They exist so that the table contains only tests that are failing right now.

Let me give an example. Suppose our most frequently failing flaky test fails once again. A tester analyzes it, figures out the reason for the failure and fixes everything. After the fix, the test stops failing and passes every time (as in the screenshot below):

But if you look at the test's entire history, it will never disappear from the statistics, because its pass rate over all time will never reach 100%.

That's why we introduced a cutoff. We consider a test flaky if it has failed at least once within a certain period (in our case, two weeks). As a result, if the fixed test doesn't fail during the next two weeks, it automatically disappears from the statistics.

So, for each version of the product we create two views: the first covers the entire lifetime of the version, the second covers only a recent period.

For these views we chose materialized views.

The view for the recent period samples data from the last 14 days; the run marker in this case is the filter branch_name ~~ '%6.2.0/develop_night%'.
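A minimal sketch of what such a view could look like, using the assumed test_results table from above (view and column names here are assumptions):

```sql
-- Per-test pass statistics for the last 14 days on nightly develop runs.
CREATE MATERIALIZED VIEW IF NOT EXISTS flaky_6_2_0_14d AS
SELECT
    test_name,
    count(*)                                  AS total_runs,
    count(*) FILTER (WHERE status = 'passed') AS passed_runs,
    count(*) FILTER (WHERE status = 'failed') AS failed_runs,
    round(100.0 * count(*) FILTER (WHERE status = 'passed') / count(*), 2)
                                              AS pass_pct_14d
FROM test_results
WHERE branch_name ~~ '%6.2.0/develop_night%'
  AND started_at >= now() - interval '14 days'
GROUP BY test_name;
```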

Having the pass percentage both overall and for the recent period, all that remains is to combine the information from the two views into a final table of flaky tests, which is what the front end displays.
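Combining the two views might look roughly like this, where flaky_6_2_0_all is the analogous all-time view with a pass_pct_all column (all names are assumptions); the done / done_2w aliases correspond to the Done and Done 2w columns on the front:

```sql
-- Final flaky table: overall pass rate plus the two-week window,
-- keeping only tests that have failed at least once in the last 14 days.
SELECT a.test_name,
       a.pass_pct_all AS done,
       d.pass_pct_14d AS done_2w
FROM flaky_6_2_0_all a
JOIN flaky_6_2_0_14d d USING (test_name)
WHERE d.failed_runs > 0
ORDER BY d.pass_pct_14d;
```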

How is all this managed?

Once a day (it could be more often, but we haven't needed that), all the materialized views are refreshed with REFRESH MATERIALIZED VIEW.

All of this is scheduled using the pg_cron extension.
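Scheduling the refresh with pg_cron can be as simple as this (the schedule and the view name are illustrative):

```sql
-- Refresh the two-week statistics view once a day at 03:00 server time.
SELECT cron.schedule(
    '0 3 * * *',
    $$REFRESH MATERIALIZED VIEW flaky_6_2_0_14d$$
);
```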

Besides refreshing the views, we also delete old test run data, again splitting tests into develop and everything else. For non-develop branches we keep data only for the last 30 days. For develop tests we keep results for 45 days, counted not from today but from each test's most recent run.
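The cleanup itself can be a pair of DELETE statements with the two retention rules, again sketched on the assumed schema:

```sql
-- Non-develop branches: keep only the last 30 days.
DELETE FROM test_results
WHERE branch_name !~~ '%develop%'
  AND started_at < now() - interval '30 days';

-- Develop: keep 45 days counted from each test's most recent run.
DELETE FROM test_results r
WHERE r.branch_name ~~ '%develop%'
  AND r.started_at < (
        SELECT max(m.started_at)
        FROM test_results m
        WHERE m.test_name = r.test_name
          AND m.branch_name ~~ '%develop%'
      ) - interval '45 days';
```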

When we create a new version of the product, we run a procedure that automatically adds all the views needed for that version.
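Such a procedure might be sketched roughly like this, deriving the view name and the branch filter from the version string (everything here is an assumption about how it could be done, not our actual procedure):

```sql
CREATE OR REPLACE PROCEDURE create_version_views(p_version text)
LANGUAGE plpgsql
AS $$
BEGIN
    -- Example: CALL create_version_views('6.2.0');
    EXECUTE format(
        'CREATE MATERIALIZED VIEW IF NOT EXISTS flaky_%s_14d AS
         SELECT test_name,
                count(*) FILTER (WHERE status = ''passed'') * 100.0 / count(*) AS pass_pct_14d,
                count(*) FILTER (WHERE status = ''failed'')                    AS failed_runs
         FROM test_results
         WHERE branch_name ~~ %L
           AND started_at >= now() - interval ''14 days''
         GROUP BY test_name',
        replace(p_version, '.', '_'),
        '%' || p_version || '/develop_night%'
    );
    -- ...and the same for the all-time view of this version.
END;
$$;
```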

We haven't forgotten about creating backups either – using the Linux cron utility.

We write not only test results to the main table, but also suite results. This lets us calculate the total execution time of a suite (after all, the preparation in setUpClass sometimes takes several times longer than the tests themselves). Knowing the average execution time of each suite, we can predict the run time from the number of suites and parallelize runs based on that average, as in the query below. But that's another story, about how we significantly reduced the execution time of our tests. Once I get clear of the current releases, I'll write a sequel. Or, more likely, I'll talk about it sooner in our team's community; if you're interested, join us.
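For instance, the average suite duration that feeds the parallelization can be taken from the same table; in this sketch a record_type column marking suite-level rows is an assumption, not part of the earlier table sketch:

```sql
-- Average wall-clock time per suite over recent nightly develop runs.
SELECT suite_name,
       round(avg(duration_sec), 1) AS avg_duration_sec
FROM test_results
WHERE record_type = 'suite'               -- assumed marker for suite-level rows
  AND branch_name ~~ '%develop_night%'
GROUP BY suite_name
ORDER BY avg_duration_sec DESC;
```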

So what did we get in the end?

First of all, convenience. An engineer now only needs to open the page, pick the test that fails most often, learn all the details about it, and immediately deploy a stand with the exact configuration and OS where it fails most often for debugging.

In addition, knowing that a test is flaky, we can modify the code that launches the autotests so that such a test is re-run immediately within the same run.

Looking at the statistics, we also found a pattern: certain tests fail only on certain deployment configurations. Based on these findings, the fixes went not into the automated tests but into the product's source code.

In the future, we plan to build burndown charts of flaky tests for each feature, and also to use this information when reviewing final run reports.

Thank you for your attention! I am ready to answer questions about the solution in the comments.
