How we created self-service functionality for checking data quality for ML models

Hello! I will continue the story of how we are turning a bank into a “big data” organization. Obviously, the more data a company uses, the more it depends on its quality. Yet data quality issues often get insufficient attention when data marts are developed. This is because data quality requirements are not fixed in business requirements, and the data engineer building the mart does not always know the subject area in depth. The future lies in moving control activities to the business customers themselves; this trend is called self-service. At Gazprombank, we use this principle to check data quality for ML models: every analyst and model developer has access to functionality for assessing the data quality of any data mart. I’ll tell you how we built this scheme of work.

WELL, VERY LARGE DATA MARTS

The quality of business decisions today depends entirely on the quality of the data used. Therefore, a DQaaS (Data Quality as a Service) offering is becoming increasingly popular in many companies. We have such a service too. It is based on a software tool for checking data quality and comes with a data quality (DQ) engineer who will quickly set up DQ checks for the required data marts, analyze the results, and provide conclusions based on the analysis.

The bank’s ML models use big data: each key data mart contains more than 10 million rows. This is information about clients, loan applications, deposits, and so on.

Of course, the models do not work with raw data but with already aggregated data, so some of the quality issues have been resolved at earlier stages. However, because of how the models are built, we need a way to reconcile data from slice to slice, for example from month to month or from week to week. When a model is constructed, a certain data sample is used, and the model’s performance indicators and effectiveness are validated on it. So it is important to check from slice to slice that the data has not changed dramatically. For example, if the model was built on a sample of “30% men, 70% women, 50% of the men under the age of 45,” then it should be applied to similar data. But if a new sample contains 90% men, of whom 70% are over 50, the model may no longer work effectively. A striking example of such deviations is the change in the profile of cafe visitors during Covid: catering revenues changed dramatically due to isolation, people sharply cut spending in cafes, bars and offline stores, while the volume of online purchases grew several times over.

WILL OPEN SOURCE HELP US?

When launching a pilot project, we decided to look for an Open Source tool to check data quality. There are a variety of open source libraries, including Pandas Profiling, Ydata_quality, Deepchecks, Great Expectations, TensorFlow Data Validation. And these are just the ones I’ve personally worked with in Python.

You can learn a lot about your data using these libraries. For example, you can track drift – situations when training data and inference data differ, and the distribution of the input features on which the model was trained shifts. You can spot anomalies, look at minimum/maximum estimates, and study the built-in reports (several projects have nice visualizations). Any analyst or engineer can get started with data quality analysis this way; it requires only minimal knowledge of Python: install the library, call it, and prepare the datasets. True, self-service implies that even this knowledge should not be required.
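To give a sense of how little code such a library needs, here is a minimal sketch using ydata_profiling (the successor to Pandas Profiling mentioned above); the file names are made up for the example:

    import pandas as pd
    from ydata_profiling import ProfileReport  # successor to pandas-profiling

    # Load one slice of a data mart (the path is illustrative).
    df = pd.read_parquet("datamart_slice.parquet")

    # Build an HTML report with distributions, missing values and correlations.
    report = ProfileReport(df, title="Data mart profile", minimal=True)
    report.to_file("datamart_profile.html")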

It turned out that the different free tools provide roughly the same capabilities; the differences lie in specific “tricks”. Serious shortcomings also emerged: some solutions cannot handle large volumes of data (over 1 million rows).

The second drawback is dependence on external developers: if some functionality is missing, you have to wait for a library update to appear or modify the library yourself.

Another important aspect is incompatibility with previously installed libraries. If modelers have built models on certain versions of libraries, then when installing data quality libraries, a version conflict may occur.

After weighing all the capabilities and limitations, we decided to develop our own library and our own workflow engine for Self-Service Data Quality (DQ).

FEATURES OF THE DATA QUALITY ENGINE

We have developed a data quality control process for models that supports custom data quality checks of any complexity, with graphical visualization, the ability to connect to any data source, a transparent role-based access model, and email notifications about check results.

We have developed our own set of metrics to assess data quality. They evaluate key indicators for numerical variables (calculated both in absolute terms and as proportions):

  • number of filled rows;

  • number of empty rows;

  • median value;

  • mean value;

  • minimum value;

  • maximum value;

  • sum of the variable per slice;

  • PSI (Population Stability Index), which shows how the composition of a particular field changes between slices.

The PSI indicator is computationally heavy, so it is not calculated for all columns of the data mart. An initial assessment is made using the metrics above, and the PSI calculation is launched only if more detail is required.
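For reference, PSI compares the distribution of a field between a reference slice and a new one. Below is a minimal sketch of how it could be computed for a numeric column; the function name and binning strategy are our own illustration, not the engine’s actual code:

    import numpy as np

    def population_stability_index(expected, actual, bins=10):
        """PSI between a reference slice (expected) and a new slice (actual)."""
        # Bin edges are derived from the reference distribution.
        edges = np.histogram_bin_edges(expected, bins=bins)
        expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
        actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
        # Guard against log(0) and division by zero in empty bins.
        expected_pct = np.clip(expected_pct, 1e-6, None)
        actual_pct = np.clip(actual_pct, 1e-6, None)
        return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))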

For each metric, you can set a confidence interval for the deviation, that is, indicate what is considered an anomaly. For example, for the number of filled rows we set the threshold to 5%. This means that if the number of filled rows changes by less than 5% compared to the previous slice, this is considered normal.
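In essence, the check boils down to comparing a metric with its value on the previous slice; a minimal sketch of such a comparison (the function itself is illustrative) might look like this:

    def is_anomaly(previous_value: float, current_value: float, threshold: float = 0.05) -> bool:
        """Flag a metric as anomalous if it deviates from the previous slice
        by more than the configured threshold (5% in the example above)."""
        if previous_value == 0:
            return current_value != 0
        return abs(current_value - previous_value) / abs(previous_value) > threshold

    # Example: filled rows went from 10_000_000 to 9_300_000 -> a 7% drop -> anomaly.
    print(is_anomaly(10_000_000, 9_300_000))  # True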

This set of key metrics was formed based on the experience of our specialists. But we, of course, constantly monitor the emergence of innovations in the market, and new metrics will be implemented if necessary.

Additionally, the Isolation Forest model helps identify anomalies; it works well with large amounts of data. We use the ready-made implementation from Scikit-learn, a well-known Python package for Data Science and Machine Learning tasks. It works like this: an arbitrary variable is taken from the sample, its minimum and maximum values are estimated, and an arbitrary split value between the minimum and maximum is chosen. The sample is thus divided into two subsamples, forming an ensemble of trees. Then another variable is taken and the process is repeated until each leaf of a tree contains only one record or identical records. Finally, the path from the root of the tree to each leaf is evaluated: the shorter this path, the higher the probability that the record is an anomaly.
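A minimal sketch of applying Isolation Forest with Scikit-learn to a slice of a data mart; the column names and data here are invented for illustration:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # Illustrative data; in practice this would be a slice of a data mart.
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        "loan_amount": rng.normal(500_000, 50_000, 10_000),
        "client_age": rng.normal(40, 10, 10_000),
    })

    model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
    labels = model.fit_predict(df)     # -1 marks anomalous rows, 1 marks normal ones
    anomalies = df[labels == -1]
    print(f"{len(anomalies)} rows flagged for manual review")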

The great thing about this approach is that it does not require a reference sample or training. We can build a data mart, take two years’ worth of history, and immediately apply the algorithm. The disadvantages of the method include the significant subjectivity of the results: what counts as an anomaly is ultimately decided by the data quality engineer.

That is, the engineer can choose which data assessment tool to use in each specific case.

Overall, the basic set of quality metrics, while seemingly simple, provides a clear way to determine whether there are any problems with the data. The Isolation Forest method catches anomalies in its own way, and its flags serve as a signal for additional verification. In any case, anomalies identified by the system are checked manually by a data quality engineer: to understand the reasons for an anomaly, he has to go down to the raw data and investigate.

DEVELOPMENT PRINCIPLES AND TECHNOLOGY STACK OF THE SELF-SERVICE DQ ENGINE

The path to MVP was gradual: first we implemented basic metrics, then the ability to customize the confidence interval, and so on. Today Self Service DQ is already available for independent analyst work.

In the figure, the horizontal arrows show the process in which the analyst participates. Data marts are calculated in SQL Server or Cloudera Impala; the latter is the bank-wide analytical data platform based on a massively parallel engine for interactively executing SQL queries against data stored in Apache Hadoop.

The checks are written in SQL and Python in Jupyter notebooks: the quality metrics are calculated there and stored in Hadoop. The history of metrics is needed to see the dynamics of the indicators over several years or to quickly spot spikes in the data within the desired time interval.

The quality metrics are visualized in Apache Superset. Because the checks are unified and their results are stored in a single table, we were able to build a single dashboard template. This is a very important step towards a self-service tool: there is no need to build a separate dashboard for each new data mart. Once the data quality check process is completed, the dashboard is immediately available.

To organize the self-service mode, we have developed a page that any analyst can open to set up a schedule of data quality checks for a table or data mart. Everything is simple here: the specialist enters the table name, sets the frequency of running the checks, and, if necessary, indicates which fields should be excluded from the check. The analyst also sets the required confidence intervals, that is, the allowed percentage deviation of the data from the previous slice: if the data has changed by a smaller percentage, it is not a DQ incident; if by a larger percentage, it is. All this data is recorded in the configuration table.
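A hypothetical entry in that configuration table could look like the sketch below; the field names are invented for illustration and do not reflect the actual schema:

    # Hypothetical contents of one configuration entry for a scheduled check.
    check_config = {
        "table_name": "dm_models.client_features",    # data mart to check
        "schedule": "monthly",                         # how often to run the checks
        "excluded_columns": ["client_id", "load_dt"],  # fields left out of the checks
        "thresholds": {                                # allowed slice-to-slice deviation
            "filled_rows": 0.05,
            "median": 0.10,
        },
    }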

The analyst sets the minimum settings, and then the mechanism operates in automatic mode. There are two main solutions at this stage:

  • Apache Airflow, the orchestrator of the data quality analysis process. It creates a DAG (Directed Acyclic Graph) that links all verification tasks together through configured dependencies and relationships. These dependencies determine exactly how the task chain (data pipeline) should run. The tasks that make up a DAG are called operators in Apache Airflow.

  • Apache Superset, the visual analytics tool. The results of the Airflow DAG are automatically written to a table, and Superset works with that table.

The analyst has nothing to do with the operation of these solutions; this is the responsibility of the engineers of the data analysis and modeling team.

The Airflow DAG can be run multiple times within the same check session. It relies on the Scheduler mechanism, which keeps track of all created tasks. The scheduler regularly scans the settings table, which stores information about which data marts need to be checked and with what frequency. By default, the scheduler checks three times a day whether there are tasks ready to launch and, if there are, starts them. If the data mart specified by the analyst has not yet been updated, the DAG waits for the event and launches later automatically.
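For readers unfamiliar with Airflow, here is a minimal sketch of what such a DAG could look like; the DAG id, schedule, and the run_dq_checks function are assumptions for illustration, not the bank’s actual code:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_dq_checks(**context):
        """Placeholder: read the configuration table, compute the metrics
        for every data mart that is due, and store the results."""
        ...

    with DAG(
        dag_id="self_service_dq_checks",      # illustrative name
        start_date=datetime(2024, 1, 1),
        schedule_interval="0 */8 * * *",      # three runs a day, as described above
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="run_dq_checks",
            python_callable=run_dq_checks,
        )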

POSSIBILITIES OF VISUAL ANALYTICS IN ASSESSING DATA QUALITY

The calculated data quality metrics are visualized with Apache Superset. Three key ways of presenting the analytical results are used:

  • A pie chart that displays deviation statistics for all variables of the data mart. Variables with deviations are marked in orange, variables without deviations in blue. Next to it, the same information is displayed as a bar chart for the entire accumulated history of DQ checks.

  • Detailed data in the form of bar charts that reflect the state of the variables over the entire history of DQ checks. They display the number of unique values and of empty rows, as well as the median, minimum, and maximum values.

  • Detailed information about the number of filled rows and the sum for an attribute – essentially the raw data. If, for example, we see that the maximum value has deviated but the median remains the same, it means the change affected only a small number of customer records, applications or other data: there is an outlier, but the median has not shifted. Or, say, we see that the number of empty rows for one variable, for example age, has grown by 300% in percentage terms. That sounds like a lot! But the detailed data may show that in the previous slice of 10 million there were three rows with an empty age, and now there are 10 or 11. Despite the 300%, we are talking about very small values, and this is not considered an incident.

The sum by attribute is a very useful tool for analyzing data marts built for models. The fact is that a data engineer usually builds a large number of fields, and some of them turn out to be similar. To avoid writing the variable’s logic again, the engineer copies the previous variable and changes only the constraint condition, for example, a certain event over 1 month / 3 months / 6 months, and so on. If the engineer gets distracted while copying, he may forget to change the logic. Then the output contains two different variables with exactly the same logic. The sum by attribute will show that a slice by variable A and by variable B gives the same sum, that is, they duplicate each other.
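A simplified sketch of this duplicate check on a single slice; the function is our own illustration of the idea, not the engine’s code:

    import numpy as np
    import pandas as pd

    def find_duplicate_logic_columns(df: pd.DataFrame) -> list[tuple[str, str]]:
        """Return pairs of numeric columns whose sums over the slice coincide,
        which may indicate accidentally copy-pasted variable logic."""
        sums = df.select_dtypes("number").sum()
        cols = list(sums.index)
        pairs = []
        for i, a in enumerate(cols):
            for b in cols[i + 1:]:
                if np.isclose(sums[a], sums[b]):
                    pairs.append((a, b))
        return pairs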

PROSPECTS FOR THE DEVELOPMENT OF SELF-SERVICE DQ SERVICE

We plan to automate data quality management as much as possible, extend the experience of self-service checks to a larger amount of data, and speed up response to incidents.

In addition, we have experience in visualizing the entire process of building a data mart: from the data source to the end consumer (building a data lineage). There are plans to combine this tool with Self-Service DQ so that, in the event of a data quality incident, the source tables “affected” by the incident are color-coded in the visual representation.

The bank is also implementing a new data platform, including the Data Science platform. The Self-service DQ tool will become an integral part of this platform.
