Today’s world is undergoing yet another technological revolution, one aspect of which is companies of every kind using accumulated data to drive their sales flywheel, profits, and PR. Quality data, along with skilled people who can make money from it (process it correctly, visualize it, build machine learning models, and so on), seems to have become the key to success for many. If 15–20 years ago only large companies were seriously engaged in accumulating and monetizing data, today it is the concern of almost everyone.
In this regard, several years ago job search portals around the world began to fill up with Data Scientist vacancies, since everyone was sure that hiring such a specialist would let them build a machine learning supermodel, predict the future, and make a “quantum leap” for the company. Over time, people realized that this approach almost never works, because not all the data that reaches such specialists is suitable for training models.
And so the requests from Data Scientists began: “Let’s buy more data from this or that vendor…”, “We don’t have enough data…”, “We need more data, preferably high-quality data…”. Based on these requests, numerous exchanges between companies owning particular data sets began to take shape. Naturally, this required the technical organization of the process: connecting to a data source, downloading the data, checking that it loaded completely, and so on. The number of such processes kept growing, and today there is a huge need for another kind of specialist: Data Quality engineers, who monitor the flow of data through the system (data pipelines) and the quality of the data at input and output, and draw conclusions about its adequacy, integrity, and other characteristics.
The trend for Data Quality engineers came to us from the United States, where in the midst of a raging era of capitalism no one is ready to lose the battle for data. Below are screenshots from the two most popular job search sites in the USA, www.monster.com and www.dice.com, showing the number of posted vacancies as of March 17, 2020 for the keywords Data Quality and Data Scientist.
| Site | Data Scientist | Data Quality |
| --- | --- | --- |
| www.monster.com | 21,416 jobs | 41,104 jobs |
| www.dice.com | 404 jobs | 2,020 jobs |
Obviously, these professions do not compete with each other in any way. With the screenshots, I just wanted to illustrate the current labor market situation in terms of demand for Data Quality engineers, who are now needed in far greater numbers than Data Scientists.
In June 2019, EPAM, responding to the needs of the modern IT market, established Data Quality as a separate practice. In their daily work, Data Quality engineers manage data, check its behavior in new conditions and systems, and control the relevance, adequacy, and timeliness of the data. In practice, Data Quality engineers devote only a little time to classical functional testing, BUT that depends heavily on the project (I will give an example later).
The duties of a Data Quality engineer are not limited to routine manual/automated checks for “nulls, counts, and sums” in database tables; they require a deep understanding of the customer’s business needs and, accordingly, the ability to transform the available data into usable business information.
Data Quality Theory
To fully understand the role of such an engineer, let’s figure out what Data Quality is in theory.
Data Quality is one of the stages of Data Management (a whole world that we will leave for you to study on your own) and is responsible for analyzing data according to the following criteria:
I don’t think it’s worth deciphering each of the points (in theory they are called “data dimensions”); they are described quite well in the picture. But the testing process itself does not mean strictly copying these attributes into test cases and verifying them. In Data Quality, as in any other type of testing, you must first of all start from the data quality requirements agreed upon with the project participants who make the business decisions.
Depending on the Data Quality project, an engineer can perform various functions: from an ordinary automation tester making a superficial assessment of data quality to a person who profiles the data deeply according to the criteria above.
Data Management, Data Quality, and the related processes are described in great detail in the book “DAMA-DMBOK: Data Management Body of Knowledge, 2nd Edition”. I highly recommend it as an introduction to the topic (you will find a link to it at the end of the article).
In the IT industry, I have gone from a Junior Product Tester to a Lead Data Quality Engineer at EPAM. After about two years as a tester, I was firmly convinced that I had done absolutely every type of testing: regression, functional, stress, stability, security, UI, and so on. I had tried a large number of testing tools and, along the way, worked in three programming languages: Java, Scala, and Python.
Looking back, I understand why my set of professional skills turned out to be so diverse: I participated in projects, large and small, that revolved around working with data. This is what brought me into a world with a large number of tools and opportunities for growth.
To appreciate the variety of tools and the opportunities for gaining new knowledge and skills, just look at the picture below, which shows the most popular of them in the “Data & AI” world.
This kind of illustration is compiled annually by Matt Turck, a well-known venture capitalist who came from software development; he publishes it on his blog, and he works as a partner at a venture capital firm.
I grew professionally especially fast when I was the only tester on a project, or at least at the start of one. At such a moment you have to be responsible for the entire testing process, and you have no way to step back, only forward. At first it was scary, but now all the advantages of such a trial are obvious to me:
- You begin to communicate with the whole team more than ever, since there is no proxy for communication: neither a test manager, nor fellow testers.
- Immersion in the project becomes incredibly deep, and you possess information about all the components, both in general and in detail.
- Developers do not look at you as “that guy from testing who doesn’t understand what he’s doing” but rather as an equal who brings real benefit to the team with his autotests and by anticipating bugs in particular parts of the product.
- As a result, you are more efficient, more qualified, more in demand.
As the project grew, 100% of the time I acted as a mentor for newly arrived testers, training them and passing on the knowledge I had gained myself. At the same time, depending on the project, management did not always assign me the most skilled automation experts, so I needed either to train them in automation (for those who wished) or to create tools they could use in their everyday activities (tools for generating data and loading it into the system, a tool for running “quick” load/stability testing, etc.).
Specific Project Example
Unfortunately, due to non-disclosure obligations I cannot talk in detail about the projects I have worked on, but I will give examples of typical Data Quality engineer tasks from one of them.
The essence of the project was to build a platform for preparing data that would be used to train machine learning models. The customer was a large pharmaceutical company from the United States. Technically, it was a Kubernetes cluster running on AWS EC2 instances, with several microservices and EPAM’s underlying open-source project Legion, adapted to the needs of this particular customer. ETL processes were organized using Apache Airflow and moved data from the customer’s SalesForce systems into AWS S3 buckets. Then a Docker image of the machine learning model was deployed on the platform; it was trained on the latest data and, via a REST API, produced predictions that were of interest to the business and solved specific problems.
Visually, everything looked something like this:
There was plenty of functional testing on this project. Given the speed of feature development and the need to maintain the pace of the release cycle (two-week sprints), we had to think about automating the testing of the most critical system nodes right away. Most of the Kubernetes-based platform itself was covered by autotests implemented with Robot Framework + Python, but those also needed to be maintained and extended. In addition, for the customer’s convenience, a GUI was created for managing the machine learning models deployed in the cluster, as well as for specifying where data for model training should be taken from and where it should go. This large addition entailed expanding the automated functional checks, most of which were done through REST API calls, plus a small number of end-to-end UI tests. At about the midpoint of all this activity, a manual tester joined us, who did an excellent job with acceptance testing of product versions and with communicating with the customer about accepting each release. Thanks to this new specialist, we were also able to document our work and add some very important manual checks that were difficult to automate right away.
And finally, once we had achieved stability from the platform and the GUI add-on on top of it, we turned to building ETL pipelines as Apache Airflow DAGs. Automated data quality control was implemented by writing dedicated Airflow DAGs that checked the data produced by the ETL process. On this project we were lucky: the customer gave us access to anonymized data sets to test on. We checked the data row by row for type conformity, broken records, total record counts before and after, and the correctness of transformations performed by the ETL process: aggregation, column renaming, and so on. These checks were also scaled to other data sources, for example MySQL in addition to SalesForce.
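As a rough sketch of the kind of row-by-row check such a DAG could run (the column names, types, and error messages below are my own illustration, not the project’s actual code):

```python
import csv
import io

# Expected schema for the illustration: column name -> parser that
# raises ValueError on a broken value. All names here are hypothetical.
EXPECTED_SCHEMA = {"id": int, "amount": float, "region": str}

def check_rows(csv_text, expected_count=None):
    """Row-by-row checks: column set, value types, no empty cells,
    and (optionally) the total record count after the ETL step."""
    reader = csv.DictReader(io.StringIO(csv_text))
    errors = []
    rows = 0
    if set(reader.fieldnames or []) != set(EXPECTED_SCHEMA):
        errors.append(f"unexpected columns: {reader.fieldnames}")
    for i, row in enumerate(reader, start=1):
        rows += 1
        for col, parse in EXPECTED_SCHEMA.items():
            value = row.get(col)
            if value is None or value == "":
                errors.append(f"row {i}: empty {col}")
                continue
            try:
                parse(value)
            except ValueError:
                errors.append(f"row {i}: bad {col} value {value!r}")
    if expected_count is not None and rows != expected_count:
        errors.append(f"count mismatch: {rows} != {expected_count}")
    return errors
```

In a real Airflow DAG, a function like this can be wrapped in a PythonOperator task that raises (and so fails the run) when the error list is non-empty.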
Final data quality checks were carried out at the S3 level, where the data was stored ready for use in training machine learning models. To fetch the final CSV file from the S3 bucket and validate it, we wrote code using the boto3 client.
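A minimal sketch of what such a validation could look like; boto3 is imported lazily inside the download helper so the pure validation logic can be exercised without AWS credentials (bucket, key, and column names are illustrative assumptions):

```python
import csv
import io

def fetch_csv(bucket, key):
    """Download a CSV object from S3 as text. boto3 (the AWS SDK) is
    imported lazily so the validation below stays testable offline."""
    import boto3
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return body.read().decode("utf-8")

def validate_final_csv(csv_text, required_columns, min_rows=1):
    """Sanity checks on the 'ready-to-use' file: every required column
    is present in the header and the file is not empty or truncated."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    header = reader.fieldnames or []
    problems = [f"missing column: {c}"
                for c in required_columns if c not in header]
    if len(rows) < min_rows:
        problems.append(f"too few rows: {len(rows)} < {min_rows}")
    return problems
```

On a project, a check like this would run against the bucket the ETL process writes to, and a non-empty problem list would fail the test run.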
The customer also required that part of the data be stored in one S3 bucket and part in another. This called for additional checks verifying the reliability of that split.
Generalized experience on other projects
Here is a generalized list of a Data Quality engineer’s activities:
- Prepare test data (valid / invalid / large / small) with an automated tool.
- Upload the prepared data set to the source and check that it is ready for use.
- Launch ETL processes that move the data set from the source storage to the intermediate or final one, using a specific set of settings (if possible, set configurable parameters for the ETL task).
- Verify that the data processed by the ETL process meets quality standards and business requirements.
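The four steps above can be sketched as a toy pipeline test; the record structure and the trivial “ETL” stand-in are invented for illustration, not a real implementation:

```python
import random

def generate_dataset(n, invalid_ratio=0.1, seed=42):
    """Step 1: prepare test data, mixing valid and invalid records."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        broken = rng.random() < invalid_ratio
        amount = "broken" if broken else round(rng.uniform(1, 100), 2)
        data.append({"id": i, "amount": amount})
    return data

def etl(records):
    """Step 3 stand-in: filter out broken records and aggregate."""
    clean = [r for r in records if isinstance(r["amount"], float)]
    return {"rows": clean, "total": round(sum(r["amount"] for r in clean), 2)}

def verify(source, result):
    """Step 4: output never exceeds input, aggregate matches its rows."""
    assert len(result["rows"]) <= len(source)
    assert result["total"] == round(sum(r["amount"] for r in result["rows"]), 2)
    return True
```

Step 2 (uploading to the source) is where a real project would call the storage API instead of passing a Python list around.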
At the same time, the main emphasis of the checks should be not only on the fact that the data flow in the system has run and reached the end (which is part of functional testing), but mostly on verifying and validating the data against the expected requirements, identifying anomalies, and so on.
One technique for such data control is organizing chain checks at each stage of data processing, known in the literature as the “data chain”: data control from the source to the point of final use. Such checks are most often implemented as validating SQL queries. Clearly, these queries should be as lightweight as possible, each checking an individual slice of data quality (table metadata, blank rows, NULLs, syntax errors, and other required attributes).
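For illustration, here is a minimal sketch of such lightweight validating queries, using SQLite as a stand-in for the real database (the table and check names are hypothetical):

```python
import sqlite3

# Lightweight "data chain" validation queries run against an
# intermediate store; each checks one narrow slice of data quality.
VALIDATION_QUERIES = {
    "row_count": "SELECT COUNT(*) FROM sales",
    "null_ids": "SELECT COUNT(*) FROM sales WHERE id IS NULL",
    "empty_regions": "SELECT COUNT(*) FROM sales WHERE TRIM(region) = ''",
}

def run_validations(conn):
    """Execute each check and return check name -> scalar result."""
    return {name: conn.execute(sql).fetchone()[0]
            for name, sql in VALIDATION_QUERIES.items()}
```

At each stage of the chain, the results are compared against the thresholds agreed upon with the business (e.g. `null_ids` must be zero).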
For regression testing, which uses ready-made (immutable or slowly changing) data sets, the autotest code can store ready-made templates for checking data quality (descriptions of the expected table metadata, sample objects that can be selected at random during the test, and so on).
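A sketch of such a template stored directly in autotest code, again with SQLite standing in for the real database and an illustrative table schema:

```python
import sqlite3

# A regression "template": the expected table metadata kept right in
# the autotest code (the table and columns here are illustrative).
EXPECTED_METADATA = {"sales": [("id", "INTEGER"), ("region", "TEXT")]}

def check_metadata(conn):
    """Compare actual column names/types against the stored template."""
    mismatches = []
    for table, expected in EXPECTED_METADATA.items():
        info = conn.execute(f"PRAGMA table_info({table})").fetchall()
        actual = [(row[1], row[2]) for row in info]  # (name, declared type)
        if actual != expected:
            mismatches.append((table, actual))
    return mismatches
```

Since the data set barely changes between runs, the same template can be reused release after release and only updated when the schema legitimately evolves.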
Also, during testing you have to write test ETL processes using frameworks such as Apache Airflow, Apache Spark, or even black-box cloud tools such as GCP Dataprep and GCP Dataflow. This circumstance forces the test engineer to immerse himself in the principles of these tools, which makes it even more effective both to conduct functional testing (for example, of the project’s existing ETL processes) and to use the tools for data verification. In particular, Apache Airflow has ready-made operators for working with popular analytical databases, for example GCP BigQuery. A basic example of its use has already been covered, so I will not repeat it here.
Besides ready-made solutions, no one forbids you to build your own techniques and tools. This benefits not only the project but also the Data Quality engineer, who thereby broadens his technical horizons and coding skills.
How it works on a real project
A good illustration of the previous paragraphs about the “data chain”, ETL, and ubiquitous checks is the following process from one of our real projects:
Here, various data (prepared by us, naturally) enter the input “funnel” of our system: valid, invalid, mixed, and so on. The data is then filtered and lands in an intermediate storage; after that, another series of transformations awaits it and places it in the final storage, from which, in turn, analytics will be performed, data marts will be built, and business insights will be sought. In such a system, without functionally checking the operation of the ETL processes, we focus on the quality of the data before and after the transformations, as well as at the point where it reaches analytics.
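A simplified model of this funnel with count checkpoints between stages might look like the following (the stage logic and field names are invented for illustration):

```python
def stage_filter(records):
    """Funnel stage 1: drop invalid records (missing or negative amount)."""
    return [r for r in records
            if r.get("amount") is not None and r["amount"] >= 0]

def stage_transform(records):
    """Funnel stage 2: a simple enrichment before the final storage."""
    return [{**r, "amount_cents": int(r["amount"] * 100)} for r in records]

def checkpoint_counts(raw, filtered, final):
    """Data-chain control: counts may only shrink at the filter stage
    and must be preserved exactly by the transformation."""
    assert len(filtered) <= len(raw), "filter produced extra records"
    assert len(final) == len(filtered), "transform lost or duplicated records"
    return {"raw": len(raw), "filtered": len(filtered), "final": len(final)}
```

On a real project, each checkpoint would query the corresponding storage (input, intermediate, final) rather than in-memory lists, but the invariants being asserted are the same.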
Summarizing the above: regardless of where I have worked, I have always been involved in Data projects that shared the following features:
- Only through automation can you verify some cases and achieve a business-friendly release cycle.
- The tester on such a project is one of the most respected team members, since testers bring great benefit to every participant (faster testing, good data for Data Scientists, early detection of defects).
- It doesn’t matter whether you work on your own hardware or in the cloud: all resources are abstracted into a cluster such as Hortonworks, Cloudera, Mesos, Kubernetes, etc.
- Projects are built on a microservice approach, distributed and parallel computing prevail.
I note that when testing in the Data Quality field, the testing specialist shifts his professional focus to the product’s code and the tools used.
Distinctive features of Data Quality testing
In addition, I have identified for myself the following distinguishing features of testing in Data (Big Data) projects and systems compared with other areas (a disclaimer right away: they are VERY generalized and purely subjective):
Recommended materials for the novice Data Quality engineer:
- Theory: “DAMA-DMBOK: Data Management Body of Knowledge, 2nd Edition”.
- The EPAM training center.
- Free Stepik course: Introduction to Databases.
- LinkedIn Learning course: Data Science Foundations: Data Engineering.
Data Quality is a very young, promising field, and to be part of it means to be part of a startup. In Data Quality you will plunge into a large number of modern, in-demand technologies, but most importantly, you will have enormous opportunities to generate and implement your own ideas. You can apply the continuous-improvement approach not only to the project but also to yourself, continuously developing as a specialist.