Who is responsible for the quality of analytics: QA for the Data Warehouse

Trust your eyes and what you see on the Dashboard

At Wheely, we rely heavily on data to make operational and strategic decisions, from paying weekly bonuses to partners to expanding into other cities and countries.

Every manager and Product Owner knows their area intimately, and any deviation immediately raises questions. This places high demands on the reliability of dashboards and metrics, and we on the Analytics team strive to identify and fix problems before they get reported.

As you know, prevention is easier than cure, so I decided to approach the problem systematically and proactively. And, of course, the first thing I did was create a Slack channel that receives notifications about any errors in our pipelines.

Confidence in the freshness of data marts

First of all, we want to make sure that the data marts contain up-to-date data:

  • By 10 a.m. every day, the data marts for the last full day have been calculated

  • Reads from the sources keep pace, and the lag does not exceed 8 hours

  • All sources continue to send their change logs to the DWH

The QA task is thus formulated as follows:

  • Show me all the data marts whose data is staler than expected

Implementation in the Warehouse:

  • In the .yml configuration file, add the freshness parameter:

freshness:
  warn_after: {count: 4, period: hour}
  error_after: {count: 8, period: hour}
loaded_at_field: "__etl_loaded_at"
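
For context, in dbt this block sits inside a source definition; a minimal sketch, where the source, schema, and table names are hypothetical:

version: 2

sources:
  - name: mongo            # hypothetical source name
    schema: raw_mongo      # hypothetical schema
    freshness:
      warn_after: {count: 4, period: hour}
      error_after: {count: 8, period: hour}
    loaded_at_field: "__etl_loaded_at"
    tables:
      - name: orders       # hypothetical table

The checks are then run with dbt source snapshot-freshness (dbt source freshness in newer versions).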

  • A simple templated SQL query will be executed for each test:

select
    max({{ loaded_at_field }}) as max_loaded_at,
    {{ current_timestamp() }} as snapshotted_at
from {{ source }}
where {{ filter }}
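
From these two timestamps, the lag can be derived and compared against the thresholds; a minimal sketch (the result table name and the Redshift-style datediff syntax are assumptions):

-- Hypothetical summary over collected freshness results:
-- compare the lag against the warn/error thresholds above
select
    max_loaded_at,
    snapshotted_at,
    datediff(hour, max_loaded_at, snapshotted_at) as lag_hours,
    case
        when datediff(hour, max_loaded_at, snapshotted_at) > 8 then 'ERROR'
        when datediff(hour, max_loaded_at, snapshotted_at) > 4 then 'WARN'
        else 'PASS'
    end as freshness_status
from freshness_results   -- hypothetical table of check results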

  • The collected metrics can be visualized in a summary report.

Monitoring data mart calculation metrics

Inevitably, when designing complex data marts and cubes, problem areas arise:

  • Bugs and miscalculations in the formulas for calculating metrics

  • Unexpected data (edge cases) that can break the underlying logic

  • Bottlenecks in the calculation operations

They can lead to serious consequences:

  • Errors: Timeout, Out of Memory, Disk Full

  • Slowdown of the entire loading and calculation pipeline, and SLA violations

The following metrics can be collected for monitoring:

  • Time spent building each data mart and its dynamics (jumps in calculation time)

  • CPU resource consumption

  • Disk and network resource consumption – I/O, traffic

The leaders of this ranking are the first candidates for optimization and refactoring.

The task is formulated as follows:

  • Show me the data marts whose construction requires too many resources

Implementation in the Warehouse:

  • Collect data mart calculation metrics

  • Build a dashboard

  • Set up alerts

For example, the metrics can be captured with logging hooks (here, macros from the logging package) that record a start and an end event for every model:

+pre-hook: "{{ logging.log_model_start_event() }}"
+post-hook: "{{ logging.log_model_end_event() }}"
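
These events can then be aggregated into the ranking described above; a minimal sketch, assuming the hooks write into an event table (all table and column names here are hypothetical, and datediff is Redshift-style):

-- Rank models by build duration for the current day
select
    model,
    started_at,
    finished_at,
    datediff(second, started_at, finished_at) as duration_seconds
from (
    select
        model,
        min(case when event_name = 'model_start' then event_timestamp end) as started_at,
        max(case when event_name = 'model_end' then event_timestamp end) as finished_at
    from model_events          -- hypothetical table populated by the hooks
    where event_timestamp >= current_date
    group by model
) as runs
order by duration_seconds desc
limit 10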

Data schema validation at the heart of testing

Modern warehouses assume structure, strong typing, columnar storage, and data compression. The structure of the data is its schema: a set of attributes, their types, and constraints such as PRIMARY KEY, FOREIGN KEY, NOT NULL, UNIQUE.

Most DWHs do not validate constraints at write time; a constraint serves rather as a hint to the query optimizer. In other words, it is merely an assumption about the data, and the actual verification is left to engineers and analysts.

What basic expectations can we have regarding the data:

  • Are there NULLs in the data where they shouldn’t be?

  • What is the atomicity of my data (UNIQUE row ID)?

  • How do tables relate to each other (PRIMARY – FOREIGN KEYS)?

  • Are there any entries out of the ACCEPTED VALUES list?

The QA task is formulated as follows:

  • Show me the data marts and sources where the data breaks our expectations

Implementation in the Warehouse:

  • In the .yml configuration file, add the tests parameter:

- name: dim_cars
  description: Wheely partners cars.
  columns:
    - name: car_id
      tests:
        - not_null
        - unique
    - name: status
      tests:
        - not_null
        - accepted_values:
            values: ['deleted', 'unknown', 'active', 'end_of_life', 'pending', 'rejected',
                     'blocked', 'expired_docs', 'partner_blocked', 'new_partner']

  • A simple templated SQL query will be executed for each test:

-- NOT NULL test
select count(*) as validation_errors
from "wheely"."dbt_test"."dim_cars"
where car_id is null
 
-- UNIQUE test
select count(*) as validation_errors
from (
   select
       car_id
   from "wheely"."dbt_test"."dim_cars"
   where car_id is not null
   group by car_id
   having count(*) > 1
) validation_errors
 
-- ACCEPTED VALUES test
with all_values as (
   select distinct
       status as value_field
   from "wheely"."dbt_test"."dim_cars"
),
validation_errors as (
   select
       value_field
   from all_values
   where value_field not in (
       'deleted','unknown','active','end_of_life','pending','rejected','blocked','expired_docs','partner_blocked','new_partner'
   )
)
select count(*) as validation_errors
from validation_errors
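
The PRIMARY–FOREIGN KEY expectation from the list above is covered by a relationships test, which compiles to a similar query; a sketch, where fct_orders is a hypothetical fact table referencing dim_cars:

-- RELATIONSHIPS test: every car referenced by an order must exist in dim_cars
select count(*) as validation_errors
from (
    select car_id as id from "wheely"."dbt_test"."fct_orders"
) as child
left join (
    select car_id as id from "wheely"."dbt_test"."dim_cars"
) as parent on parent.id = child.id
where child.id is not null
  and parent.id is null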

Business logic is also subject to validation

I use the term business logic in a broad sense: any arbitrarily complex logic of calculations and relationships that we put into data marts. In essence, these are the business requirements that underlie the data marts.

A few simple examples:

  • The order amount cannot be negative

  • The car pick-up time for an order is strictly later than the booking time

  • A user session ends with a single order or is abandoned

  • The commission does not exceed the specified percentage

Sharp jumps in values (both up and down), implausible charts, and unlikely metric values often draw close attention on dashboards and reports.

The QA task is formulated as follows:

  • Show me those data marts where business rules are violated.

Implementation in the Warehouse:

  • Express in SQL the situation that describes a violation of the rule

  • Generate a test based on that SQL query (a sketch follows below)

  • The test is considered PASSED if the query returns 0 records and FAILED if it returns 1 or more
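
For the first rule above, such a test might look like this (fct_orders and its columns are hypothetical):

-- Business-rule test: the order amount cannot be negative
-- 0 rows returned => PASSED; 1 or more rows => FAILED
select
    order_id,
    amount
from "wheely"."dbt_test"."fct_orders"
where amount < 0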

Continuous Integration guarding the DWH master branch

OK, let's move on. We work on the DWH as a whole team, which requires coordination and consistency. Yet errors, miscalculations, and lapses of attention at the development stage are common, and they lead to failures in the PROD environment after a PR is merged:

  • An improvement in one part may break another part

  • The DEV environment of the engineer may differ from the PROD environment

  • Running suboptimal code on all data may result in an error (e.g. Out of Memory)

The solution has long been known: apply testing practices within Continuous Integration (CI). And it can and should be applied to data!

The task is formulated as follows:

  • Minimize the likelihood of errors in the master branch and the DWH PROD environment after releases.

Implementation in the Warehouse:

  • Prepare an environment for CI (for example, an up-to-date copy of the PROD environment containing only the last 7 days of data)

  • Require a full, error-free recalculation of all data marts and metrics before a feature branch is allowed to merge into master, as sketched below
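
A minimal sketch of such a pipeline, written here as a hypothetical GitHub Actions workflow and assuming dbt (the target name and versions are ours):

name: dwh-ci
on: pull_request

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      # Install dbt and package dependencies
      - run: pip install dbt
      - run: dbt deps
      # Full rebuild of all data marts against the 7-day CI copy of PROD
      - run: dbt run --target ci
      # Schema and business-logic tests must pass before merging
      - run: dbt test --target ci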

Cross-checking the DWH against the sources

We expect the Data Warehouse to reflect the current state (as well as the entire history) of the data sources:

  • All records present in the source are also present in the DWH

  • Exact one-to-one matching of attribute values (status, timestamps, metrics)

We want to guard against situations where part of the data simply never reaches the Warehouse for technical reasons, as well as against lost changes in record state (for example, the finalization of an order's status or an adjustment of its value).

The task is formulated as follows:

  • Make sure that the Warehouse is in a consistent state with the sources.

This task has one of the most complex implementations and could be the topic of a separate article; judge for yourself:

  • Determine the set of attributes to be verified: identifiers, foreign keys, dimensions (country, city, status), metrics (cost, duration, commission).

  • Extract from the source all rows that are currently relevant

  • Load those rows into the DWH and prepare the reconciliation logic (a sketch follows below)

  • Set up visualization and notifications
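
A minimal sketch of the reconciliation logic, assuming the source snapshot has been loaded into a staging table (all table and column names here are hypothetical):

-- Full outer join: find rows missing on either side or differing in attributes
select
    coalesce(src.order_id, dwh.order_id) as order_id,
    case
        when dwh.order_id is null then 'missing_in_dwh'
        when src.order_id is null then 'missing_in_source'
        else 'attributes_differ'
    end as discrepancy
from source_orders_snapshot as src
full outer join dwh_orders as dwh
    on src.order_id = dwh.order_id
where dwh.order_id is null
   or src.order_id is null
   or src.status <> dwh.status
   or src.amount <> dwh.amount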

The visual representation allows drilling down to the level of atomic records.

Putting it all together

The ideas and principles outlined in this article are transparent and clear. Now is the time to apply the best development and testing practices to data processing:

  • Regular monitoring, collection and analysis of metrics

  • Continuous Integration and Testing

  • Setting up notifications and alerts for the team

  • Proactive work to eliminate incidents and causes of errors

  • Managing user expectations in case of problems (We have everything under control)

In this article, I showed the approach to data quality control that we are building as a team at Wheely. I hope you have learned something new and useful.

An extensive set of topics related to processing, storing, and testing data is covered in the Data Engineer course at OTUS, which launches very soon.

As a course instructor, I invite you to the Data Engineer course Open Day on November 4th at 8:00 p.m. Come to the OTUS webinars to get to know me and the other experts; we'll be waiting for you.

What else to read

Finally, I’ll leave you a few links on related topics for further study:

  1. Data Build Tool or what is common between Data Warehouse and Smoothie – DBT overview in Russian

  2. The farm-to-table testing framework – an integrated approach to data quality testing

  3. Tests – Related reference docs – the testing section of the DBT documentation

  4. How to get started with data testing – thread on dbt discourse with a discussion on the topic

  5. Data testing: why you need it – a look at the benefits of data testing

  6. Manual Work is a Bug – a few words about the principles of automation and DRY
