Data Build Tool, or What a Data Warehouse and a Smoothie Have in Common

What are the principles of an ideal Data Warehouse?

Focus on business value and analytics instead of boilerplate code. Managing the DWH as a codebase: versioning, review, automated testing, and CI. Modularity, extensibility, open source and community. User-friendly documentation and dependency visualization (Data Lineage).

More about all of this and about the role of DBT in the Big Data & Analytics ecosystem – read on.

Hello everyone!

My name is Artemy Kozyr. I have been working with data warehouses for over 5 years, building ETL/ELT as well as data analytics and visualization. I currently work at Wheely and teach the Data Engineer course at OTUS, and today I want to share an article I wrote on the eve of the start of a new course enrollment.

A brief overview

The DBT framework is all about the letter T in the acronym ELT (Extract – Load – Transform).

With the advent of performant and scalable analytical databases such as BigQuery, Redshift, and Snowflake, there is no longer any point in performing transformations outside the Data Warehouse.

DBT does not load data from sources; instead, it provides tremendous opportunities for working with data that has already been loaded into the Warehouse (Internal or External Storage).

The main job of DBT is to take your code, compile it into SQL, and execute the commands in the correct order against the Warehouse.

DBT project structure

A project consists of directories and files of just two types:

  • Model (.sql) – unit of transformation expressed by a SELECT query
  • Configuration file (.yml) – parameters, settings, tests, documentation

At a basic level, work is structured as follows:

  • The user writes model code in any convenient IDE
  • Models are launched via the CLI, and DBT compiles the model code into SQL
  • The compiled SQL is executed in the Warehouse in the order defined by the dependency graph

Here’s what the launch from the CLI might look like:
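For example, a typical session might consist of the commands below (the model name is illustrative); dbt compiles each selected model and runs the result against the Warehouse:

dbt run                    # build all models in the project
dbt run --models f_orders  # build a single model
dbt test                   # run the tests defined in the .yml configuration files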

Everything is SELECT

This is the killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code related to materializing your queries in the Warehouse (all the variations of CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, … commands).

Writing any model comes down to writing a single SELECT query that defines the resulting data set.

At the same time, the transformation logic can be multi-level and consolidate data from several other models. Here is an example of a model that builds an orders data mart (f_orders):

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final

What interesting things can we see here?

First: CTEs (Common Table Expressions) are used to organize and make sense of code that contains many transformations and a lot of business logic.

Second: model code is a mix of SQL and Jinja, a templating language.

The example uses a for loop to generate an amount column for each payment method listed in the set expression, as well as the ref function – the ability to reference other models from inside the code:

  • At compile time, ref is resolved into a pointer to the target table or view in the Warehouse (see the example after this list)
  • ref is what allows DBT to build the model dependency graph
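For example, after compilation the first CTE above might turn into something like this (the actual schema name depends on your profile and project settings):

   select * from analytics.stg_orders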

It is Jinja that gives DBT almost unlimited possibilities. The most commonly used constructs (illustrated in the sketch after this list) are:

  • If / else – branching statements
  • For loops
  • Variables
  • Macros – creating reusable macros
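A minimal sketch of how variables and branching look inside a model (the column names and the threshold are illustrative, not taken from the project above):

{% set amount_threshold = 100 %}
 
select
    order_id,
    amount,
    {% if target.name == 'prod' %}
    amount >= {{ amount_threshold }} as is_large_order
    {% else %}
    true as is_large_order
    {% endif %}
from {{ ref('stg_orders') }}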

Materialization: Table, View, Incremental

A materialization strategy is the approach used to store the model's resulting data set in the Warehouse.

In the basic case, these are:

  • Table – a physical table in the Warehouse
  • View – a view, i.e. a virtual table in the Warehouse

There are also more complex materialization strategies (a configuration sketch follows this list):

  • Incremental – incremental loading (for large fact tables); new rows are added, changed rows are updated, deleted rows are removed
  • Ephemeral – the model is not materialized directly, but participates as a CTE in other models
  • Any other strategy you can add yourself
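The strategy is selected in a model's config block (or at the directory level in dbt_project.yml); a minimal sketch:

-- materialized can be 'table', 'view', 'incremental' or 'ephemeral'
{{ config(materialized='view') }}
 
select * from {{ ref('stg_orders') }}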

Beyond materialization strategies, there are optimization options for specific Warehouses, for example (a BigQuery sketch follows this list):

  • Snowflake: Transient tables, Merge behavior, Table clustering, Copying grants, Secure views
  • Redshift: Distkey, Sortkey (interleaved, compound), Late Binding Views
  • Bigquery: Table partitioning & clustering, Merge behavior, KMS Encryption, Labels & Tags
  • Spark: File format (parquet, csv, json, orc, delta), partition_by, clustered_by, buckets, incremental_strategy
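As an illustration, here is a hedged sketch of a BigQuery-specific configuration (the partitioning column and clustering key are assumptions, not part of the project above):

{{
  config(
    materialized='table',
    partition_by={'field': 'order_date', 'data_type': 'date'},
    cluster_by=['customer_id']
  )
}}
 
select * from {{ ref('stg_orders') }}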

The following Warehouses (adapters) are currently supported:

  • Postgres
  • Redshift
  • Bigquery
  • Snowflake
  • Presto (partially)
  • Spark (partially)
  • Microsoft SQL Server (community adapter)

Let’s refine our model:

  • Make it load incrementally (Incremental)
  • Add distribution and sort keys for Redshift

-- Model configuration:
-- Incremental load, unique key for updating records (unique_key)
-- Distribution key (dist), sort key (sort)
{{
  config(
       materialized='incremental',
       unique_key='order_id',
       dist="customer_id",
       sort="order_date"
   )
}}
 
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
   where 1=1
   {% if is_incremental() -%}
        -- This filter will be applied only on an incremental run
       and order_date >= (select max(order_date) from {{ this }})
   {%- endif %} 
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final
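A note on the two new constructs: is_incremental() returns true only when the model's target table already exists in the Warehouse and the run is not a full refresh, and {{ this }} refers to the model's own target table – together they restrict the scan to rows newer than what has already been loaded.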

Model Dependency Graph

It is also known as a dependency tree, or a DAG (Directed Acyclic Graph).

DBT builds the graph from the configuration of all project models – more precisely, from the ref() links inside models to other models. Having the graph allows you to do the following:

  • Running models in the correct sequence
  • Parallelizing the building of data marts
  • Running an arbitrary subgraph

Graph visualization example:

Each node of the graph is a model; the edges of the graph are defined by ref expressions.
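Having the graph also means you can select subgraphs right from the CLI using node selection syntax, for example (model names are taken from the example above):

dbt run --models +f_orders     # f_orders and everything upstream of it
dbt run --models stg_orders+   # stg_orders and everything downstream of it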

Data Quality and Documentation

Besides building the models themselves, DBT lets you test a number of assertions about the resulting data set, such as:

  • Not null
  • Unique
  • Referential Integrity – for example, customer_id in the orders table matches id in the customers table
  • Matching a list of accepted values

You can add your own tests (custom data tests), for example the % deviation of revenue from its value a day, a week, or a month ago. Any assumption formulated as an SQL query can become a test.

This way, unwanted deviations and data errors can be caught in the Warehouse's data marts.
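As a hedged sketch, such a custom data test is just a .sql file in the project's tests/ directory containing a query that returns the offending rows – the test fails if any rows come back (the f_daily_revenue model and the 50% threshold here are illustrative):

-- tests/assert_revenue_did_not_collapse.sql (illustrative name)
select
    today.revenue_date,
    today.revenue,
    yesterday.revenue as revenue_day_before
from {{ ref('f_daily_revenue') }} as today
join {{ ref('f_daily_revenue') }} as yesterday
    on yesterday.revenue_date = today.revenue_date - 1
-- flag days where revenue dropped by more than 50% versus the previous day
where today.revenue < 0.5 * yesterday.revenue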

In terms of documentation, DBT provides mechanisms for adding, versioning, and distributing metadata and comments at the model level and even at the level of individual attributes (columns).

Here is how adding tests and documentation looks at the configuration file level:

 - name: fct_orders
   description: This table has basic information about orders, as well as some derived facts based on payments
   columns:
     - name: order_id
       tests:
          - unique # check that values are unique
          - not_null # check for nulls
       description: This is a unique identifier for an order
     - name: customer_id
       description: Foreign key to the customers table
       tests:
         - not_null
          - relationships: # referential integrity check
             to: ref('dim_customers')
             field: customer_id
     - name: order_date
       description: Date (UTC) that the order was placed
     - name: status
       description: '{{ doc("orders_status") }}'
       tests:
          - accepted_values: # check against accepted values
             values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

And here is how this documentation looks on the generated website:

Macros and Modules

The point of DBT is not so much to be a set of SQL scripts as to give users a powerful and feature-rich toolkit for building their own transformations and distributing them as modules.

Macros are sets of constructs and expressions that can be called as functions within models. Macros allow you to reuse SQL between models and projects in accordance with the DRY (Don’t Repeat Yourself) engineering principle.

Macro Example:

{% macro rename_category(column_name) %}
case
 when {{ column_name }} ilike  '%osx%' then 'osx'
 when {{ column_name }} ilike  '%android%' then 'android'
 when {{ column_name }} ilike  '%ios%' then 'ios'
 else 'other'
end as renamed_product
{% endmacro %}

And its use:

{% set column_name = 'product' %}
select
 product,
 {{ rename_category(column_name) }} -- macro call
from my_table

DBT comes with a package manager that allows users to publish and reuse individual modules and macros.

This means the ability to download and use libraries such as:

  • dbt_utils: helpers for working with dates and times, surrogate keys, schema tests, pivot / unpivot, and more
  • Ready-made data mart templates for services such as Snowplow and Stripe
  • Libraries for specific Data Warehouses, for example Redshift
  • logging – a module for logging DBT runs

A complete list of packages is available at dbt hub.
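Packages are declared in a packages.yml file at the root of the project and installed with dbt deps; a minimal sketch (the package version shown is illustrative – check dbt hub for current releases):

packages:
  - package: fishtown-analytics/dbt_utils
    version: 0.6.4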

Even more features

Here I will describe a few other interesting features and implementations that my team and I use to build the Data Warehouse at Wheely.

Separation of runtime environments DEV – TEST – PROD

Even within a single DWH cluster (using different schemas). For example, with the following expression:

with source as (
 
   select * from {{ source('salesforce', 'users') }}
   where 1=1
    {% if target.name in ['dev', 'test', 'ci'] -%}
        and timestamp >= dateadd(day, -3, current_date)
    {%- endif %}
 
)

This code literally says: for the dev, test, and ci environments, take only the last 3 days of data and no more. Runs in these environments will therefore be much faster and require fewer resources. When running in the prod environment, the filter condition is ignored.

Alternative column encoding

Redshift is a columnar DBMS that lets you specify a data compression algorithm for each individual column. Choosing optimal algorithms can reduce disk space usage by 20–50%.

The redshift.compress_table macro runs the ANALYZE COMPRESSION command, creates a new table with the recommended column encodings and the specified distribution (dist_key) and sort (sort_key) keys, transfers the data into it, and, if necessary, deletes the old copy.

Macro Signature:

{{ compress_table(schema, table,
                 drop_backup=False,
                 comprows=none|Integer,
                 sort_style=none|compound|interleaved,
                 sort_keys=none|List,
                 dist_style=none|all|even,
                 dist_key=none|String) }}

Logging Model Launches

For each model run, you can attach hooks that are executed before the run or immediately after the model is created – for example, at the project level in dbt_project.yml:

models:
    pre-hook: "{{ logging.log_model_start_event() }}"
    post-hook: "{{ logging.log_model_end_event() }}"

The logging module records all the necessary metadata in a separate table, which you can later use to audit runs and analyze bottlenecks.

Here is what a dashboard built on this logging data looks like in Looker:

Warehouse Automation

If you use extensions of your Warehouse's functionality, such as UDFs (User Defined Functions), DBT is a very convenient place to version these functions, control access to them, and automatically roll out new releases.

We use Python UDFs to compute hash values, parse email domains, and decode bitmasks.

An example macro that creates a UDF in any environment (dev, test, prod):

{% macro create_udf() -%}
 
 {% set sql %}
       CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
           RETURNS varchar
           LANGUAGE plpythonu
           STABLE
       AS $$  
           import hashlib
           return hashlib.sha256(mes).hexdigest()
       $$
       ;
 {% endset %}
  
 {% set table = run_query(sql) %}
 
{%- endmacro %}
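Such a macro can then be invoked, for example, with dbt run-operation create_udf, or from an on-run-start hook in dbt_project.yml, so that the functions are (re)created before the models run.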

At Wheely, we use Amazon Redshift, which is based on PostgreSQL. For Redshift, it is important to regularly collect statistics on tables and free up disk space – the ANALYZE and VACUUM commands, respectively.

To do this, the commands from the redshift_maintenance macro are executed every night:

{% macro redshift_maintenance() %}
 
   {% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
 
   {% for row in vacuumable_tables %}
       {% set message_prefix=loop.index ~ " of " ~ loop.length %}
 
       {%- set relation_to_vacuum = adapter.get_relation(
                                               database=row['table_database'],
                                               schema=row['table_schema'],
                                               identifier=row['table_name']
                                   ) -%}
       {% do run_query("commit") %}
 
       {% if relation_to_vacuum %}
           {% set start=modules.datetime.datetime.now() %}
           {{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
           {% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
           {{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
           {% do run_query("ANALYZE " ~ relation_to_vacuum) %}
           {% set end=modules.datetime.datetime.now() %}
           {% set total_seconds = (end - start).total_seconds() | round(2)  %}
           {{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
       {% else %}
           {{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
       {% endif %}
 
   {% endfor %}
 
{% endmacro %}

DBT Cloud

You can also use DBT as a service (Managed Service). It includes:

  • A web IDE for developing projects and models
  • Job configuration and scheduling
  • Simple and convenient access to logs
  • A website with your project's documentation
  • CI (Continuous Integration) integration

Conclusion

Preparing and consuming a DWH is just as enjoyable and healthy as making and drinking a smoothie. DBT is made up of Jinja, user extensions (modules), a compiler, an executor, and a package manager. Put these pieces together and you get a fully fledged working environment for your Data Warehouse. There is hardly a better way to manage transformations within a DWH today.


The beliefs that guide the DBT developers are formulated as follows:

  • Code, not GUIs, is the best abstraction for expressing complex analytic logic
  • Working with data should adopt the best practices of software engineering
  • Critical data infrastructure should be controlled by the user community as open source software
  • Not only analytics tools, but code too, will increasingly become part of the Open Source community

These core beliefs have spawned a product that is used today by more than 850 companies, and they form the basis of many interesting extensions that will be created in the future.

For those who are interested, there is a recording of an open lesson I held a few months ago at OTUS – Data Build Tool for Amazon Redshift Storage.

Besides DBT and Data Warehouses, my colleagues and I teach a number of other relevant and modern topics as part of the Data Engineer course on the OTUS platform:

  • Architectural concepts for Big Data applications
  • Hands-on practice with Spark and Spark Streaming
  • Methods and tools for loading data from sources
  • Building analytical data marts in the DWH
  • NoSQL concepts: HBase, Cassandra, ElasticSearch
  • Principles of monitoring and orchestration
  • Final project: putting all the skills together with mentor support

References:

  1. DBT documentation – Introduction – Official documentation
  2. What, exactly, is dbt? – Review article by one of the authors of DBT
  3. Data Build Tool for Amazon Redshift Storage – YouTube, OTUS Open Lesson Recording
  4. Introducing Greenplum – The nearest open lesson is May 15, 2020
  5. Data Engineering Course – OTUS
  6. Building a Mature Analytics Workflow – A look at the future of working with data and analytics
  7. It’s time for open source analytics – Evolution of analytics and the influence of Open Source
  8. Continuous Integration and Automated Build Testing with dbtCloud – Principles of building CI using DBT
  9. Getting started with DBT tutorial – Practice, step-by-step instructions for independent work
  10. Jaffle shop – Github DBT Tutorial – Github, training project code


