Introducing Artie Transfer


As a heavy data warehouse (DWH) user in various roles, I have always run into limitations (and frustrations!) around the latency between data in the DWH and the data that resides in online transactional (OLTP) databases. Depending on where I worked, the delay ranged from a few hours to a day or more, with larger companies tending to be slower.

So in which use cases does this kind of replication delay get in the way?

Since the DWH is a platform, many use cases can be improved or unlocked by reducing data latency. Here are some examples:

Companies with a high operational load

In companies with a high operational load, requirements and business processes tend to change constantly. The engineering team usually can’t keep up, so solutions like Zapier, Typeform, Retool, Tinybird, and other no-code tools have become part of the standard toolset for these businesses.

Such tools can be stacked on top of each other and reference data in the DWH, and their effectiveness is determined by the replication latency. Not every table needs to be lightning fast. That would be nice… but most companies have just a few mission-critical tables where that kind of speed really matters.

For example, a food delivery company might set up a Retool app as a custom Zendesk application so that support agents can retrieve a customer’s latest orders and interactions.

In addition to the previous examples, companies in the following industries would benefit from reducing data latency on their so-called critical tables:

  • Food delivery: orders

  • Real estate: property, documents

  • Ridesharing: trips

  • Support: tickets, emails

Lifecycle and Paid Marketing Use Cases

For lifecycle marketing, teams typically purchase marketing automation tools like Iterable, Braze, and Intercom. Each of these tools has its own notion of what the user model and events should look like; for example, marketers can create email templates such as: Hi {{first_name}}!

Examples of additional user attributes to send:

  1. Paid Marketing: If a customer requests a ride (ridesharing), we want to send as many of their attributes as possible to services like Google and Facebook so they can optimize their algorithms and find the best matches.

  2. Customer Acquisition Campaigns: When a customer registers on our site, we want a drip campaign that effectively welcomes and engages them. We would like to take into account dynamic fields, such as product interactions (did they do anything beyond registering? Have they viewed a page? Have they requested a trip?) and other customer attributes.

So how do we send this data today?

Typically, teams build separate pipelines with custom schemas for sending data to each destination. This setup has several disadvantages:

  1. Changes are blocked by engineering.

  2. Custom fields are almost never all in one service. As a result, we have to call other services (which now requires error handling and retries). Backfills can also generate too much traffic and DDoS internal services that are not built to handle the load.

  3. The pipelines require maintenance and ongoing support. If a field is added incorrectly and does not reference an indexed column, it can slow down the entire pipeline.

  4. The data is usually not in the DWH. As a result, segmentation and reporting are difficult.

This could be solved by creating materialized views, especially with a more expressive framework like dbt. However, such solutions are not mature enough to handle most marketing use cases.

Why aren’t dbt or materialized views enough?

When creating a materialized view, we need to specify how often it is refreshed by setting a schedule (for example, with Snowflake tasks). To visualize the resulting delay, we can write the following equation:

View Staleness = MAX(Referenced Table Staleness) + View Refresh Interval

For example, we can schedule the materialized view to refresh every 5 minutes, but it won’t matter if the referenced table is only updated every 6 hours: the resulting view is still 6 hours behind.
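For concreteness, here is a minimal sketch of what such a scheduled refresh could look like with a Snowflake task (the connection details, table, and task names are hypothetical; a dbt model on a cron schedule would play the same role). No matter how aggressive the schedule, the view can only be as fresh as the referenced raw table:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical connection details.
conn = snowflake.connector.connect(user="loader", password="...", account="acme-xy12345",
                                    warehouse="transform_wh", database="analytics", schema="public")
cur = conn.cursor()

# Rebuild a derived table every 5 minutes. Even so, it can only be as fresh as the
# referenced raw table (orders_raw), which is the first term of the equation above.
cur.execute("""
    CREATE OR REPLACE TASK refresh_orders_summary
        WAREHOUSE = transform_wh
        SCHEDULE = '5 MINUTE'
    AS
        CREATE OR REPLACE TABLE orders_summary AS
        SELECT customer_id, COUNT(*) AS order_count, MAX(created_at) AS last_order_at
        FROM orders_raw
        GROUP BY customer_id
""")
cur.execute("ALTER TASK refresh_orders_summary RESUME")  # tasks are created suspended
```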

That is why we created Artie Transfer. Artie Transfer tackles the first part of the equation, MAX(Referenced Table Staleness), by eliminating the latency of the raw OLTP tables. With that term out of the way, getting fresher views is simply a matter of increasing the refresh frequency.

How do companies solve the problem of data latency today and why is this not the best solution?

The traditional DWH data update process looks something like this:

  • Taking a snapshot of a Postgres table (in CSV format) with pg_dump

  • Parsing output and formatting data for DWH consumption

  • Loading this data into DWH

And again… how many times a day can you realistically do this?
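For illustration, here is a minimal sketch of this batch snapshot approach, assuming a Postgres source and a Snowflake destination (table names, paths, and credentials are hypothetical; the CSV export step is sketched with psql’s \copy command):

```python
import subprocess
import snowflake.connector  # pip install snowflake-connector-python

# 1. Snapshot the Postgres table to CSV (hypothetical connection string and table).
subprocess.run(
    ["psql", "postgresql://app:secret@oltp-host:5432/app_db",
     "-c", r"\copy orders TO '/tmp/orders.csv' WITH (FORMAT csv, HEADER)"],
    check=True,
)

# 2. Parsing/formatting for DWH consumption would happen here (type coercion, renames, etc.).

# 3. Load the snapshot into the warehouse via the table stage + COPY INTO.
conn = snowflake.connector.connect(user="loader", password="...", account="acme-xy12345",
                                    warehouse="load_wh", database="analytics", schema="public")
cur = conn.cursor()
cur.execute("PUT file:///tmp/orders.csv @%orders_raw OVERWRITE = TRUE")
cur.execute("""
    COPY INTO orders_raw
    FROM @%orders_raw
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
```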

Savvy companies that suffer from this tend to buy Fivetran for Databases or Fivetran Teleport. However, such a solution is not always acceptable, for reasons such as cost, missing features/integrations, and others.

Engineering teams can and have tried to build this on their own, but the solution is complex and hard to scale. Why?

  • Change data capture (CDC) technology does not support DDL (data definition language) changes such as adding or removing columns.

  • Companies usually use more than one type of database, and each of them requires a different parser and possibly a new pipeline.

  • Reliability is non-negotiable. Fast data replication is great, but only if the results are reliable. If the backend skips a CDC event or processes rows out of order, the data is no longer consistent and the end result is a misleading picture.

  • DWHs are designed to handle a high QPS (queries per second) of COPY commands, and require a workaround to handle data mutations such as updates and deletes (see the sketch below).
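As a rough illustration of that last workaround, CDC pipelines typically land changed rows in a staging table and then fold them into the target with a MERGE statement rather than issuing row-by-row updates. A hedged sketch against Snowflake (the table and column names, including the __deleted flag, are hypothetical):

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(user="loader", password="...", account="acme-xy12345",
                                    warehouse="load_wh", database="analytics", schema="public")
cur = conn.cursor()

# Fold a micro-batch of CDC rows (already copied into orders_staging) into the target.
# __deleted is a hypothetical flag carried over from the CDC event.
cur.execute("""
    MERGE INTO orders AS t
    USING orders_staging AS s
        ON t.id = s.id
    WHEN MATCHED AND s.__deleted THEN DELETE
    WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status
    WHEN NOT MATCHED AND NOT s.__deleted THEN
        INSERT (id, amount, status) VALUES (s.id, s.amount, s.status)
""")
```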

Artie Transfer Introduction

Artie Transfer continuously streams OLTP data (via CDC) and replicates it to the designated DWH, reducing replication latency from hours/days to seconds. We believe this will allow the broader ecosystem to build more accurate views and unlock use cases that were previously out of reach because of this limitation.

To deliver on this promise, here is how Artie Transfer is set up under the hood.

The end-to-end data flow is divided into two parts:

#1 – Capturing CDC Events

We use a connector to read the database logs and publish change events to Kafka; Debezium is typically used for this where it is available (see the sketch after the list below).

  • There will be one topic per table, so subsequent workloads can be scaled independently based on the throughput of the table.

  • The Kafka message partition key will be the table’s primary key, so all events for a given row land in the same partition and stay ordered.
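A hedged sketch of what registering such a connector can look like via the Kafka Connect REST API, assuming a Debezium Postgres connector (the connector name, hosts, credentials, and table list are hypothetical, and exact property names vary by Debezium version):

```python
import json
import requests

# Hypothetical Debezium Postgres connector: streams WAL changes for public.orders
# into a Kafka topic per table (e.g. app_db.public.orders).
connector = {
    "name": "app-db-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "oltp-host",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "...",
        "database.dbname": "app_db",
        "table.include.list": "public.orders",
        "topic.prefix": "app_db",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```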

#2 – Artie Transfer

Artie Transfer subscribes to the Kafka topics and builds an in-memory representation of their current state (a single Artie Transfer consumer can subscribe to one or more Kafka topics). Then, when the flush interval elapses (10s by default) or when the in-memory store is full, whichever comes first, Artie Transfer flushes: it runs an optimized merge query against the target data store and merges the micro-batch of changed data.
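A greatly simplified sketch of that consume → buffer → flush-and-merge loop (this is not Artie Transfer’s actual implementation; the topic name, thresholds, and the merge_into_warehouse helper are hypothetical):

```python
import json
import time
from kafka import KafkaConsumer  # pip install kafka-python

FLUSH_INTERVAL_SECONDS = 10      # the default flush interval mentioned above
MAX_BUFFERED_ROWS = 50_000       # hypothetical "in-memory store is full" threshold

consumer = KafkaConsumer(
    "app_db.public.orders",                  # one CDC topic per table
    bootstrap_servers="kafka:9092",
    group_id="transfer-sketch",
    value_deserializer=lambda v: json.loads(v),
)

buffer = {}                      # primary key -> latest row image
last_flush = time.monotonic()

for message in consumer:
    # Later events for the same key overwrite earlier ones, so only the latest
    # state of each row is merged on flush.
    buffer[message.key] = message.value

    if len(buffer) >= MAX_BUFFERED_ROWS or time.monotonic() - last_flush >= FLUSH_INTERVAL_SECONDS:
        merge_into_warehouse(buffer)   # hypothetical: stage rows + MERGE into the target table
        buffer.clear()
        last_flush = time.monotonic()
    # (A real implementation would also flush on a timer even when no new messages arrive.)
```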

To support this workflow, Artie Transfer has the following features:

  1. Automatic retries and idempotency. We take reliability seriously; it is non-negotiable. Reducing latency is great, but it doesn’t matter if the data is wrong. We provide automatic retries and idempotent processing so that the destination always reaches eventual consistency.

  2. Automatic table creation. Transfer will create a table in the specified database if it does not exist.

  3. Error reporting. Provide your Sentry API key and processing errors will appear in your Sentry project.

  4. Schema detection. Transfer will automatically detect column changes and apply them to the destination.

  5. Scalable architecture. The Transfer architecture remains the same whether we are dealing with 1GB or 100+TB of data.

  6. Low latency. Transfer is built on top of a consumer framework and continuously streams messages in the background. Say goodbye to schedulers!

In addition to the open source offering, we also provide a paid managed version that includes:

  1. Setting up your database to enable CDC and providing connectors to read and publish messages to Kafka.

  2. Running a managed deployment of Artie Transfer.

Finally, we believe that zero-latency replication between OLTP and OLAP databases should be the norm and widely available. That is why we have open-sourced Artie Transfer!

