Optimizing Analytics with DataHub

Below, I'll share our experience with DataHub and the situations in which the tool can be useful. We hope it will be helpful for product and data analysts, for managers focused on bringing order to processes, and for anyone interested in tooling for IT teams. Let's go!

Starting Point: Capture Key Data Sources

First, a little about the specifics of our data and where it comes from.

At Sravni, we build a financial marketplace: we provide services for online selection and purchase of MTPL policies, loans, credit score checks, and much more. The products differ in monetization model, user behavior, and data logic; in this sense, we are "20 different companies in one". Accordingly, each product generates its own set of data.

There is also a layer of external data common to all products: web and mobile analytics, CRM marketing. Some data comes from platform services shared across products, such as the authorization and registration service or SMS communications.

We also have our own sales channels, such as an affiliate program and an agent portal for the B2B2C model (we work with agents who find clients for us). And there are external platforms: partner websites and various app stores, including Google Play, RuStore, and vendor stores (Samsung, Huawei, etc.).

We load all this data into the DWH, where it is organized into several layers: first raw data, then data marts, then aggregates. From there, the data serves various tasks: building reports, alerting on business metrics, assembling segments for CRM marketing, and analyzing A/B test results.

Let's take a specific source, for example AppsFlyer, a mobile analytics system. Employees work with a ready-made showcase: a flat table they query to calculate whatever they need. But assembling it requires roughly 20-30 tables of raw data: installs on iOS and Android, organic and non-organic installs, retargeting, events, and so on. In other words, a huge array of heterogeneous data. This is where DataHub comes into play.
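To make this concrete, here is a minimal sketch of how such a flat showcase could be assembled from raw tables. The table names and columns are hypothetical, invented purely for illustration; a real pipeline joins 20-30 such sources.

```python
import pandas as pd

# Hypothetical raw AppsFlyer-style tables (real pipelines join 20-30 of these).
installs_ios = pd.DataFrame({"app_id": ["a1"], "installs": [120], "platform": ["ios"]})
installs_android = pd.DataFrame({"app_id": ["a1"], "installs": [300], "platform": ["android"]})
events = pd.DataFrame({"app_id": ["a1", "a1"], "event": ["signup", "purchase"], "count": [40, 12]})

# Stack the install tables into one long table.
installs = pd.concat([installs_ios, installs_android], ignore_index=True)

# Aggregate events per app, then attach them to the install rows.
events_agg = (
    events.groupby("app_id", as_index=False)["count"]
    .sum()
    .rename(columns={"count": "events_total"})
)

# The flat "showcase" analysts actually query.
showcase = installs.merge(events_agg, on="app_id", how="left")
print(showcase)
```

The point is not the pandas code itself (in production this would be SQL in the DWH) but the shape of the problem: many narrow raw tables collapse into one wide table, and the mapping between them is exactly what a catalog like DataHub makes visible.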

Why We Needed DataHub

Why did we decide to implement DataHub? I'll tell you with an example.

Imagine a new analyst, data engineer, or developer joins the team. Or a current employee moves to another product or area and is not yet familiar with the specifics of the new domain.

Previously, we had to run walkthroughs over Zoom: "This table is calculated this way, this data comes from here." Not the most convenient approach, so at some point we started keeping documentation in a knowledge base. Over time, updates became less and less regular, the documentation lost its relevance, and eventually everyone forgot about that page in the knowledge base.

With DataHub, everything has become much simpler. To help employees get up to speed during onboarding, we give them access to a system where they can trace the entire path from source to final value, and quickly see which data a particular report is linked to: DataHub builds these connections automatically.

Metadata is also available here: where each table is located, how many rows it has, and when it was last updated; you can read the basic description and check the subject-area tag.

DataHub lets us study the structure, drill into data sources layer by layer, and see where everything comes from. Explaining all this in words, without extensive documentation, is not easy. You can say where a report lives and what it is for, but you are unlikely to convey the full scale, the number of tables and their relationships, verbally. In this sense, DataHub serves as an entry point into the company's data landscape.
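For automation, the same metadata shown in the UI can be fetched through DataHub's GraphQL endpoint (typically served at `<host>/api/graphql`). Below is a hedged sketch that only builds the request body; the query fields are assumptions based on DataHub's public GraphQL schema, and the dataset URN is a made-up example.

```python
import json

# A sample GraphQL query for dataset metadata. Field names are assumptions
# based on DataHub's public GraphQL schema; adjust to your deployment.
DATASET_QUERY = """
query getDataset($urn: String!) {
  dataset(urn: $urn) {
    name
    properties { description }
  }
}
"""

def build_dataset_request(urn: str) -> dict:
    """Build the JSON body you would POST to <host>/api/graphql."""
    return {"query": DATASET_QUERY, "variables": {"urn": urn}}

# Hypothetical URN for illustration only.
body = build_dataset_request(
    "urn:li:dataset:(urn:li:dataPlatform:postgres,dwh.calculation_osago,PROD)"
)
print(json.dumps(body["variables"]))
```

In practice you would send this with any HTTP client, attaching your DataHub access token; the response carries the same description and properties you see in the catalog UI.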

DataHub in Practice: Case Studies

Let's look at an example of using DataHub at Sravni: analyzing data starting from a specific product showcase.

For example, we select the Calculation_Osago showcase. We see where it is located, the table names, and the row counts; the field names, data types, and descriptions are pulled in automatically.

The Documentation tab contains a full description of the showcase, data sources, update frequency, and other important information. You can click on the sources and view them in the repository.

Here you can also see per-table statistics for profiling the data. For each field, you get the minimum, maximum, average, and median values, the number of NULLs, and the number and percentage of unique values. For example, you can check for duplicate rows: make sure the Distinct indicator for the unique identifier (hash ID) is 100%. If it is not, there is a duplicate somewhere, and you can investigate the problem.
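The duplicate check boils down to simple arithmetic. A small sketch on a toy table (column names invented for illustration) shows the same "Distinct %" logic that the profiler surfaces:

```python
import pandas as pd

# Toy showcase with a hash ID; the "c" row is duplicated.
df = pd.DataFrame({"hash_id": ["a", "b", "c", "c"], "amount": [10, 20, 30, 30]})

# The same check the profiling tab surfaces as "Distinct %" of a field:
# if the distinct share of the unique identifier is below 100%, duplicates exist.
distinct_pct = df["hash_id"].nunique() / len(df) * 100
print(f"Distinct: {distinct_pct:.0f}%")  # 75% here, so duplicates are present

# keep=False marks every row involved in a duplicate, not just the repeats.
duplicates = df[df.duplicated("hash_id", keep=False)]
```

Seeing this number directly in the catalog saves a round trip to the database for an ad-hoc `COUNT(DISTINCT ...)` query.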

Next comes lineage, which displays all the connections of a specific showcase. If earlier we traced from reports back to the source, here we land in the middle of the path. You can see all the reports that depend on the showcase, each potentially maintained by a different analyst. If some business logic becomes obsolete and we need to delete it, we can check who works with these tables and whether our changes will break a report. And vice versa: we see what is assembled from what.
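Conceptually, this "what breaks if I touch this table" question is a graph traversal over lineage edges. A minimal self-contained sketch, with entirely hypothetical table and report names:

```python
from collections import deque

# Hypothetical downstream edges: table/showcase -> things built on top of it.
downstream = {
    "raw.appsflyer_installs": ["mart.calculation_osago"],
    "mart.calculation_osago": ["report.osago_funnel", "report.mobile_kpi"],
    "report.osago_funnel": [],
    "report.mobile_kpi": [],
}

def affected_by(node: str) -> set:
    """Everything that could break if `node` changes: a BFS over lineage edges."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in downstream.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(affected_by("raw.appsflyer_installs"))
```

DataHub does this traversal for you and renders it visually, but the sketch shows why a catalog with complete lineage is strictly better than asking around in chats: the blast radius of a change is a computable set, not tribal knowledge.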

Suppose something goes wrong with a report, a Telegram bot, or any other piece of analytics where a metric or value looks suspicious. For example, locations have dropped off or display incorrectly. In DataHub, we can see what the report depends on, follow the entire path, and find where the data failed to load.

Our DWH specialists build everything through ETL processes. Thanks to alerts, they learn in time when data that should have arrived did not, when a showcase was not updated, and so on. That is how we realized there were problems with locations, identified them, and started looking into what could have caused them. We warned stakeholders: such-and-such reports are temporarily unavailable and will be fixed within such-and-such a time frame. As for setup: the DWH team takes the DataHub source code, deploys it, and integrates it with all the systems our analysts use; DataHub then builds all the connections itself.

This is also convenient from an update point of view. For example, if we delete a report or merge two reports into one, there is no need to remove anything in DataHub manually: the system walks the entire pipeline and ETL itself and refreshes periodically.

DataHub helps in a similar way with business analytics. We describe communication showcases (columns, metrics, logic, links to Git) and pull dependencies through to reports in Apache Superset: convenient for both analysts and report consumers. If specific metrics drop, you can always trace which data to examine (for example, once we understand the issue is with UTM markup, we go straight to CRM).

How Not to Get Lost in Data: A Magic Pill

To summarize the capabilities of the tool, I will list the scenarios in which it turned out to be most useful to us.

  1. Onboarding colleagues. DataHub helps newcomers quickly grasp the context and understand what lives where in the data, without running around the DWH asking other analysts. It lets you go from analytics down to showcases and drill into the data, or, conversely, from data up to analytics. Among other things, it reduces the bus factor: teams' dependence on the knowledge of a few experienced employees.

  2. Investigations. When things go wrong and we need to understand exactly where, we walk the entire chain in DataHub.

  3. Viewing dependencies. If we want to delete data marts, refactor, or rename or drop fields, and make sure nothing breaks as a result, we can inspect everything in advance, and also find out who is responsible for a given dataset. One of our recent cases: migrating data marts to Greenplum. DataHub gave a transparent picture of where the data in each mart comes from; data engineers could clearly track which sources were involved and carried out the migration painlessly.

  4. Optimizing processes around data. With DataHub, you can discover, for example, that many reports read directly from the second data layer, which is suboptimal. To speed things up, you create a dedicated showcase for that data.

  5. Documentation support. With DataHub, you can create and maintain living documentation for the entire data workflow, including sources, transformations, showcases, and even business logic. Given the volume and diversity of the data, maintaining and updating such documentation here can be easier and more convenient than in Confluence.

  6. Improving data security. DataHub lets you quickly identify and evaluate who is using what data. This helps track access and minimize the risk of leaks or unauthorized use. In the shared catalog, you can configure different access levels and rights. For example, you can restrict access to personal data at the request of the information security team: analysts simply will not see the field, since it is not relevant to their work.
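Point 4 above, finding over-used tables, can be sketched as a simple dependency count. The report-to-table mapping here is hypothetical, standing in for what you would export from lineage:

```python
from collections import Counter

# Hypothetical report -> source-table dependencies, e.g. exported from lineage.
report_sources = {
    "report.osago_funnel": ["layer2.policies", "layer2.payments"],
    "report.mobile_kpi": ["layer2.policies", "mart.mobile"],
    "report.crm_segments": ["layer2.policies"],
}

# Count how many reports read each table directly; heavily shared
# second-layer tables are candidates for a dedicated showcase.
usage = Counter(t for sources in report_sources.values() for t in sources)
hot = [t for t, n in usage.items() if t.startswith("layer2.") and n >= 2]
print(hot)
```

Here `layer2.policies` is read by three reports directly, which in our workflow would be the signal to materialize a showcase on top of it.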

How Do You Know If You Need DataHub?

First, assess the scale. If analytics in your company is not that extensive (this is less about data volume than about structure and the complexity of business logic), implementing DataHub may not be practical. After all, you need to stand up the entire system, understand it, and configure it, and the payoff may not justify the effort: like hammering nails with a microscope.

Second, recheck how previous attempts to introduce something new went in your company. DataHub asks everyone who works with data to operate differently, which means a lot of explaining and demonstrating: getting people used to the idea that for data you go to the system, not to direct messages and chats.

Looking back at past internal rollouts will help you form realistic expectations, including about timelines.

Finally, look at the team's engineering culture. Is there demand for unification? How much is a shared picture of what is happening in the processes valued? This can become a blocker: help that nobody asked for is violence.

Ultimately, DataHub is not a magic pill. You can and should try other systems and see what works best for you.

The main thing is to look for growth points in how you work with data and to balance effort against the potential benefit of any innovation. Don't get ahead of yourself, but don't cling to the status quo either; choose the right moment for improvements. Which tool you end up using is a secondary matter.
