What a Data Service is and why it can be useful to you

How we do this is described below the cut!

When working with data in storage, we use several approaches simultaneously:

  • Centralized data management using Big Data. For example, when you need to consolidate data from different products to build a recommendation system or launch an advertising campaign.

  • Data management within product teams, when we need to be sure that the data we receive is high-quality and usable.

In this article we will talk about the first approach: working with data from the Big Data warehouse. Let's go!

Things of bygone days: where did the data warehouse at MTS begin?

Once upon a time, fewer than 10 source systems were integrated into the warehouse. All data flows were supported by verbal agreements between the supplier team and data consumers.

Gradually the number of sources grew to more than 100, and the number of loading processes to more than 3,000. The operations team faced an acute problem of manageability and support of integrations, both on the source side and on the warehouse side. A source system would not miss the chance to change the type or format of a transmitted field without warning the warehouse. Because of this, integrations broke, and instead of a morning cup of coffee the operations team ran around looking for the cause.

There were also a lot of consumers: more and more product teams and analyst groups were using the warehouse. And if an object changed (also unexpectedly), the operations team had to wonder: “Won't something break on the consumers' side? For example, reporting, models, advertising campaigns?” And then send a mailing to every consumer of the warehouse, since the consumers of the changed object were usually unknown.

In addition, the Big Data team checked data quality on its own, because erroneous and low-quality data could affect our colleagues' work.

Moving to Data Service

We realized that we couldn’t live like this anymore. Teams within the company need to create standardized and transparently managed data-driven services that everyone can understand.

This means we need a Data Product, or a Data Service, as we called it. You can read more about it here and here. It helps to create and distribute data in analytical and operational scenarios.

For us, Data Service is data that:

  • is needed by more than one consumer;

  • is located in a warehouse or on an integration platform;

  • is “packaged” in accordance with the standards defined in the Ecosystem to ensure its quality, availability, and ease of use.

Creating a Data Service involves following the key principles of working with data:

  • Data Ownership
    The owner of the data is the product that generates it. Ownership means managing the data and being responsible for it in accordance with standards and commitments.

  • Data-centricity
    Data is the core asset on which products, services, and processes are built, and it is those products, services, and processes that generate it.

  • Using platforms
    The solutions and products we create for all typical management, development, delivery and operation tasks use XaaS platforms of the following categories: PaaS, Product, DevOps, ArchOps, Integration, SRE/Monitoring Data, DataOps/MLOps, Security.

  • Machine-readable documentation
    Documentation in general and the Data Service contract in particular should be drawn up in the form of special manifests and similar documents with semi-structured data in a standard format.

  • Partitioning
    Solutions are divided vertically into a combination of autonomous and functional partitions at all levels (Teams, Front-end, Back-end, Data, Infrastructure). These layers consist of sets of loosely coupled components and services (UI/Service/Data-Mesh) maintained and developed by autonomous cross-functional teams.

  • Zero trust
    All users (within and outside the organization) must undergo authentication, authorization, and ongoing security checks of their account, devices, and network environment. Without this, they cannot gain and maintain access to applications and data.

All of this can be implemented using DataOps platform tools, which you can read about in our colleagues' article.

The main difference from the previous approach is standardization and productization. A Data Service is no longer “some kind of integration” or “we provide data to Big Data”. Any MTS product, for example KION or Strings, can create and publish its own Data Service so that other products can easily and quickly use it. Now we'll tell you exactly how.

Where did we start?

Before we started implementing Data Service, the standard process for developing data marts in a warehouse looked something like this:

There was no single approach to how a Data Service is designed, what artifacts should appear at each stage of its development, and what requirements they should meet.

Our first Data Service

The company usually pilots all proposed approaches, and that is what we did. The first Data Service was the CVM classifier of products and services, assembled from three different systems. Let's look at the example of a cashback service. It was chosen because many related areas use it for various purposes, for example:

  • to manage loyalty programs – awarding bonuses and cashback to users for certain actions;

  • when evaluating the effectiveness of marketing and advertising campaigns (we look at which campaign or promotion achieved its goal);

  • when analyzing customer engagement with cashback;

  • when tracking bonus and cashback transactions for correct classification of the services connected to subscribers (to correctly measure subscriber response).

The classifier has many consumers, but they all used to receive data from it on request, as an Excel spreadsheet.

Key properties of Data Service

To move to Data Service, we decided to highlight several key properties, and this is what happened:

  1. Discoverability: Data is useless if it cannot be found. The Data Service should be easy to locate in the data catalog and should have well-described metadata showing the provenance, owners, and source of the data.

Before: the classifier was created and stored in the marketing center, and only a select few knew about it.
After: the classifier is available in the data catalog, and any user can find it. Its metadata is described in accordance with company guidelines, which helps future consumers locate it.

  2. Addressability: The Data Service must have a unique name and location in accordance with the rules we define (a naming convention). This is necessary so that users can get the data through a specific, well-defined interface.

Before: the classifier was created in Excel format and sent on request to selected employees.
After: the classifier is available as a separate service in the database, and once access is approved, users can set up regular, automated delivery of up-to-date data from it.

  3. Data quality: If the data is not trustworthy and accurate, it is useless. Service owners must take responsibility for data quality and adhere to approved service levels.

Before: data quality in the classifier was only thought about when a problem arose ad hoc: when consumers came and pointed out an error, or noticed it themselves.
After: we have formulated and automated mandatory checks (more on them later) that promptly detect and flag errors in the data.

Our plans for the future of Data Service include:

  4. Secure and managed global access control: this may seem obvious, but it's a must. Access can be managed centrally, but service owners can then grant access at a granular level, fine-tuning it to the needs of individual teams.

So, the basic Data Service architecture looked something like this:

We offer only the highest quality data

Let's take a closer look at a very important property: data quality. Users must be confident that the data meets a minimum level of quality.

To track this property, we formulated three requirements that we now apply to all our Data Services (a code sketch of the corresponding checks is shown after the list):

  • Relevance: Data must be updated regularly.
    If a data set has no business date, we recommend adding a technical date that reflects the time the data was loaded.

For example, in the CVM classifier, the data must be updated every day from 20:00 to 23:59. A check launched at 05:00 the next day must find at least one record in the data set for the previous date.

  • Uniqueness: Data should not be duplicated. This is necessary for their correct interpretation.

    The data must have a defined business key: a field or set of fields that uniquely identifies a record. There must be no more than one record per business key. We recommend not including technical record identifiers, which carry no business meaning, in the business key.

For example, the classifier should not have more than one entry per product_ID and service_ID pair.

In this example, the business key is Product_ID and Service_ID. There is no point in checking the technical ID of a record in the classifier: it is generated automatically and under no circumstances can it be duplicated.

  • Completeness: The data must not have missing values.

    All attributes that are meaningful according to business logic must be filled in. At a minimum, we recommend checking that the attributes making up the business key are populated, along with the business or technical date attribute used for the relevance check.

For example, the fields ID_product, ID_service, and status must be filled in.

The latest results of data quality checks are always available in the data catalog. If a deviation is detected, the incident is registered in the catalog for the responsible team.
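Below is a minimal sketch of how these three checks could be expressed in code, assuming the data set has been loaded into a pandas DataFrame. The function names, the load-date column, and the connection object are our own illustration, not the DataOps.DQ API; in practice the checks are configured through the check template described later.

import pandas as pd
from datetime import date, timedelta

def check_relevance(df: pd.DataFrame, date_column: str) -> bool:
    # Relevance: at least one record must exist for the previous calendar date.
    yesterday = date.today() - timedelta(days=1)
    return bool((pd.to_datetime(df[date_column]).dt.date == yesterday).any())

def check_uniqueness(df: pd.DataFrame, business_key: list) -> bool:
    # Uniqueness: no more than one record per business key.
    return not bool(df.duplicated(subset=business_key).any())

def check_completeness(df: pd.DataFrame, required_columns: list) -> bool:
    # Completeness: required attributes must not contain missing values.
    return not bool(df[required_columns].isna().any().any())

# Usage for the CVM classifier (column and connection names are hypothetical):
# df = pd.read_sql("SELECT * FROM public.dm_product_service_link", connection)
# check_relevance(df, "load_date")
# check_uniqueness(df, ["product_id", "service_id"])
# check_completeness(df, ["product_id", "service_id", "status"])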

One of our important goals is not to increase the burden on product teams when they create a Data Service according to all the rules.

The Data Service concept is still being rolled out: we are capturing requirements and processing feedback. Here are a few LESSONS LEARNED:

  • Working with a Data Service requires certain skills from our colleagues. To train everyone quickly and painlessly, we worked through a competency matrix and prepared a course for the corporate university;

  • in situations where a Data Service is needed but there are no resources (human, technical, financial, or all at once), we suggest using communal (shared, supported by another team) resources for piloting. This is enough to try out the solution, and the results of such tests make it easier to justify its further development;

  • Automating the monitoring of data quality requirements can be time-consuming. When we launched the first Data Service, the product team spent almost 3 hours creating basic data quality checks. This was unacceptable to us. To reduce implementation time on the DataOps Platform, we created a check template. It helps configure all Data Quality requirements in half an hour using the enterprise tool DataOps.DQ. The Data Service owner defines parameters for each requirement and fills out the appropriate section of the template.

An example of filling in the verification parameters:
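{# Describes one Data Service object: where the data lives and which relevance (timeliness), uniqueness and completeness checks to run on it #}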

{% set objects = [
  {
    "database": "data_service",
    "schema": "public",
    "table": "dm_product_service_link",
    "timeliness": true,
    "date_filter": "date_trunc('day',\"Date\")=current_date - 1",
    "uniqueness_columns": [],
    "complex_uniqueness_columns": [("product_id", "service_id")],
    "completeness_columns": ["product_name", "link_id", "product_category_name", "product_valid","product_id", "service_id", "link_base_connection", "link_status","service_global_code", "service_name", "service_source_id"],
  } ] %}

We plan to record the Data Service parameters in the Data Contract. It represents an agreement between the source, the consumer, and the platform on which the Data Service is published.
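To give a rough idea of what such a contract might record, here is a minimal sketch; since the contract format is still being worked out, every field name below is a hypothetical illustration, written as a Python dictionary purely for readability.

# Hypothetical sketch of a Data Contract for the CVM classifier Data Service.
# All field names are illustrative; the actual contract format is still being defined.
data_contract = {
    "data_service": "dm_product_service_link",
    "provider": "cvm_product_team",                   # the source that owns and generates the data
    "platform": "DataOps Platform",                   # where the Data Service is published
    "consumers": ["loyalty", "marketing_analytics"],  # teams that use the data
    "quality": {
        "relevance": "updated daily between 20:00 and 23:59",
        "uniqueness_key": ["product_id", "service_id"],
        "completeness_columns": ["product_id", "service_id", "status"],
    },
}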

That's all for now – we're ready to answer your questions in the comments! We will tell you how to draw up a Data Contract for a product in upcoming articles.
