Import substitution of a Data Quality stack in petrochemicals: the SIBUR experience
SIBUR has a great deal of data flowing in real time from numerous sensors at different production facilities; this data needs to be collected, stored, processed, and analyzed so that the company can make the right business decisions. The profitability of production and the profit of the company as a whole depend on the quality of the infrastructure for working with data, and these are vital indicators.
In a short series of two articles, we will look at SIBUR's experience in creating, supporting, and developing a DQ (Data Quality) service for a DWH (Data Warehouse) under sanctions, with vendors of proven and familiar solutions disappearing from the market.
This experience will be shared by Alexander Berger, Lead DQ Analyst at Digital SIBUR, who was lucky enough to lead the creation of a DQ service based on the solutions of a vendor that decided to leave the Russian market in the midst of the work.
What is Data Quality?
In short, a DQ service is a set of tools for checking the quality of data entering the warehouse, transferred between storage layers, or already stored in the DWH.
Data quality must be checked in order to understand whether particular data can be trusted or not. This is critical, because management decisions are made on the basis of this data, and the cost of an error in this process is very high, especially in the industrial sector, of which SIBUR is a prominent representative.
Modern information technologies are widely used in production processes at SIBUR, and data is constantly generated in our production circuits: equipment, sensors, all kinds of automation, and IoT devices send data 24/7 that needs to be transmitted, collected, processed, and stored.
It is important for SIBUR to monitor the quality of this data, since business profitability depends on it. That is why the company allocates resources to developing the DQ service for the DWH.
The essence of our DQ service can be seen in this diagram; we post it here as a teaser and will dive into the details of its architecture in the next article.
In this article, we will talk about the burning issue of import substitution for the solutions we used in our DQ service before the vendor's departure.
Urgent import substitution
Before the well-known events of 2022, we built DQ on SAS Data Quality. With SAS we had a data quality checking process in place, we had defined areas of responsibility, and we were planning to start training business colleagues to interact with the service. Checks on incoming data were set up by a quality engineer working on Vertica. By February, we had begun implementing SAS Decision Management, a tool for self-service data quality checks.
But what happened, happened, and we began looking for a replacement. We needed to find something that would meet our requirements and would not disappear overnight due to sanctions, or switch to a subscription that we would be unable to pay for from Russia.
Market analysis
The first thing we encountered was the lack of ready-made commercial domestic solutions for our stack on the Russian market. So calling what we did "import substitution" requires some reservations: we replaced, yes, but not with domestic solutions, rather with what we could find.
The situation was complicated by the fact that we did not know how long it would take to solve the problem under such conditions. Nor could we take advantage of the experience of colleagues from Russian companies who used OpenMetadata: it needed a lot of finishing work, and we already had our own working catalog.
First steps
We conducted a market analysis, and the most suitable solution for our needs turned out to be the open-source Great Expectations (GX) library, written in Python.
The advantages of GX that were important to us:
A large set of checks, including the ability to write custom ones.
A high degree of flexibility in settings and scenarios.
A fairly lively community.
Free access.
After weighing the pros and cons, we chose GX.
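To give a sense of what a check looks like, here is a minimal sketch using the classic pandas-oriented GX API of the era we describe (the table and column names are hypothetical, and the exact calls vary between GX versions):

    import great_expectations as ge
    import pandas as pd

    # Hypothetical sensor readings arriving at the warehouse
    df = pd.DataFrame({"temperature": [120.5, 118.2, None, 3000.0]})

    # Wrap the DataFrame so that expectation methods become available on it
    batch = ge.from_pandas(df)

    # Declarative checks: tolerate up to 5% missing values, require a plausible range
    batch.expect_column_values_to_not_be_null("temperature", mostly=0.95)
    batch.expect_column_values_to_be_between("temperature", min_value=-50, max_value=500)

    # Run all registered expectations and get a machine-readable report
    results = batch.validate()
    print(results.success)

The same expectations can also run against a table behind a SQLAlchemy connection, which is how GX reached databases such as Vertica in setups like ours.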
And straight onto the rake
We already had a team developing a data catalog, and we started developing the DQ tool as a microservice for that catalog. Not on our own, however, but by hiring contractors.
SIBUR, as a large company, can afford to have a separate team to develop its own services. But amid the chaos, we decided to bring in contractors with relevant experience.
As is now clear, outsourcing the development of the microservice was a mistake on our part. Not that the contractors did a poor job: they developed everything well. But contracts sooner or later run out, and we had to say goodbye to them.
There were problems during the handover, and we ended up losing some of the development expertise. Over time we managed to rebuild that expertise, but it took time.
However, the troubles did not end there. If the story with the contractors falls into the category of risks that we recognized and assessed as "possible" when making the decision, then the events that unfolded next fell into the category of "it is unlikely this could ever happen."
Pure open source or a fork
The fact is that working with open-source libraries has its nuances, associated with updates. This is a classic story with open-source solutions: they keep evolving, and various security vulnerabilities and architectural flaws are discovered along the way.
The communities behind these solutions can also take development in a direction we do not need, and it is quite possible that after the next update some template checks or services will stop working for us. And that is exactly what happened.
This has never happened before, and here it is again
At some point, we discovered that new versions of GX had lost support for Vertica and Oracle, which is critical for us. And the developers stopped responding to the many questions of "When will this all end?" because they were focused on developing their cloud-based subscription product.
That left us with a choice:
Use the old version of GX, which still supports the solutions we need.
Make a fork and adapt it to our needs.
Look for a replacement.
The problem with the first option is that sooner or later we would have to fork anyway, because we want to connect more systems, to scale, to optimize, and so on.
And the pipeline for maintaining a fork is a non-trivial story. If even the developers of the original did not get Vertica working in new versions, our chances of implementing it were clearly lower. Besides, we did not have the resources to build, standardize, and support such a solution ourselves. SIBUR is a big company, but not that big.
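For completeness, the first option would boil down to freezing the dependency, roughly like this (the version numbers here are illustrative, not a verified recommendation):

    # requirements.txt: pin GX to an old release that still worked for us
    great-expectations==0.15.50
    # GX talks to Vertica through SQLAlchemy, so the dialect package is pinned too
    sqlalchemy-vertica-python==0.5.10

A pin like this keeps the lights on, but it also freezes every bug and vulnerability along with the features, which is exactly why this option only postpones the fork.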
Choosing a solution
We began to analyze what was on the market. Again. After talking with colleagues from other companies, we realized that we were not alone in the GX story and that others had similar problems.
Arenadata Catalog is popular in Russia; its data quality checks are also based on GX. Its developers took the second path and made a fork. But theirs is a company focused on developing a catalog and data quality checks, while our profile is somewhat different.
And since solutions based on Great Expectations no longer support Vertica and Oracle, they, as mentioned above, do not suit us. So we began to analyze other popular solutions.
We realized that many solutions focus on checking data in real time, during loading, whereas at SIBUR we validate data that has already been collected. In some cases, solutions work directly with Spark, which does not fit our stack. We discarded some tools because their communities were not lively enough, and others did not suit us for a variety of reasons.
Soda vs Great Expectations
Our shortlist included the DQ tool SODA, which is open source. We analyzed SODA, compared it to GX, and concluded that it would be the better choice: it is easier to maintain, and it connects to Vertica and Oracle, which is exactly what we need.
SODA has no problems with connections: in GX, the connection is made via SQLAlchemy, whereas SODA takes a different approach, with separate connectors written for each database.
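To illustrate the difference, connecting SODA to Vertica means installing a dedicated connector package and describing the data source in a YAML configuration. A sketch, with placeholder host and credentials, and key names as we understand them from the Soda Core documentation:

    # pip install soda-core-vertica  (soda-core-oracle exists for Oracle)
    # configuration.yml
    data_source vertica_dwh:
      type: vertica
      host: vertica.example.local
      port: 5433
      username: dq_user
      password: ${VERTICA_PASSWORD}
      database: dwh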
In general, we tested, analyzed, and consulted with colleagues in other companies, and realized that many were switching to SODA. It has an active community; people use it and report no problems.
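Here is a minimal sketch of what running a scan looks like from Python, with checks written in Soda's SodaCL language (the data source, table, and column names are hypothetical):

    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name("vertica_dwh")

    # Connection settings from the configuration file shown above
    scan.add_configuration_yaml_file("configuration.yml")

    # SodaCL checks: declarative quality rules for a hypothetical table
    scan.add_sodacl_yaml_str("""
    checks for sensor_readings:
      - row_count > 0
      - missing_count(temperature) = 0
      - min(temperature) > -50
    """)

    # Execute the scan; a non-zero exit code signals failed checks
    exit_code = scan.execute()
    print(exit_code, scan.get_scan_results())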
Instead of a conclusion
Our current position can be seen in this graph. It is funny, but there seems to be some truth in it: the more experience you gain working with data, the more you come to the conclusion that checking with SQL scripts and developing your own DQ tool from scratch is the more correct path.
There are risks in using open source, and they will not go away, as we learned from our own experience when the developer:
removed the functionality we needed;
moved toward a paid model.
It took us a year to try all these specifics for ourselves and to understand what did not suit us and where we want to move next.
We are currently migrating from GX to SODA, since this option suits us best. But risks remain here too: the developer may stop supporting the solution, something may break or move behind some kind of paid subscription, and we will not be able to buy it, because we are in Russia.
Also, do not forget that SODA is developed by a company based in the West, so we may find ourselves in a difficult situation again. If this happens, we will look for something on our market or develop our own solution.
But developing our own solution does not look like an attractive idea to us, as it does not to many other companies. So for now, the remaining path is implementing open-source products.
Summing up the results of this import substitution, we can say that we did everything possible to avoid a situation where something breaks and we have to redo everything dramatically. At least, we tried to organize the architecture with that in mind: we are building a multi-layer DWH on an open-source stack and introducing Data Quality tools there, also based on open-source solutions.
We will discuss the concept and architecture of our DQ solution, which can be applied in any company, in the next article. If you have questions on this topic, ask them in the comments.