How bad fuel slows down business intelligence systems
I wrote this note to explore the issue of data quality. In my work I deal with quality control, mostly of physical objects (for example, welds), but in some cases also of the data collected about those products: completeness of coverage, absence of gaps, compliance with regulatory documents.
A survey published in the summer of 2021 presents responses from employees of more than 230 companies working with data: analysts, data scientists, developers, and managers. Several important conclusions can be drawn from the results about pain points and trends in the tooling used for data management.
One of these pain points, as expected, is ensuring data quality. Even an elegantly designed mechanism, whether a financial accounting system, a traffic planner, or the human brain, inevitably loses its effectiveness when fed incorrect, distorted, or incomplete input data. You can read about the brain-related cases in the books of Oliver Sacks; the rest come up in almost every talk at conferences such as DataQuality.
Whether you are building a nuclear power plant or a gas pipeline, or building analytics, reporting, forecasting, and customer management systems, you need to budget for quality assurance, including the quality of the data fed into business intelligence systems. As David Taber noted, "data validation is a cost line that exists as long as the data exists."
Meanwhile, according to a long-standing study by Larry English (Larry English. Information Quality Applied: Best Practices for Improving Business Information, Processes and Systems, John Wiley & Sons, 2009), 15 to 35% of an organization's annual budget is spent inefficiently because of poor data quality. In service-oriented organizations (such as banks, insurance companies, and government agencies), the losses reach 40%.
What conclusions can be drawn from the survey data?
quality and reliability are the main KPIs of teams working with data;
failures of data-driven IT implementations deserve their own FailConf;
experts complain about too much manual work.
Quality and reliability are the main KPIs of the data team
This conclusion follows from the answers to the survey question "What KPIs are used in your work?" Improving data quality comes first, followed closely by data availability for all stakeholders and by improved communication and collaboration.
Performance-improvement tasks did not even make the top five KPIs. It turns out that, as often happens, technology is ahead of the people: the ones pouring data into the "lake".
Notably, three-quarters of data quality problems come to the team from outside: from third parties or from other teams. Among other things, problems with the quality and ordering of data arise when businesses merge (for example, when one blue bank absorbed another Moscow bank) and a single data warehouse had to be built.
Exactly the same problem arises in the interaction of, say, two digitalized ministries: for the data to flow, both the tax authorities and the security services must speak the same language.
At the same time, more than half of respondents report that they have neither procedures nor tools for data quality control at hand. As a result, data quality checks are performed manually, that is, informally and unpredictably.
FailConf: failures due to data quality
One big failure, such as incorrect forecast results or errors in reporting, is enough to discourage management from investing in data analysis, and to make users distrust system output and double-check everything manually.
For example, a transportation planning system may show aircraft loading at 130% of nominal capacity and consider this normal.
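An absurdity like the 130% case can be caught by a simple range rule applied before the record enters planning. A minimal sketch in Python; the function name, fields, and limits are hypothetical, not taken from any real system:

```python
def check_load_factor(load_kg: float, max_payload_kg: float) -> list:
    """Return a list of data-quality violations for one flight record."""
    issues = []
    if max_payload_kg <= 0:
        issues.append("max payload must be positive")
    elif load_kg / max_payload_kg > 1.0:
        issues.append(
            f"load factor {load_kg / max_payload_kg:.0%} exceeds nominal capacity"
        )
    return issues

# A record claiming 130% loading should be flagged, not silently planned.
print(check_load_factor(26_000, 20_000))  # → ['load factor 130% exceeds nominal capacity']
print(check_load_factor(10_000, 20_000))  # → []
```

The point is not the arithmetic but the placement: such rules must run at the boundary where data enters the system, not inside the analyst's head.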
Some of the classic data quality issues are:
errors in data formats (confusion over units of measurement and over date and time formats; there is a well-known case of a satellite flight program failing because of an incorrect signed-to-unsigned conversion when transferring data between subsystems);
employees entering initial data carelessly, just to tick the box (copy-paste);
absence or incompleteness of shared dictionaries;
the problem of translating between human languages (how, for example, to correctly render the Russian "OOO": as "OOO", "LLC", "Ltd", or "LLP"?);
lack or insufficiency of auditing of historical data (for example, a change in a counterparty's legal status);
checking whether data may be transferred to the outside world (personal and other confidential data).
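The first item on the list, format and unit confusion, is the easiest to guard against with automated checks at ingestion. A minimal sketch, assuming (hypothetically) that dates arrive as strings in one of a few known formats and lengths arrive with an explicit unit tag:

```python
from datetime import datetime

# Accepted date formats: ISO and the common day-first format. Anything else
# is treated as garbage rather than guessed at.
ACCEPTED_DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y")

def parse_date(raw: str):
    """Try each accepted format; return a date, or None if the value is unparseable."""
    for fmt in ACCEPTED_DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None  # caller should quarantine the record, not invent a date

def to_millimeters(value: float, unit: str) -> float:
    """Normalize a length to one unit before it enters the warehouse."""
    factors = {"mm": 1.0, "cm": 10.0, "m": 1000.0, "in": 25.4}
    if unit not in factors:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * factors[unit]

print(parse_date("31.12.2021"))   # → 2021-12-31
print(parse_date("12/31/2021"))   # → None (unknown format, quarantine it)
print(to_millimeters(2.5, "in"))  # → 63.5
```

Normalizing to a single unit at the boundary is precisely what would have prevented the famous unit-confusion failures: after ingestion, the rest of the system never sees mixed units at all.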
Challenges in trying to use AI to diagnose COVID-19
Everyone knows the garbage in, garbage out principle as it applies to machine learning. Yet even for a task as urgent as COVID diagnostics, there were data quality problems (from a review):
Chest scans of children who had never had COVID were used as examples of what non-COVID cases look like. As a result, the AI learned to identify children, not COVID.
The scan data included patients in the supine position (such patients were more likely to be seriously ill). As a result, the AI learned to predict the severity of coronavirus risk from the patient's position, and if the patient was standing, the severity of the lung damage was simply ignored.
The AI learned to key on the text font that different laboratories used to label images. As a result, fonts from hospitals with a higher caseload became predictors of coronavirus infection risk. The labeling should, of course, have been removed when preparing the data.
Medical images were labeled according to how the radiologist identified them, not according to PCR results, which introduced undesirable bias into the original dataset.
Adequate digital twins
The ambitious task of creating digital twins of industrial enterprises, let alone of a digital state, without data quality control plants a veritable time bomb.
It will be difficult to find the culprits if, in a risk-based inspection (RBI) system for an oil refinery, an error in the recorded nominal wall thickness leads to a miscalculated interval until the next diagnostics of one of the pipes, and it is precisely that pipe that bursts in operation. This, too, is a data quality issue.
Imagine a finely tuned product cost calculation system which, of course, relies on information about labor costs. All the analyst's work goes down the drain if employees log their working hours in Redmine once a week, off the top of their heads. The result is some bolt costed at tens of thousands of rubles.
Another problem is that once data starts moving between storages, or ends up in a blockchain, it becomes difficult to identify low-quality data in time. The challenge, therefore, is to clean and fix data as quickly as possible, before it spreads and replicates to different places.
Too much manual work
The most common answer to the question "What are the biggest obstacles to improving the efficiency of your work?" was "Too much manual work."
Manual work is required when entering and cleaning data, checking dependencies, and editing and testing ETL code. A typical task is cleaning up lists of customer addresses written in different formats.
At the same time, about half of respondents use no special tools for data governance and quality control, and a third use home-grown tools. More than half check data quality themselves, another quarter ask colleagues for advice, and only 12% are confident in the quality of their data.
Say a word about data quality
The statistics above are taken mainly from the survey of employees of 230 US-based companies.
What about us? I invite you to share in the comments examples of data-project failures caused by dirty data, as well as success stories of cleaning and correcting such data.
On the one hand, isn’t this problem far-fetched?
On the other hand, is it possible to create a universal data quality control tool that works out of the box?