Data quality: How to avoid subtle mistakes that lead to disaster

Today, almost every business works actively with data. Everything from advertising strategies to AI training starts and ends with data sets. For a business, inaccurate, irrelevant, or incomplete data means missed opportunities, lost money, or competitors who have moved into the gap.

When «bad» data steals revenue

Data quality problems can arise at almost every stage of working with data, from entry into the system to storage and use. The most common cause of errors is the human factor. When entering data manually, employees break formats, introduce inaccuracies, and fail to keep entries uniform. One operator writes «Moscow city», another writes «Moscow», and so on.
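A minimal sketch of how the «Moscow city» versus «Moscow» problem can be handled at entry time. The alias table and the function name are hypothetical; a real system would load the mapping from a maintained reference list rather than hard-code it.

```python
import re

# Hypothetical alias table: maps known spelling variants to one canonical name.
CITY_ALIASES = {
    "moscow city": "Moscow",
    "city of moscow": "Moscow",
    "moscow": "Moscow",
}

def normalize_city(raw: str) -> str:
    """Return the canonical city name, or the cleaned input if it is unknown."""
    # Collapse repeated whitespace and trim the ends before the lookup.
    cleaned = re.sub(r"\s+", " ", raw).strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)

print(normalize_city("  Moscow   city "))  # Moscow
print(normalize_city("Kazan"))             # Kazan (unknown values pass through)
```

Unknown values are passed through rather than rejected, so the function never silently loses data; whether to flag them for review is a separate decision.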

Problems can also hide in the gaps between information systems, for example when solutions built at different times, with different logic, store the same entities in different ways.

Technical failures are another invisible trap: dates mixed up during import, lost time zones, format inconsistencies. For example, the «American» date 01/03 means January 3, but in Russia it is read as March 1.
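The date ambiguity is easy to reproduce. The sketch below parses the same string with two explicit formats; the year 2024 and the function name are assumptions added for illustration. The point is that the format should be declared by the source system, never guessed during import.

```python
from datetime import datetime

def parse_date(raw: str, source_format: str) -> str:
    """Parse a date using the format declared by the source; return ISO 8601."""
    return datetime.strptime(raw, source_format).date().isoformat()

# The same string yields two different dates depending on the declared format.
print(parse_date("01/03/2024", "%m/%d/%Y"))  # 2024-01-03 (US month/day order)
print(parse_date("01/03/2024", "%d/%m/%Y"))  # 2024-03-01 (day/month order)
```

Storing the result in ISO 8601 form (year-month-day) removes the ambiguity for every downstream consumer.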

At the same time, errors in data often go unnoticed until they lead to significant financial losses.

Here is a typical case: a company’s analytical system predicts customer churn based on support requests. The data the system was trained on included request statuses, and most requests carried the mark «resolved». In fact, they were «resolved» only formally: the operator simply ticked a box to close the case. As a result, a model trained on such data does not notice the threat, and the client who never received help leaves. The business loses a client without understanding why.

The second example from our practice is purchase forecasting at an FMCG retailer. The model was supposed to predict demand, but the sales history contained duplicate products, fictitious stock balances, and incorrect expiration dates. For example, the product «shampoo 400 ml» had been entered under two IDs, although it was the same product. The system had no way of telling that these were duplicates. Forecasts drifted, the warehouse was overstocked with some items, and others were out of stock.
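Duplicates like the two «shampoo 400 ml» entries can often be surfaced with a simple grouping pass. This is a sketch under the assumption that duplicates differ only in letter case and spacing; the SKU values are invented, and real catalogs usually need fuzzier matching on brand, volume, and barcode.

```python
from collections import defaultdict

def find_duplicate_products(products):
    """Group product IDs by a normalized name; return groups with several IDs."""
    groups = defaultdict(list)
    for product_id, name in products:
        # Normalization here is only lowercase plus collapsed whitespace.
        key = " ".join(name.lower().split())
        groups[key].append(product_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

# Hypothetical catalog rows: (product ID, product name).
catalog = [
    ("SKU-001", "Shampoo 400 ml"),
    ("SKU-187", "shampoo  400 ML"),
    ("SKU-002", "Conditioner 400 ml"),
]
print(find_duplicate_products(catalog))
# {'shampoo 400 ml': ['SKU-001', 'SKU-187']}
```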

After an MDM system was introduced, the history cleaned, and validation rules configured, forecast accuracy increased by 17% and write-offs decreased by 25%. The tangible growth in business metrics came not from improving the model but from improving the quality of the input data. This is a classic example of data mattering more than the algorithm.

Quality criteria

The basic difficulty in assessing data quality is that it cannot be done out of context. There is no universal scale like the one on a thermometer. The same set of information can be a valuable asset in one process and useless ballast in another. Data collected for logistics, where the address and delivery time matter, will be uninformative for marketing if it does not contain the client’s age, purchase history, and current phone number. This is the rule rather than the exception: in real life, data is «good» only when it is fit for purpose.

However, there are also general quality criteria. The first is accuracy. If a customer’s number is recorded incorrectly, the system may send a notification to the wrong person. If a payment amount is incorrect, the CRM or BI system will show distorted revenue.

Another common criterion is data completeness. Missing values affect every stage, from calculating KPIs to the correct operation of a recommendation model.

Timeliness is another fundamentally important aspect. There is nothing more dangerous than beautiful but outdated analytics built on data that describes the situation as it was two years ago.

Consistency also matters (the absence of contradictions between sources), as do validity (compliance with the required formats and permitted values) and, of course, uniqueness (the absence of duplicates).
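Several of these criteria can be expressed as simple ratios over a data set. The sketch below computes completeness, validity, and uniqueness for a list of customer records. The field names, the phone format, and the choice of «customer_id» as the uniqueness key are all assumptions made for illustration.

```python
import re

# Assumed validity rule: a plus sign followed by exactly 11 digits.
PHONE_PATTERN = re.compile(r"\+\d{11}")

def quality_report(records, required_fields):
    """Return the share of complete, valid, and unique records (0.0 to 1.0)."""
    if not records:
        raise ValueError("cannot score an empty data set")
    total = len(records)
    # Completeness: every required field is present and non-empty.
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    # Validity: the phone field matches the permitted format.
    valid = sum(bool(PHONE_PATTERN.fullmatch(r.get("phone") or "")) for r in records)
    # Uniqueness: distinct customer IDs relative to the number of rows.
    unique_ids = len({r.get("customer_id") for r in records})
    return {
        "completeness": complete / total,
        "validity": valid / total,
        "uniqueness": unique_ids / total,
    }
```

Which thresholds count as acceptable depends on the purpose of the data, so the function only reports ratios and leaves the verdict to the caller.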

In general, data quality in business today is defined by its fitness for solving a specific problem. This means that data must be not just correct but also up to date, complete, consistent, and suited to its purpose, be it reporting, analytics, or training AI models.

AI doesn’t forgive bias

The areas most sensitive to these criteria are analytics and AI. An AI model cannot be «smarter» than the data it was trained on. It does not correct mistakes; it systematizes them. If certain inputs are missing from the training set, the model simply will not take them into account. If there are distortions in the order history, the predictions become not just inaccurate but harmful. AI does not invent meaning; it reflects what it was given. And if you give it a distorted picture, that picture is what AI scales.

One of the most insidious enemies of data remains bias, especially in sensitive scenarios: recruiting, lending, and medicine. Bias is a systematic distortion in data that leads to unfair or erroneous conclusions.

Historical data reproduces old practices, including discriminatory ones. For example, if in previous decades a company hired only men for certain positions, the model will assume that a man is the «successful» candidate. Simply excluding obvious sensitive attributes such as gender or age is not always enough. Bias can hide in proxy data such as the postcode. This is a pressing problem in many Western countries, where zip-code discrimination is a real factor in hiring, lending, and insurance.

Rebalancing the data set helps here, for example by reducing the amount of data from the dominant group or by evening out the representation of different categories in the training set. Special metrics are also used to assess how evenly and fairly the model makes decisions across all groups of users.
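Both ideas can be sketched in a few lines. The first function undersamples every group to the size of the smallest one. The second and third compute per-group selection rates and the gap between them, one simple fairness metric often called the demographic parity difference. The field names «group» and «hired» are hypothetical, and this is only one of several possible rebalancing strategies and metrics.

```python
import random
from collections import defaultdict

def undersample(rows, group_key, seed=0):
    """Randomly downsample every group to the size of the smallest group."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_key]].append(row)
    target = min(len(members) for members in by_group.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for members in by_group.values():
        balanced.extend(rng.sample(members, target))
    return balanced

def selection_rates(rows, group_key, decision_key):
    """Share of positive decisions (1 or True) within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        positives[row[group_key]] += int(row[decision_key])
    return {group: positives[group] / totals[group] for group in totals}

def demographic_parity_gap(rates):
    """Difference between the highest and lowest group selection rates."""
    return max(rates.values()) - min(rates.values())
```

A gap close to zero means the groups receive positive decisions at similar rates; what gap is acceptable is a policy question, not something the code can decide.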

But these measures work only when applied deliberately. The goal is not just to improve the model’s accuracy but also to give it a correct view of the world, without the distortions and blind spots it could replicate and scale.

Control at the input, rescue at the output

Data quality control systems are evolving. Tools are already available that track deviations, run automatic validation, and monitor data on the fly. On streaming platforms, you can implement filters that reject known erroneous events before they reach storage. Advanced anomaly detectors work at the distribution level, catching suspicious outliers or illogical relationships.
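A filter of this kind can be sketched as a generator that sits between the stream and the storage layer. The event fields («order_id», «amount», «timestamp») and the specific rules are assumptions for illustration; a production filter would usually also route rejected events to a quarantine area instead of dropping them.

```python
from datetime import datetime, timezone

def is_valid_event(event, now=None):
    """Return False for events with known defects."""
    now = now or datetime.now(timezone.utc)
    amount = event.get("amount")
    ts = event.get("timestamp")
    if not event.get("order_id"):
        return False  # missing or empty identifier
    if not isinstance(amount, (int, float)) or amount < 0:
        return False  # non-numeric or negative amount
    if not isinstance(ts, datetime) or ts.tzinfo is None or ts > now:
        return False  # missing, timezone-naive, or future timestamp
    return True

def filter_stream(events, now=None):
    """Lazily yield only valid events, so it also works on unbounded streams."""
    for event in events:
        if is_valid_event(event, now):
            yield event
```

Requiring timezone-aware timestamps at this point is a deliberate choice in the sketch: it stops lost time zones from reaching storage, where they are much harder to repair.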

Master Data Management systems play a special role. An MDM system is a single «source of truth». One customer, one record. One product, one SKU. Regardless of channel, system, or interface. MDM harmonizes data, resolves conflicts, and makes records comparable. Without it, the struggle for quality turns into an attempt to fix chaos by hand.
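To make «one customer, one record» concrete, here is a minimal sketch of merging records for the same customer from two systems. The survivorship rule used (the most recently updated non-empty value wins) is an assumption chosen for illustration, not a description of how any particular MDM product works; the field names are hypothetical.

```python
def merge_golden_record(records):
    """Merge records of one entity; the latest non-empty value wins per field."""
    golden = {}
    # Process oldest first so that newer values overwrite older ones.
    for record in sorted(records, key=lambda r: r["updated_at"]):
        for field, value in record.items():
            if value not in (None, ""):
                golden[field] = value
    return golden

# Hypothetical records for the same customer in two systems.
crm = {"customer_id": 42, "name": "Anna Ivanova", "phone": "",
       "updated_at": "2024-01-10"}
shop = {"customer_id": 42, "name": "A. Ivanova", "phone": "+79161234567",
        "updated_at": "2023-11-02"}
print(merge_golden_record([crm, shop]))
```

Here the newer CRM name replaces the older one, while the phone number survives from the shop record because the CRM value is empty.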

Evolution of technology and responsibility

Data processing technologies are undergoing a rapid transformation. What was possible yesterday only for the largest digital corporations is becoming an industry standard today. Automatic checks, constant quality monitoring, and semantic catalogs are no longer innovations but tools for everyday work.

One of the key changes in recent years has been the introduction of Data Contracts: formal agreements about data between teams. A Data Contract is not just a formal document but an engineering artifact, analogous to an API contract in software development. It specifies which fields must be present, in what format and how often values are transmitted, which deviations are allowed, and what counts as a critical error. A Data Contract makes data flows predictable and errors explicit. A contract violation is no longer ignored by the system, as it used to be: it immediately stops the pipeline or raises an alert. In the era of microservice architecture and Data Mesh (a decentralized approach to data ownership and architecture), such rigor is not a luxury but a necessity.
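A very small contract can be sketched directly in code. The example below covers only two of the aspects listed above, required fields and value types; the contract contents, class names, and field names are hypothetical. In practice contracts are usually kept in a declarative file and also cover frequency and tolerated deviations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldRule:
    dtype: type
    required: bool = True

# Hypothetical contract for an "orders" feed agreed between two teams.
ORDER_CONTRACT = {
    "order_id": FieldRule(str),
    "amount": FieldRule(float),
    "coupon": FieldRule(str, required=False),
}

class ContractViolation(Exception):
    """Raised so the pipeline stops instead of silently ignoring bad data."""

def enforce_contract(record, contract):
    """Return the record unchanged, or raise on the first violation found."""
    for name, rule in contract.items():
        value = record.get(name)
        if value is None:
            if rule.required:
                raise ContractViolation(f"missing required field: {name}")
            continue
        if not isinstance(value, rule.dtype):
            raise ContractViolation(
                f"{name}: expected {rule.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
    return record
```

Raising an exception rather than logging a warning mirrors the point made above: a violation is explicit and stops processing unless the caller deliberately handles it.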

In parallel, the infrastructure for automatic normalization is developing. Modern tools such as Great Expectations, Soda, and Monte Carlo, or the built-in capabilities of dbt, make it possible to run data profiling in real time. The system itself detects drift, for example when the share of one value in a category rises sharply or the field structure changes. This is no longer manual work for an analyst but a built-in mechanism that protects the business from an imperceptible deterioration in input quality.
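The «share of one value has increased sharply» check can be illustrated without any of the tools named above. The sketch compares category shares between a baseline batch and a current batch; the 0.15 threshold and the function names are assumptions, and the dedicated tools use more robust statistics than a plain difference of shares.

```python
from collections import Counter

def category_shares(values):
    """Return the share of each distinct value in the batch."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

def detect_share_drift(baseline, current, threshold=0.15):
    """Return categories whose share changed by more than the threshold."""
    base, cur = category_shares(baseline), category_shares(current)
    drifted = {}
    # The union of keys also catches categories that appeared or vanished.
    for category in base.keys() | cur.keys():
        delta = cur.get(category, 0.0) - base.get(category, 0.0)
        if abs(delta) > threshold:
            drifted[category] = round(delta, 3)
    return drifted

last_week = ["card"] * 70 + ["cash"] * 30
today = ["card"] * 40 + ["cash"] * 60
print(detect_share_drift(last_week, today))
```

In this example the share of «card» falls by 0.3 and the share of «cash» rises by 0.3, so both categories are reported.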

Along with this, the approach to cataloging is changing. Traditional reference directories are giving way to so-called semantic data catalogs: smart catalogs that connect tables and fields with business meaning. Such a catalog is a full-fledged map of the data: where it came from, how it was transformed, who is responsible for it, where it has been used, and how reliable it is. These catalogs show lineage, reflect quality metrics, and track which tables are used most often in analytics. Examples of such platforms are Atlan, Alation, and Collibra, as well as open-source solutions like DataHub and Amundsen, which are actively used in Russian companies. Within the ecosystems of large players such as Yandex and Sberbank, in-house data cataloging tools are also being developed. Such platforms are becoming an integral part of the analytical culture in large organizations, replacing the outdated idea of data as something purely technical.

Another significant trend is the integration of quality control into CI/CD pipelines. Data verification no longer happens «once a month» or «on demand». Whenever a process that loads, processes, or transforms data starts, tests run automatically. If critical deviations are found, the pipeline stops. This does more than make the system more reliable. It changes the engineering culture: data quality becomes as important a part of a product as its code or interface.
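A quality gate of this kind can be as simple as a script whose exit code the CI system inspects. The sketch below is a minimal version; the two checks and all function names are hypothetical. A non-zero return value is what a CI job would treat as a failed step that stops the pipeline.

```python
def check_no_null_ids(rows):
    """Every row must carry a non-null 'id'."""
    return all(row.get("id") is not None for row in rows)

def check_row_count(rows, minimum=1):
    """An empty load usually signals an upstream failure."""
    return len(rows) >= minimum

def run_quality_gate(rows, checks):
    """Return the names of all failed checks (an empty list means success)."""
    return [check.__name__ for check in checks if not check(rows)]

def main(rows):
    """Return a process exit code: 0 on success, 1 if any check failed."""
    failures = run_quality_gate(rows, [check_no_null_ids, check_row_count])
    if failures:
        print("Data quality gate failed:", ", ".join(failures))
        return 1
    return 0

# In a CI job this would typically end with: sys.exit(main(loaded_rows))
```

Collecting all failures before exiting, rather than stopping at the first one, gives the engineer the full picture in a single run.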

However, even with such progress, automation solves only part of the problem. A machine can catch an anomaly but cannot understand its context. It will point out that «something went wrong» but will not say why it happened or how critical it is. Whether a sharp change in the share of orders by category is a normal seasonal effect or a sign of an error in the report is a question that requires business interpretation by a person.

Not an IT concern, but a business interest

Perhaps the main barrier to a culture of data quality in a company is the division into «technical» and «business» areas. As long as management perceives data as a byproduct, responsibility will remain blurred. But when you start talking to managers in the language of losses, the situation changes.

Errors in discounts, wasted advertising budgets, and lost customers are often the consequences of poor-quality data. Visualizing these consequences is a powerful tool. One dashboard showing «before» and «after» cleaning, and the arguments end.

Assigning data owners in specific areas, integrating quality metrics into business reporting, and providing simple feedback mechanisms are all effective tools. A «report error» button in a report is more effective than a regulation in a PDF. When a department knows that it is responsible for its piece of data, real careful work begins.

Data quality is a matter of companies’ survival in an era of automation and AI’s growing role in business processes. Models are only as smart as their inputs. A BI report is only as accurate as the data in it. A decision remains effective only until the first piece of misinformation. Investing in data architecture, processes, and culture is a condition for staying in the game and keeping a chance of winning it.