If you’re fixing the model’s bias, it’s already too late


Machine learning is a once-in-a-generation technological breakthrough. But as its popularity has grown, algorithmic bias has become a central problem. If ML models are not trained on representative data, they can develop serious biases that significantly harm underrepresented groups and lead to ineffective products. We studied CoNLL-2003, the standard dataset for building algorithms that recognize named entities in text, and found that it carries a serious bias towards male names. With our technology, we were able to compensate for this systematic error:

  1. We enriched the data to reveal hidden biases
  2. We supplemented the dataset with underrepresented examples to compensate for the gender bias

A model trained on our expanded CoNLL-2003 dataset shows both reduced bias and improved accuracy, demonstrating that bias can be eliminated without any modification to the model. We have published our Named Entity Recognition annotations for the original CoNLL-2003 dataset, as well as for its improved version, as open source; you can download them from our website.



Algorithm Bias: AI’s Weakness

Today, thousands of engineers and researchers are building self-learning systems that achieve significant breakthroughs: improving road safety with self-driving cars, treating disease with AI-optimized procedures, and combating climate change with smarter energy management.

However, the strength of self-learning systems is also their weakness. Since the foundation of all machine learning processes is data, training on imperfect data can lead to distorted results.

AI systems wield great power, and so they can cause great damage. The recent protests against the police brutality that led to the deaths of George Floyd, Breonna Taylor, Philando Castile, Sandra Bland and many others are an important reminder of the systemic inequalities in our society, which AI systems must not exacerbate. And we know of numerous examples (image search results that reinforce gender stereotypes, offender risk assessment systems that discriminate against Black defendants, facial recognition systems that misidentify people of color) showing that there is a long way to go before the problem of AI bias is solved.

The prevalence of such errors comes from how easily they are introduced. They find their way even into the open-source “gold standard” models and datasets that have become the foundation of a huge amount of ML work. A widely used sentiment analysis dataset is skewed by ethnicity, and word2vec word embeddings (a common way for ML algorithms to represent words as numeric values) encode heavily distorted assumptions about the occupations women are associated with.

The problem (and at least part of the solution) lies in the data. To illustrate this, we conducted an experiment with one of the most popular datasets for building named entity recognition systems in text: CoNLL-2003.

What is “named entity recognition”?

Named-entity recognition (NER) is one of the cornerstones of natural language models; without it, online search, information extraction, and sentiment analysis would be impossible.

Our company’s mission is to accelerate the development of AI, and natural language is one of our main areas of interest. Our Scale Text product includes NER, which annotates text according to a given list of labels. In practice this can, among other things, help large retail chains analyze online discussion of their products.

Many NER models are trained and benchmarked on CoNLL-2003, a dataset of approximately 20,000 sentences from Reuters news articles, annotated with labels such as “PERSON”, “LOCATION”, and “ORGANIZATION”.
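To make the dataset’s structure concrete, here is a small sketch of how CoNLL-style BIO tags are grouped into entity spans. The sentence is the well-known example from the CoNLL-2003 task description; the helper function is our own illustration, not part of the dataset’s official tooling (and the official files use the short tags PER/LOC/ORG rather than the full label names).

```python
# Illustrative CoNLL-style annotation: each token carries a BIO tag
# such as B-PER (beginning of a person name) or I-ORG (inside an org name).
conll_sentence = [
    ("United", "B-ORG"), ("Nations", "I-ORG"), ("official", "O"),
    ("Ekeus", "B-PER"), ("heads", "O"), ("for", "O"),
    ("Baghdad", "B-LOC"), (".", "O"),
]

def extract_entities(tagged_tokens):
    """Group consecutive B-/I- tags into (entity_text, label) spans."""
    entities, current_tokens, current_label = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities

print(extract_entities(conll_sentence))
# [('United Nations', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```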

We wanted to examine this data for systematic errors. Using our labeling pipeline, we categorized every name in the dataset as male, female, or gender-neutral based on the name’s traditional usage.

We found a significant imbalance: male names appeared almost five times more often than female names, and fewer than 2% of names were gender-neutral.
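The tally behind such a distribution can be sketched as follows. The name lexicons here are tiny hypothetical stand-ins; the actual pipeline used human annotation of traditional name usage, not a fixed lookup table.

```python
from collections import Counter

# Hypothetical gender lexicons (the real categorization was done by
# human annotators in a labeling pipeline, not by table lookup).
MALE_NAMES = {"John", "Michael", "David"}
FEMALE_NAMES = {"Mary", "Linda", "Susan"}

def gender_of(name):
    """Classify a PERSON entity by the traditional usage of its first name."""
    first = name.split()[0]
    if first in MALE_NAMES and first in FEMALE_NAMES:
        return "neutral"
    if first in MALE_NAMES:
        return "male"
    if first in FEMALE_NAMES:
        return "female"
    return "unknown"

# Toy list standing in for all PERSON entities extracted from the dataset.
extracted_person_names = ["John Smith", "Mary Jones", "David Brown", "Michael Lee"]
counts = Counter(gender_of(n) for n in extracted_person_names)
print(counts)  # Counter({'male': 3, 'female': 1})
```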

This happens because, for societal reasons, news articles mostly contain male names. But as a result, a NER model trained on such data will be better at picking out male names than female ones. For example, search engines use NER models to classify names in search queries and provide more accurate results. If a skewed NER model is deployed, the search engine will identify female names less reliably than male ones, and it is precisely this kind of subtle, pervasive bias that can creep into many real-world systems.

New experiment to reduce systematic error

To quantify this, we trained a NER model and explored how the gender bias would affect its accuracy. We built a name extraction algorithm that selects PERSON labels using a popular NLP library, and trained the model on a subset of the CoNLL data. We then tested the model on names from the test data that were not present in the training data, and found that the model was 5% more likely to miss a new female name than a new male name, a large discrepancy in accuracy.
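The comparison above boils down to a per-gender miss rate on held-out names. A minimal sketch of that metric, with entirely hypothetical predictions standing in for real model output:

```python
# `predictions` maps each held-out name to whether the (hypothetical)
# model tagged it as PERSON; a miss is a name the model failed to tag.
def miss_rate(predictions):
    misses = sum(1 for found in predictions.values() if not found)
    return misses / len(predictions)

# Toy numbers for illustration only, not the actual experiment results.
male_preds = {"Kenneth": True, "Steven": True, "Edward": True, "Brian": False}
female_preds = {"Dorothy": True, "Amanda": False, "Melissa": False, "Deborah": True}

gap = miss_rate(female_preds) - miss_rate(male_preds)
print(f"miss-rate gap: {gap:.2f}")  # 0.25 with these toy numbers
```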

We observed similar results when we applied the model to the template “NAME is a person”, substituting the 100 most popular male and female names from each US Census year. The model performed significantly worse on female names across all census years.
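The template test can be sketched like this. `recognizes_person` is a stand-in for the real NER model, which we do not reproduce here; the stub below only "knows" a small vocabulary of names, mimicking the effect of a skewed training set.

```python
# Template evaluation: generate "NAME is a person" sentences and measure
# how often the model tags NAME as PERSON.
def template_accuracy(names, recognizes_person):
    sentences = [f"{name} is a person" for name in names]
    hits = sum(recognizes_person(s) for s in sentences)
    return hits / len(sentences)

# Toy stub predictor: recognizes only names seen "in training".
KNOWN = {"James", "Robert", "Mary"}
stub_model = lambda sentence: sentence.split()[0] in KNOWN

male_acc = template_accuracy(["James", "Robert", "William"], stub_model)
female_acc = template_accuracy(["Mary", "Patricia", "Jennifer"], stub_model)
print(male_acc, female_acc)  # male accuracy is higher (2/3 vs 1/3)
```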

Critically, skew in the training data tends to bias errors towards the underrepresented categories.

The census experiment demonstrates this in another way as well: the model’s accuracy degrades significantly after 1997 (the cutoff date for the Reuters articles in the CoNLL dataset) because the dataset is no longer representative of name popularity in each successive year.

Models learn to follow the trends of the data they are trained on; they cannot be expected to be accurate on kinds of examples they have rarely seen.

If you’re correcting a model’s systematic error, it’s already too late.

How to fix it?

One way is to try to eliminate bias at the model level, for example by post-processing the model’s outputs or by adding an objective function that mitigates the skew while leaving the rest of the model unchanged.

But this is not the best approach for a variety of reasons:

  1. Fairness is a very complex problem, and we cannot expect an algorithm to solve it on its own. Research has shown that forcing an algorithm to the same level of accuracy for all subsets of the population does not ensure fairness and can harm model training.
  2. Adding new objective functions can hurt the model’s accuracy as a side effect. It is better to keep the algorithm simple and the data balanced, which improves accuracy and avoids negative effects.
  3. It is unreasonable to expect a model to perform well in cases where it has seen very few examples. The best way to ensure good results is to increase the diversity of the data.
  4. Trying to engineer away systematic error is an expensive and time-consuming process. It is far cheaper and easier to train models on undistorted data in the first place, freeing engineers to work on deployment.

Data is only one part of the problem of systematic errors, but it is the fundamental part, and it affects everything downstream. That is why we believe data holds the key to at least a partial solution: systematic improvements at the source.

If you don’t label critical classes (such as gender or ethnicity) explicitly, then it is impossible to verify that those classes are not a source of bias.

This situation is counterintuitive. It seems that if we need to build a model that does not depend on sensitive characteristics like gender, age, or ethnicity, then it is better to exclude these properties from the training data so that the model cannot take them into account.

However, the principle of “fairness through unawareness” actually exacerbates the problem. ML models are very good at inferring features, and they do not stop doing so just because we haven’t explicitly labeled those features. The systematic errors simply go undetected, making them harder to fix.

The only reliable way to solve the problem is to label more data and balance the distribution of names. We used a separate ML model to identify sentences in the Reuters and Brown corpora with a high probability of containing female names, and then annotated these sentences in our NER pipeline to supplement CoNLL.
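A simplified sketch of that augmentation step: the real pipeline used a separate ML model to score candidate sentences, whereas here a hypothetical name list stands in for that scorer, and the sentences are invented examples rather than actual corpus text.

```python
# Stand-in for the ML scorer that flagged sentences likely to contain
# female names in the Reuters and Brown corpora.
FEMALE_FIRST_NAMES = {"Mary", "Patricia", "Linda"}

def likely_has_female_name(sentence):
    return any(tok in FEMALE_FIRST_NAMES for tok in sentence.split())

# Invented sentences standing in for the source corpora.
corpus = [
    "Linda Green joined the board on Tuesday .",
    "Oil prices fell sharply in early trading .",
    "Mary Johnson was named chief executive .",
]

# Selected candidates would then be annotated in the NER pipeline and
# appended to the original CoNLL training set.
candidates = [s for s in corpus if likely_has_female_name(s)]
print(candidates)
```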

The resulting dataset, which we call CoNLL-Balanced, contains over 400 additional female names. After retraining the NER model on it, we found that the algorithm no longer shows a systematic drop in performance when recognizing female names.

Moreover, the model’s performance on male names also improved.

This was an impressive demonstration of the importance of data. By eliminating the skew in the source material, we avoided making any changes to the ML model itself, saving development time. And we achieved this without hurting the model’s accuracy; in fact, it even increased slightly.

To let the development community build on our work and remove gender bias from models built on top of CoNLL-2003, we have published the augmented dataset, including the added gender annotations, as open source on our website.

Bias remains a difficult problem for the AI/ML community, but these results make us cautiously optimistic. They hint that we may be able to offer a technical solution to a pressing social problem if we tackle it head-on, uncovering latent systematic errors and improving the model’s accuracy for everyone.

We are now exploring how this approach can be applied to another critical attribute, ethnicity, in order to work out how to build a robust dataset-deskewing system that extends to other groups protected from discrimination.

It also shows why our company pays so much attention to data quality. If data cannot be shown to be accurate, balanced, and free from bias, there is no guarantee that models built on it will be safe and accurate. And without that, we will not be able to create qualitatively new AI technologies that benefit everyone.


The CoNLL-2003 dataset referenced in this post is based on the Reuters-21578, Distribution 1.0 test set, available for download from the original 2003 experiment design page.


