AI still can’t moderate hate speech

But researchers have learned to pinpoint where these systems fail.


Gone are the cozy forums where human moderators made participants follow the rules and communicate in a civilized way. The era of mass social networks calls for different solutions. Today, artificial intelligence is being taught to tell one kind of abusive language from another, in line with modern ideas about fairness. On this topic, we are sharing a translation of a June MIT Technology Review publication about the HateCheck dataset.

Despite all the advances in artificial intelligence language technology, it still falls short on one of the most basic tasks. In a new study, scientists tested four of the best AI systems for detecting hate speech and found that all of them failed to distinguish toxic sentences from harmless ones, and each failed in its own way.

That is not surprising: it is hard to create an AI that understands the nuances of natural language. But what matters is how the researchers diagnosed the problem. They developed 29 different tests targeting different aspects of hate speech to pinpoint exactly where each algorithm fails, which makes it easier to understand how to overcome its weaknesses. The approach is already helping one service improve its system.

18 categories of hate

The study was led by scientists from the University of Oxford and the Alan Turing Institute. The authors interviewed employees of nonprofit organizations that work on online hate. The team used these interviews to create a taxonomy of 18 different types of hate speech, focusing only on written English. The list includes derogatory speech, insults, and threats.

The researchers also identified 11 non-hate scenarios that tend to confuse automatic moderators, including:

  • the use of profanity in harmless statements;

  • slurs that the targeted groups have begun to use about themselves (translator’s note: so-called “reclaimed slurs”);

  • statements condemning hate that quote or reference the original hateful messages (“counter-speech”).

For each of the 29 categories, the researchers wrote dozens of examples, using template sentences such as “I hate [IDENTITY]” or “You are just a [SLUR] to me”.

Identical sets of examples were created for seven groups protected from discrimination under US law. The team open-sourced the final dataset, called HateCheck, which contains nearly 4,000 examples.
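The template-and-placeholder construction described above can be sketched in a few lines. This is an illustrative reimplementation, not code from the actual study; the template texts and group names below are invented examples, not entries from the real HateCheck dataset.

```python
# HateCheck-style test-case generation: template sentences with an
# [IDENTITY] placeholder are expanded once per protected group.
# Templates and group names are illustrative, not from the dataset.

TEMPLATES = {
    "derogation": "I hate [IDENTITY].",
    "threat": "[IDENTITY] should be afraid of me.",
}

GROUPS = ["women", "immigrants", "Muslims"]  # illustrative subset

def expand(templates, groups):
    """Fill every template with every group, yielding labeled cases."""
    cases = []
    for category, template in templates.items():
        for group in groups:
            cases.append({
                "category": category,
                "group": group,
                "text": template.replace("[IDENTITY]", group),
            })
    return cases

cases = expand(TEMPLATES, GROUPS)
```

Because every group gets an identical set of sentences, any difference in a moderator’s scores across groups can be attributed to the group term itself rather than to the surrounding wording.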

Toxicity services

Researchers tested two popular services: the Perspective API, developed by Google Jigsaw, and SiftNinja from Two Hat. Both let customers flag inappropriate content in posts or comments. Perspective, in particular, is used to filter content on Reddit and by news organizations including The New York Times and The Wall Street Journal. The algorithm flags and prioritizes toxic messages for people to review later.

Of the two services, SiftNinja treats incitement to hatred too leniently, missing almost all of its variations, while Perspective moderates too aggressively. It successfully identifies most of the 18 categories of hate, but it also sees hate in quotes and counterarguments. The researchers found the same patterns when testing two scientific models from Google. These models represent the state of the art in available language AI and are likely to serve as the basis for other commercial content moderation systems.
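The per-category diagnosis that exposed these patterns can be sketched as a small scoring loop. The keyword “moderator” below is a deliberately naive stand-in, not Perspective, SiftNinja, or any real system, and the three labeled sentences are invented examples of three HateCheck-style categories.

```python
# Diagnose a moderator category by category, HateCheck-style:
# accuracy is computed per functional test rather than overall.
from collections import defaultdict

def naive_moderator(text):
    """Flag text as hateful if it contains a blocklisted word."""
    blocklist = {"hate", "vermin"}
    return any(word in text.lower() for word in blocklist)

# label True means the sentence really is hate speech
cases = [
    ("derogation", "I hate immigrants.", True),
    ("counter_speech", 'Saying "I hate immigrants" is unacceptable.', False),
    ("harmless_profanity", "That movie was damn good.", False),
]

def per_category_accuracy(cases, moderator):
    hits, totals = defaultdict(int), defaultdict(int)
    for category, text, label in cases:
        totals[category] += 1
        hits[category] += int(moderator(text) == label)
    return {c: hits[c] / totals[c] for c in totals}

scores = per_category_accuracy(cases, naive_moderator)
# Like Perspective, the keyword rule catches direct derogation but
# also flags the quoted hate inside the counter-speech example.
```

A single overall accuracy number would hide exactly this failure: the breakdown shows the model is perfect on derogation yet wrong on every counter-speech case.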

The results point to one of the most challenging aspects of AI hate speech detection. Moderate too little and you fail to solve the problem; overdo it and you can censor the very language that marginalized groups use to defend themselves. “All of a sudden, you are punishing the very communities that are most often the targets of hatred,” says Paul Röttger, a doctoral candidate at the Oxford Internet Institute and co-author of the paper.

Lucy Vasserman, lead software engineer at Jigsaw, says that Perspective overcomes these limitations by relying on human moderators to make the final decision, but that process does not scale to larger platforms. Jigsaw is now working on functionality that re-prioritizes posts and comments based on the model’s uncertainty: the system automatically removes content it is confident is hateful and shows questionable cases to people.

Vasserman says the new study provides a detailed assessment of the state of affairs. “A lot of the things that are highlighted in it, including reclaimed slurs, are a problem for these models. This is known in the industry, but hard to quantify,” she says. HateCheck should improve the situation.

Scientists are also encouraged by the study. “This gives us a good clean resource for evaluating system performance,” says Maarten Sap, a language AI researcher at the University of Washington. The new approach “allows companies and users to expect improvements.”

Thomas Davidson, an associate professor of sociology at Rutgers University, agrees. Given the limitations of language models and the complexity of language, there will always be a trade-off between under- and over-identifying hate speech, he says. “The HateCheck dataset sheds light on these trade-offs,” he adds.

Translation: Alexandra Galyautdinova

Other publications by Karen Hao translated by Madrobots

  • These strange, disturbing photos suggest AI is getting smarter.

  • Groundbreaking method lets you train AI with little to no data

  • How to sabotage the data tech giants use to spy on you


For Habr’s readers in the Madrobots gadgets store, there is a 5% discount on all products. Just enter the promo code: HABR
