February 14, 2020
Princeton University, Department of Engineering.
To address problems of bias in artificial intelligence, computer scientists have developed methods for obtaining fairer data sets containing images of people. The researchers propose improvements to ImageNet, a database of more than 14 million images that has played a key role in advancing computer vision over the past decade.
ImageNet, which includes images of objects, landscapes and, in particular, people, serves as a source of training data for researchers creating machine learning algorithms that classify images or recognize individual elements within them. ImageNet’s unprecedented scale required automated image collection and crowdsourced image annotation. Although the database's person categories have rarely been used by the research community, the ImageNet team has been working to address biases and other concerns about images of people that are unintended consequences of ImageNet's construction.
“Today, computer vision works well enough to be deployed everywhere, in a variety of contexts,” said co-author Olga Russakovsky, associate professor of computer science at Princeton. “This means that now is the time to talk about how it affects the world and to think about these issues of fairness.”
In a new paper, the ImageNet team systematically identified non-visual concepts and offensive categories, such as racial and sexual characterizations, among ImageNet's person categories and proposed removing them from the database. The researchers also designed a tool that allows users to specify and retrieve sets of images of people that are balanced by age, gender and skin color, with the goal of facilitating algorithms that classify people’s faces and activities in images more fairly. The researchers presented their work on January 30 at the Association for Computing Machinery's Conference on Fairness, Accountability, and Transparency in Barcelona, Spain.
“It is very important to bring laboratories and researchers with core technical expertise into this discussion,” Russakovsky continued. “Given that we need to collect data on a colossal scale, and that it will be done using crowdsourcing (because it is the most efficient and well-established pipeline), the question is: how do we do this in a way that is fairer and does not repeat familiar mistakes? This paper primarily focuses on design solutions.”
A group of computer scientists at Princeton and Stanford launched ImageNet in 2009 as a resource for researchers and educators. Princeton alumna and former faculty member Fei-Fei Li, now a professor of computer science at Stanford, led the initiative. To encourage researchers to create better computer vision algorithms using ImageNet, the team also launched the ImageNet Large Scale Visual Recognition Challenge. The competition focused mainly on object recognition using 1,000 image categories, only three of which featured people.
Some of the fairness issues in ImageNet stem from the pipeline used to create the database. Its image categories are taken from WordNet, an older database of English words used for natural language processing research. The creators of ImageNet adopted nouns from WordNet, some of which, although they are clearly defined verbal terms, do not translate well to a visual vocabulary. For example, terms that describe a person’s religion or geographic origin may retrieve only the most distinctive image search results, which can lead to algorithms that reinforce stereotypes.
A recent art project called ImageNet Roulette has drawn attention to these issues. The project, released in September 2019 as part of an art exhibition on image recognition systems, used images of people from ImageNet to train an artificial intelligence model that labeled people with words based on a submitted image. Users could upload an image of themselves and receive a label from this model. Many of the classifications were offensive or simply unfounded.
The main innovation that allowed ImageNet's creators to amass such a large database of labeled images was the use of crowdsourcing, in particular the Amazon Mechanical Turk (MTurk) platform, through which workers were paid to verify candidate images. This approach, although transformative, was imperfect, which led to some biased and inappropriate categories.
“When you ask people to verify images by selecting the correct ones from a large set of candidates, people feel pressure to choose something, and the images they choose tend to have distinctive or stereotypical features,” says lead author Kaiyu Yang, a graduate student in computer science.
In the study, Yang and his colleagues first filtered out potentially offensive or sensitive person categories from ImageNet. They defined offensive categories as those containing profanity or racial or gender slurs; sensitive categories included, for example, the classification of people based on sexual orientation or religion. To annotate the categories, they recruited 12 graduate students from diverse backgrounds, instructing them to err on the side of labeling a category as sensitive if they were unsure. This eliminated 1,593 categories, about 54% of the 2,932 person categories in ImageNet.
The researchers then turned to MTurk workers to rate the “imageability” of the remaining acceptable categories on a scale of 1 to 5. Keeping only categories with an imageability rating of 4 or higher left just 158 categories classified as both acceptable and sufficiently imageable. Even this carefully filtered set of categories contained more than 133,000 images, a wealth of examples for training computer vision algorithms.
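The two-stage filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual pipeline: the field names (`flagged`, `ratings`), the sample categories, and the use of a simple mean to combine the workers' ratings are all hypothetical, since the article does not say how individual ratings were aggregated.

```python
from statistics import mean

# Threshold taken from the article: keep categories rated 4 or higher.
IMAGEABILITY_THRESHOLD = 4.0


def select_categories(categories, threshold=IMAGEABILITY_THRESHOLD):
    """Return names of categories that are acceptable and sufficiently imageable.

    Each category is a dict with hypothetical fields:
      name    -- category label
      flagged -- True if annotators marked it offensive or sensitive
      ratings -- imageability ratings on a 1-to-5 scale
    """
    kept = []
    for category in categories:
        # Stage 1: drop categories flagged as offensive or sensitive.
        if category["flagged"]:
            continue
        # Stage 2: keep only categories whose average rating meets the threshold
        # (averaging is an assumption made for this sketch).
        if mean(category["ratings"]) >= threshold:
            kept.append(category["name"])
    return kept


# Made-up sample data, for illustration only.
example = [
    {"name": "programmer", "flagged": False, "ratings": [4, 5, 4]},
    {"name": "optimist", "flagged": False, "ratings": [1, 2, 1]},
    {"name": "flagged_example", "flagged": True, "ratings": [5, 5, 5]},
]

selected = select_categories(example)
print(selected)  # ['programmer']
```

In this sketch a flagged category is removed regardless of how imageable it is, which mirrors the order of the steps in the study: safety filtering first, imageability second.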
Within these 158 categories, the researchers studied the demographic representation of people in the images to assess the level of bias in ImageNet and to devise an approach for creating fairer data sets. ImageNet’s content comes primarily from image search engines such as Flickr. Search engines in general tend to return results that significantly overrepresent men, light-skinned people, and adults between the ages of 18 and 40.
“People have found that image search results are highly biased in terms of demographic distribution, so ImageNet also has a biased distribution,” says Yang. “In this paper, we tried to assess the level of bias, and also to propose a method that would balance the distribution.”
The researchers considered three attributes that are protected under U.S. anti-discrimination laws: skin color, gender, and age. MTurk workers were asked to annotate each attribute for each person in an image. They classified skin color as light, medium or dark, and age as child (under 18), adult 18–40, adult 40–65 or adult over 65.
Gender classifications included male, female and unsure, a way to include people with diverse gender expressions and to annotate images in which gender could not be perceived from visual cues (such as many images of children or scuba divers).
An analysis of the annotations showed that, like the search results, ImageNet's content reflects considerable bias. People annotated as dark-skinned, women, and adults over 40 were underrepresented in most categories.
Although the annotation process included quality controls and required annotators to reach consensus, the researchers chose not to release demographic annotations for individual images because of concerns about the potential harm of incorrect annotations. Instead, they developed a web-based tool that allows users to retrieve a set of images that is demographically balanced in a way the user specifies. For example, the full collection of images in the programmer category may be about 90% men and 10% women, while in the United States about 20% of programmers are women. A researcher can use the new tool to retrieve a set of programmer images that is 80% men and 20% women, or an even split, depending on the researcher's goals.
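The article does not describe how the tool performs the rebalancing. One simple way to obtain a subset with user-specified proportions is to subsample each demographic group, and the Python sketch below illustrates that idea using the programmer example. The function name `balanced_subset`, the input format, and the subsampling strategy are assumptions made for illustration; the per-image labels here are synthetic, since the researchers did not release per-image annotations.

```python
import random
from collections import Counter, defaultdict


def balanced_subset(images, target, seed=0):
    """Subsample images so that group proportions match `target`.

    images -- list of (image_id, group) pairs (hypothetical format)
    target -- dict mapping group -> desired fraction, summing to 1
    Returns the largest subset achievable by subsampling alone.
    """
    by_group = defaultdict(list)
    for image_id, group in images:
        by_group[group].append(image_id)

    # Largest total size for which every group still has enough images.
    total = min(len(by_group[g]) / frac for g, frac in target.items() if frac > 0)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subset = []
    for group, frac in target.items():
        # Small epsilon guards against floating-point results like 9.999999.
        count = min(int(total * frac + 1e-9), len(by_group[group]))
        subset.extend((image_id, group) for image_id in rng.sample(by_group[group], count))
    return subset


# Synthetic collection mirroring the article's example: 90% men, 10% women.
collection = [(f"img_{i}", "male") for i in range(90)]
collection += [(f"img_{i}", "female") for i in range(90, 100)]

subset = balanced_subset(collection, {"male": 0.8, "female": 0.2})
counts = Counter(group for _, group in subset)
print(counts["male"], counts["female"])  # 40 10
```

Subsampling keeps every image in the scarcest group and discards images from overrepresented groups, so the balanced subset is smaller than the original collection: here, 50 of the 100 images remain.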
“We don’t want to prescribe how to balance the demographics, because it’s not a very simple problem,” says Yang. “The distribution may be different in different parts of the world. For example, the distribution of skin colors in the US is different from the distribution in Asian countries. Therefore, we leave this question to our user and simply provide a tool to retrieve a balanced subset of images.”
The ImageNet team is currently working on technical updates to its hardware and the database itself, in addition to implementing the filtering of the person categories and the rebalancing tool developed in this study. ImageNet will soon be re-released with these updates, along with a call for feedback from the computer vision research community.
Princeton Ph.D. student Klint Qinami and associate professor of computer science Jia Deng co-authored the paper with Yang, Li, and Russakovsky. The study was supported by the National Science Foundation.
Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, Olga Russakovsky. Towards fairer datasets: filtering and balancing the distribution of the people subtree in the ImageNet hierarchy. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020. DOI: 10.1145/3351095.3375709