How facial recognition systems work

Face recognition is one of the application areas of data science. Thanks to it, Moscow’s cameras are recognized as among the most advanced in the world. With face recognition, catching criminals and logging in to applications becomes easier, while hiding from justice and impersonating another person becomes harder. Together with expert Vadim Lukmanov, we explain at a basic level where face recognition systems are used and how they work.

Where is facial recognition used and what is it?

Face recognition can be understood as one of two tasks:

verification – comparing two photographs to determine whether they show the same person;
identification – searching for a person in an existing database or collection of photographs.

What unites these tasks is that they can be performed by the same neural network.

Verification follows the algorithm below. When a user needs to gain access somewhere using a photo of their face, for example in a banking application with biometrics, the following happens:

the application takes a picture of the user and automatically extracts an embedding from the image – a vector of numbers that describes the face;
an earlier photograph of the person is retrieved from the bank’s database;
the embeddings of the two images are automatically compared;
if the Euclidean distance between the two embeddings is below a set threshold, the system concludes that the photographs show the same person and confirms the operation.
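The verification steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the four-number embeddings and the threshold value are invented for the example, and a real embedder outputs vectors with hundreds of values.

```python
import math

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_same_person(probe, reference, threshold):
    """Return True when the two embeddings are closer than the threshold."""
    return euclidean_distance(probe, reference) < threshold

# Toy embeddings (hypothetical values, for illustration only).
stored = [0.10, 0.80, 0.30, 0.50]    # from the earlier photo in the database
selfie = [0.12, 0.79, 0.33, 0.48]    # from the photo just taken
stranger = [0.90, 0.10, 0.70, 0.05]  # from a photo of another person

THRESHOLD = 0.5  # hypothetical; in practice tuned on validation data

print(is_same_person(selfie, stored, THRESHOLD))
print(is_same_person(stranger, stored, THRESHOLD))
```

The threshold sets the trade-off: a lower value rejects more impostors but also rejects more genuine users.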

Identification is the search for a person by face. A database or photo gallery accumulates photographs of people, as with the photographs from the Moscow cameras mentioned above. The query is a new photo of a person, or rather the embedding extracted from it, which is compared with the embeddings of the images already in the database. Suppose we are looking for a person suspected of stealing goods from a store, or want to identify an offender among the fans at a football match: we take a picture of each visitor at the entrance. If the database contains a close enough embedding, the person in the photo can be refused entry to the match or detained. If no embedding is close enough, the person is not in the database, and law enforcement agencies are not interested in them.
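Identification as described here amounts to a nearest-neighbour search with a distance threshold. The sketch below assumes a small in-memory gallery stored as a dictionary. The names, embeddings and threshold are hypothetical. It scans every entry, which is only practical for small galleries.

```python
import math

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(query, gallery, threshold):
    """Find the gallery entry closest to the query embedding.

    Returns (name, distance) if the closest entry is within the threshold,
    otherwise (None, distance of the closest entry).
    """
    if not gallery:
        return None, math.inf
    best_name, best_dist = min(
        ((name, euclidean_distance(query, emb)) for name, emb in gallery.items()),
        key=lambda pair: pair[1],
    )
    if best_dist < threshold:
        return best_name, best_dist
    return None, best_dist

# Hypothetical gallery of people the system is looking for.
gallery = {
    "person_a": [0.1, 0.9, 0.2],
    "person_b": [0.8, 0.1, 0.6],
}
THRESHOLD = 0.3  # hypothetical value

visitor = [0.78, 0.12, 0.61]  # close to person_b
unknown = [0.40, 0.40, 0.90]  # not close to anyone in the gallery

print(identify(visitor, gallery, THRESHOLD)[0])
print(identify(unknown, gallery, THRESHOLD)[0])
```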

Face recognition can be used wherever there are cameras: outdoors, indoors, in stadiums, subways and surface public transport, airports, train stations, shops, banks and other public places. The corresponding programs and algorithms can be installed on desktop computers and mobile devices. They are also used in the Smart City system, in some intercoms, and in banking and other applications.

Facial recognition is used:

for security: searching for criminals, suspects and violators of public order, and protecting bank accounts against password cracking;
to search for missing people;
for advertising: recognizing a client at the entrance and making personal offers, gauging emotional reactions to a product or service, improving the quality of service;
in retail: contactless payment by face at the checkout, product offers and discounts based on purchase history, shorter queues.

How does face recognition work?

The face recognition system is based on a trained pipeline – a sequence of interconnected programs. It has several components:

Face detector. It is trained independently of the rest of the pipeline. To train and test the face detector, you need a dataset with annotated face rectangles – bounding boxes – and, preferably, annotated key points of the face: eyes, nose, corners of the mouth. If you don’t have your own dataset at hand, you can use the public WIDER FACE dataset, which contains more than 300 thousand annotated faces.

Detector architectures are rarely designed from scratch. Usually a public one is taken, for example MTCNN, RetinaFace, SCRFD or Yolov5Face. Depending on the application, the detector can be further trained on in-house data and optimized for speed.
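When a detector is tested against annotated bounding boxes, a predicted rectangle has to be compared with an annotated one. The text above does not name a metric. A common choice, assumed here, is intersection over union (IoU): the area shared by the two boxes divided by the area they cover together. The box coordinates below are invented for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 for both boxes.
    """
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

annotated = (10, 10, 110, 110)  # hypothetical box from the dataset
predicted = (20, 20, 120, 120)  # hypothetical detector output

print(round(iou(annotated, predicted), 3))
```

A prediction is then typically counted as correct when its IoU with an annotated box exceeds a chosen cut-off, often 0.5.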

Aligner. This part of the pipeline is the least important and usually doesn’t need to be trained. First, the face detector predicts a rectangle and the key points of the face, usually five. Then the aligner rotates and shifts the key points, and the face with them, to a reference position using an affine transformation.
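To illustrate what the aligner does, the sketch below fits a transformation that moves five detected key points onto reference positions. It is a simplified example under stated assumptions: it fits a similarity transform (rotation, uniform scale and shift), which is a special case of an affine transformation, by least squares. The reference coordinates are hypothetical, and the "detected" points are simulated rather than taken from a real detector. Points are treated as complex numbers, so the transform is z -> a * z + b.

```python
import cmath
import math

# Hypothetical reference positions (pixels) in a 112x112 crop:
# left eye, right eye, nose, left mouth corner, right mouth corner.
REFERENCE = [(38.0, 52.0), (74.0, 52.0), (56.0, 72.0), (42.0, 92.0), (70.0, 92.0)]

def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst points.

    Returns complex numbers (a, b) such that a point z maps to a * z + b.
    Assumes at least two distinct source points.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    z_mean = sum(zs) / len(zs)
    w_mean = sum(ws) / len(ws)
    num = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    a = num / den             # rotation and scale
    b = w_mean - a * z_mean   # shift
    return a, b

def apply_transform(a, b, point):
    """Apply z -> a * z + b to an (x, y) point."""
    w = a * complex(point[0], point[1]) + b
    return (w.real, w.imag)

# Simulate a detection: the reference layout rotated by 20 degrees,
# doubled in size and shifted.
tilt = 2.0 * cmath.exp(1j * math.radians(20))
detected = [apply_transform(tilt, complex(30, 15), p) for p in REFERENCE]

a, b = fit_similarity(detected, REFERENCE)
aligned = [apply_transform(a, b, p) for p in detected]
print("scale:", round(abs(a), 3))
print("rotation (deg):", round(math.degrees(cmath.phase(a)), 1))
```

In a real pipeline the same transform would then be applied to the image pixels, so that every face reaches the embedder in the same position.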
Embedder. It is also called the recognizer or the extractor of embeddings (descriptors). This is the most important part of the pipeline: its trained model is the largest and takes the longest to run. It needs three components:

A good neural network architecture. The architecture is usually taken from the leaders on ImageNet, one of the most important benchmarks in computer vision, or an ensemble of several architectures is used together with knowledge distillation.
A large dataset. A dataset usually consists of hundreds of thousands of photos of real people of different ages, genders and races, taken from different angles. The dataset can be an open one. Such datasets are hard to find now, but ones that were popular in the past include MS-Celeb-1M, VGGFace2, UMDFaces and MegaFace. You can also assemble the dataset yourself, buy it from specialized companies, or receive it from customers.
A good loss function. This is usually ArcFace or one of its refinements. The model is trained as a classification task in which each person in the dataset is a separate class.

After training, the last fully connected layer is cut off from the model, and what remains is the embedder.
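The ArcFace loss mentioned above can be sketched for a single training sample. The idea is to normalize the embedding and the per-class weights of the last layer, measure the angle between the embedding and each class, add a margin to the angle of the true class, and then apply the usual softmax cross-entropy. The default scale and margin below are commonly cited values and should be treated as assumptions. The two-dimensional vectors are invented for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_loss(embedding, class_weights, target, scale=64.0, margin=0.5):
    """ArcFace loss for one sample.

    embedding:     output of the network before the last layer
    class_weights: one weight vector per identity (the last layer)
    target:        index of the true identity
    """
    x = l2_normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(l2_normalize(w), x))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
        if j == target:
            # Additive angular margin on the true class only.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    # Numerically stable softmax cross-entropy.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[target]

# Toy setup: two identities, two-dimensional embeddings (hypothetical values).
weights = [[1.0, 0.0], [0.0, 1.0]]
sample = [0.9, 0.1]  # a photo of identity 0

# A small scale keeps the toy numbers readable.
print(arcface_loss(sample, weights, target=0, scale=10.0, margin=0.5))
print(arcface_loss(sample, weights, target=0, scale=10.0, margin=0.0))
```

The margin makes the loss stricter than plain softmax, which pushes embeddings of the same person closer together. Once training is finished, `class_weights` is the layer that gets cut off.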

Tracker. It is needed to follow the trajectories of people and is not used in every system. A bank’s mobile application can work without a tracker, but for face recognition with street cameras it is necessary, not least because it means the face detector does not have to run on every frame. In a store, a tracker can be used to determine the most popular shelves. Popular trackers are based on the Kalman filter and the Hungarian algorithm.
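In trackers of this kind, the Kalman filter typically predicts where each tracked person will be in the next frame, and the Hungarian algorithm matches those predictions to the new detections at minimal total cost. The sketch below covers only the matching step and is deliberately simplified: it uses a brute-force search over all assignments instead of the Hungarian algorithm. That gives the same optimal answer on small inputs but does not scale. The coordinates are invented, and the function assumes there are at least as many detections as tracks.

```python
import itertools
import math

def match_tracks(predicted, detections):
    """Assign each predicted track position to a distinct detection so that
    the total distance is minimal.

    Brute force over all assignments (a stand-in for the Hungarian algorithm).
    Returns ({track_index: detection_index}, total_cost). If there are fewer
    detections than tracks, returns ({}, inf).
    """
    best_cost, best = math.inf, ()
    for perm in itertools.permutations(range(len(detections)), len(predicted)):
        cost = sum(math.dist(predicted[t], detections[d]) for t, d in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return {t: d for t, d in enumerate(best)}, best_cost

# Hypothetical predicted centres of two tracked faces in the next frame.
predicted = [(105.0, 50.0), (196.0, 81.0)]
# Hypothetical detections in that frame; the third one is a new face.
detections = [(197.0, 80.0), (104.0, 51.0), (300.0, 10.0)]

assignment, cost = match_tracks(predicted, detections)
print(assignment)
```

Detections left unmatched, like the third one here, would normally start new tracks.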

When recognizing faces in video, embeddings are usually extracted from each frame separately, and then averaged into one aggregated embedding. In order for frames with a poorly visible face to contribute less, a small neural network is often trained, which assigns weights to each frame: the better the face is visible, the higher the weight.
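The weighted aggregation described above is a weighted average of the per-frame embeddings. In the sketch below the weights are supplied by hand. In a real system they would come from the small quality network, which is not shown here. All values are hypothetical.

```python
def aggregate_embeddings(frame_embeddings, frame_weights):
    """Weighted average of per-frame embeddings.

    frame_weights holds one non-negative weight per frame; a higher weight
    means the face is better visible in that frame.
    """
    if not frame_embeddings or len(frame_embeddings) != len(frame_weights):
        raise ValueError("need one weight per frame and at least one frame")
    total = sum(frame_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    dim = len(frame_embeddings[0])
    return [
        sum(w * emb[i] for emb, w in zip(frame_embeddings, frame_weights)) / total
        for i in range(dim)
    ]

# Three frames of the same person; the first frame is the sharpest.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.8, 0.1, 0.1]

print(aggregate_embeddings(frames, weights))
```

With equal weights this reduces to the plain average mentioned first.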

Weaknesses of modern face recognition

Facial recognition systems can already do a lot. But there are caveats.

First, existing systems show a racial bias: they recognize white people better than people of other races. This is a problem not of the algorithms but of the existing datasets on which most of these programs were trained and tested. Still, the shortcoming often leads to discrimination based on race and nationality: systems used in the justice system and in forensics have so far tended to single out Black people as more likely to be criminals.

Second, in recent years people have repeatedly found ways to bypass facial recognition by carrying out so-called adversarial attacks. In 2017, Yandex employee Grigory Bakunov invented makeup that deceived artificial intelligence algorithms and talked about it in the media.

The project was based on a genetic algorithm that, starting from the original photograph, selected an image that did not resemble the original. Based on the result, a makeup artist designed makeup for the particular person and applied it to the face. Bakunov subsequently closed the project on ethical grounds.

In 2019, Huawei Moscow employees proposed their own approach, a kind of invisibility cap. Rectangular paper stickers were printed on an ordinary color printer and stuck on a person’s headwear. This greatly degraded the quality of face recognition based on ArcFace.

In the long run, these attacks do not work: they are easy to counter, so a face recognition system is difficult to “break”. But making the work of artificial intelligence harder with dark glasses, makeup, hair or face masks is still possible.

Another pitfall is addressed by so-called liveness detection, that is, checking that a live person is present. It matters when a user wants to log in as someone else: the person wants to be recognized, but identified incorrectly. To do this, attackers use photos printed on a printer, photos of people shown on a device screen and, less often, silicone masks of people’s faces. To distinguish a static image in the frame from a live person, a separate liveness detector has to be developed.

Pros and cons of using face recognition systems

The ability to track a person by face greatly facilitates the search for criminals and for missing and abducted people. It lets you enter places by face: offices, educational institutions, public transport, airports, train stations and events. It lets you pay for goods and services without touching extra objects, which reduces the risk of transmitting infections, and use mobile applications without a password or fingerprint. Without facial recognition technology, a task such as finding people in a crowd would not be possible.

But there is a downside: facial recognition systems have repeatedly played a decisive role in the search for and detention of protesters in Moscow and Hong Kong, as well as during quarantine measures introduced because of the coronavirus pandemic. What is a convenient innovation for some people may turn into an opportunity for surveillance and unreasonable control over others. Lawyers and human rights activists have warned of this danger and the accompanying violations of human rights – the negative example of China is known to many.

Experts are sure that in the coming years facial recognition technology will become widespread all over the world, which threatens the loss of privacy in everyday life. The technology will improve, but its ethical and prudent use requires a balanced legal framework to protect data and privacy.