Translator from a language that cannot be spoken or written

Let's talk about motivation: what prompted us to take on this task. We will also touch on the theory of RSL, Russian Sign Language – I suspect few people are familiar with it. We will describe how we collected our own dataset for Russian sign language recognition, cover how we trained models for the task, share the results, and say a few words about our SignFlow family of models.

Motivation

In the summer of 2022, our team released the HAGRID dataset, which consists of 19 static gestures and includes half a million static images in total.

We immediately started receiving feedback from the community, along with two questions: is this dataset suitable for sign language recognition, and are there online translators that already solve this problem? The first question had a definite answer – no, the dataset is static while sign gestures are dynamic – but we simply had no answer to the second one. So we started researching.

In the SBOL app, if you provide documents confirming a hearing disability, you can call a sign language interpreter-consultant who will answer your questions over a video link. Alternatively, you can come to an office, having arranged in advance for a sign language interpreter to be invited, and they will remove the communication barrier between you and the consultant.

We also wondered: what if someone needs the help of a sign language interpreter during the week? Here is a telling number: there are only about 60 sign language interpreters in Moscow, while the city has roughly 40,000 hard of hearing people, about 10% of whom are deaf.

The next question was whether anything already existed that could solve this problem at least partially. But no matter how hard we searched, we only found headlines claiming that scientists had created a Russian sign language translator. Beyond the articles themselves, unfortunately, we found nothing: no source code and no web application.

We also faced another big problem – the lack of good, open datasets.

Datasets do exist, but in our view they are not well suited to our task. In the table below, we have gathered the sign language recognition datasets from around the world.

They are collected in three ways:

  1. Manual recording. You invite a sign language interpreter or a native signer, record them on camera, and collect the data that way. This approach has many drawbacks: it is slow, because each person has to be recorded separately; you have to find such people and bring them in; and it is quite expensive.

  2. Downloaded from the Internet. Slightly less than half of the existing datasets were collected this way: people parsed sites such as Spreadthesign, which hosts template videos of gestures in many of the world's sign languages. Video hosting platforms like YouTube also came in handy.

  3. Crowdsourcing. A crowdsourcing platform lets you collect data from many people in parallel: a contributor can simply sign up at home and record data. Only one dataset, FluentSigners-50, has been collected this way. And out of this entire pool of datasets, only four target Russian sign language.

Let's take a closer look at them.

The first one, the RuSLan dataset, was collected in 2020 and contains 164 classes, which in sign language are called glosses: one gesture can be translated, or abstracted, as one word of the Russian language. The dataset was recorded by only 13 signers. The main problem with such datasets is the small number of people in the sample: for a model to generalize across signers, it needs to see many different people, and that requires a huge amount of data.

You may notice a record holder in our table: 43,000 videos in the FluentSigners-50 dataset. But it has a problem of its own – very few classes.

The number of classes can be thought of as the size of your dictionary, i.e. how many gestures you can translate into Russian words. The record holder here is the RSL dataset, which includes 1,000 gesture classes.

But if we compare these datasets by their strengths, the RuSLan dataset, unfortunately, loses on every count.

The K-RSL dataset consists of a large number of videos, but was recorded by only 10 people and has a small number of words.

The RSL dataset contains 1,000 words and 35,000 videos, which is quite a lot. But those videos feature only 5 people, so it will not work either: the model would learn these 5 signers and fail to predict gestures correctly for anyone else.

And the last dataset is FluentSigners-50. It contains a lot of videos, but unfortunately its dictionary is very small – only 278 words – although 50 people took part in its creation, which is quite a lot. So we needed to collect a dataset that would satisfy all of these criteria at once.

What is RSL and how to understand it

The term “RSL” stands for Russian Sign Language. It is a full linguistic system with its own vocabulary, grammar and rules, used for communication by deaf and hard of hearing people. While researching approaches to translating sign language into Russian, we discovered a number of problems. Let's look at them in order.

Dactyl

Dactyl is fingerspelling: each gesture corresponds directly to a letter of the Russian alphabet. The image above shows the gestures of the Russian dactyl alphabet. Dactyl is needed for proper names and other words that have no gesture of their own: surnames, first names, patronymics, pet names, car brands, metro stations, flight numbers. It is also used for words for which a gesture simply has not been invented yet.

Leading hand

Each of us is right-handed or left-handed and does things, writing for example, with whichever hand is more comfortable. The same is true for deaf and hard of hearing people: they sign with either the left or the right hand. But one clarification is needed: if a person starts signing with the left hand, they should keep signing with the left hand. Alternating hands is like switching from Russian to English or German in the middle of a sentence – people simply won't understand.

Dialects

The next big problem of sign languages is their heterogeneity. In Russia, for example, there are three main dialects: central (Moscow), St. Petersburg, and Far Eastern (Novosibirsk). Some gestures are shown differently in different dialects. But dialects are not even the main issue.

Variability of gestures

The same word can be shown in different ways. Sign language appeared long before the Internet, so many local dialects formed: people in central Russia could not communicate with people in the Far East and invented gestures on their own. For example, the gesture for “bread” is shown in Moscow as a loaf, and in the Far East as “cutting” – each has its own logic. Over time, these small local dialects merged into the three large ones. New variants also appear with new technology. Before smartphones, there was a gesture for “write”, shown like writing a letter with a pen. After phones appeared, this gesture acquired a new, context-dependent variant: “write me a letter” and “write to me on Telegram” are now different gestures, i.e. a new gesture for “write on the phone” emerged. And before this new “phone” gesture appeared, the word was shown with dactyl.

Compound gestures

The gesture for “calf” is composed of two simple gestures: “small” and “cow”. Likewise, the gesture for “menu” is composed of two gestures: “food” and “list”.

Collecting a dataset

So, we have identified the problems and sorted them out. It is time to collect our own dataset for the Russian sign language task.

We identified several challenges in collecting a good dataset:

  1. Where to find people who know sign language – you can hardly recruit them on the street.

  2. How to get the right gesture – a specific variant in a specific dialect?

  3. How to explain the task correctly: why are we doing this and what do we need?

  4. How do we reward people for providing us with this data?

What we did to solve the problems:

  1. We found people who know sign language and rewarded them through crowdsourcing services such as ABC Elementary.

  2. We collected templates by parsing Spreadthesign, which hosts word templates for many of the world's sign languages. We also parsed video hosting sites such as Rutube and YouTube.

  3. We put together a small pool of the most common Russian words for which we could not find a template, and asked interpreters and native signers to record templates for them.

We write down the rules

We decided to completely exclude compound and dactyl gestures (dactyl gestures are the letters of sign language) and collect only simple ones. The reasoning: a compound gesture consists of two or three simple gestures, and a dactyl word consists of a sequence of letters, each of which can be collected separately, just like a simple gesture.

In addition to the gestures themselves, we also wrote down rules for the data: we needed video of a certain format and quality:

  • Framing. The hands must not leave the frame, since they are the most important part of the body conveying information about the gesture.

  • Gesture variants. The gesture must strictly follow the template: as we wrote earlier, we want exactly the variants shown in the templates.

  • Speed. People who have known sign language from a young age sign very quickly – literally two or three gestures per second; if you have ever watched deaf or hard of hearing people communicate, you may have noticed how fast they sign. People who learned sign language later first think about which gesture to show and only then show it, just as you or I do when speaking a foreign language. It would be wrong to force slow signers to sign faster, so instead we asked fluent signers to slow down to an average pace.

We set the task

Deaf and hard of hearing people can read, so all instructions on a crowdsourcing platform can be given as text. Thanks to this, we added a training stage consisting of several example tasks with explanations of how to complete them. We also added an exam and honeypots.

Above is an example of what the exam looks like: we show a video of a person making a gesture and offer a set of classes with one correct answer. A user is allowed to complete tasks and record data only after answering more than 80% of the exam tasks correctly. But crowdsourcing services do not protect against googling and other kinds of cheating, so we added honeypots.

Honeypots are the same “which gesture is shown in the video” tasks, except that we know the correct answer in advance. If a user fails several honeypots, they are banned: it means they are either clicking through tasks at random or do not actually know sign language.
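
To make these admission rules concrete, here is a minimal Python sketch. The 80% exam threshold comes from the text above; the function names and the ban threshold are illustrative assumptions, not our actual platform logic.

```python
# Minimal sketch of the crowdsourcing quality-control rules described above.
# The 80% exam bar comes from the text; the ban threshold is a hypothetical value.

EXAM_PASS_RATE = 0.8
MAX_HONEYPOT_FAILS = 3   # assumption: "several" failed honeypots

def passes_exam(correct: int, total: int) -> bool:
    """Admit a worker to the main task only after answering >80% of exam tasks."""
    return total > 0 and correct / total > EXAM_PASS_RATE

def should_ban(honeypot_fails: int) -> bool:
    """Ban workers who repeatedly fail control tasks with known answers."""
    return honeypot_fails >= MAX_HONEYPOT_FAILS

print(passes_exam(17, 20))  # True  (85% of exam answers correct)
print(should_ban(1))        # False (one miss is not yet a ban)
```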

My colleague Karina talked about how we labeled the data and prepared it for model training in her talk “Recipe for perfect markup in computer vision”.

As a result, we collected the SLOVO dataset

  • Consists of over 20,000 videos and 1,000 classes.

  • More than 200 people took part.

  • The dataset consists mainly of HD and Full HD videos recorded at 30 FPS, so that all videos have the same frame rate.

  • The total length of the dataset is about 20 hours, and the average duration of one gesture is 1.5 seconds. That is actually quite long: a fluent signer can show three or more gestures in a second and a half.

Models for solving RSL problems

Now let's look at what models can be trained on this set.

The first thing that comes to mind is to use models from computer vision. I think many of you have heard about the task of image classification, for example: what is in the image – a cat or a dog. This task is solved by convolutional neural networks or visual transformers, as in the example above.

Visual transformers process the image in pieces, i.e. patches. In the first example, the image is first processed with four patches; then, so as not to lose the context of the whole image, the patches are made smaller and also capture the regions where the original patches met. Thanks to this, the visual transformer builds up an understanding of what is happening in the image.
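
To illustrate what “patches” mean in practice, here is a minimal PyTorch sketch of our own (not the authors' code) that cuts a 224 × 224 image into 16 × 16 patch tokens; the sizes are illustrative.

```python
import torch

image = torch.randn(3, 224, 224)              # one RGB image: C, H, W
patch = 16                                    # side of a square patch in pixels

# slide a non-overlapping 16x16 window over height and width
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)   # 3, 14, 14, 16, 16
tokens = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)

print(tokens.shape)                           # torch.Size([196, 768]): 196 patch tokens
```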

But we don't have a picture – we have a video. What to do in this situation is simple and complex at the same time: we simply add one more dimension, depth, to our convolutions. Where an image gave us a square matrix, a video gives us a cube. We take three frames and extract information about that piece of the sequence, then the next three frames, and so on.

The second step is to process the sequence where these groups of frames overlap. As you can see, it is the same algorithm and the same idea, only with a new dimension T added – the temporal one.
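
As a simple illustration of this extra dimension, here is a PyTorch sketch (sizes are illustrative, not the authors' exact configuration) comparing a 2D convolution over a single frame with a 3D convolution over a whole clip:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 32, 224, 224)        # batch, channels, T frames, height, width

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)            # looks at one frame
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)    # looks at 3 frames at a time

single_frame = conv2d(clip[:, :, 0])          # (1, 64, 224, 224): no temporal context
whole_clip = conv3d(clip)                     # (1, 64, 32, 224, 224): the T dimension is kept

print(single_frame.shape, whole_clip.shape)
```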

The most important thing in training is augmentation; we used two kinds of it (a sketch of the resulting pipeline follows this list).

  1. “Horizontal flip” – used to handle the leading-hand problem, right or left. The flip breaks the fixed assignment of right- and left-handed gestures, so gestures appear randomly with either hand; this must be taken into account so the data is still interpreted correctly. For example, here is the “heart” gesture:

  2. Image quality degradation. This is needed so that the model can later be used for streaming from low-quality webcams or in poor lighting, where noise and other distortions appear. We used the following:

  • Changes in the color scheme, the so-called color jitter. For example, you may have warm yellow lighting at home, or plants on the windowsill lit by an ultraviolet lamp that makes the room glow blue.

  • Random crop capturing 80% of the image. We did this so that the model does not assume the person is always centered in the frame. Since the dataset was collected on crowdsourcing platforms, people recorded themselves on their phones: they put the phone in front of them, stepped back and showed a gesture. In real life, however, a person may stand slightly to the right or left of the camera, and this had to be handled. The resulting frames were fed to the model at 224 × 224 pixels, which is quite enough for the visual transformer to extract information from the sequence, and we fed 32 such 224 × 224 frames to the input – a lot of useful information can be extracted from that amount of data.

  • Frame sampling with a step of 1 or 2. This helps when a dataset contains videos with different frame rates: for a 30 FPS video you take every frame, while for a 60 FPS video you take every other frame, so 60 frames per second are effectively rolled up into 30. It also lets us feed the model slightly more or less data so that it does not overfit to a particular frame rate. We trained the model for 120 epochs, where an epoch is one full pass over all the data in the dataset.
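
Below is a minimal sketch of such an augmentation pipeline in PyTorch/torchvision. The exact parameter values are illustrative assumptions, not the authors' training configuration.

```python
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                          # leading-hand invariance
    T.ColorJitter(brightness=0.3, contrast=0.3, hue=0.1),   # lighting / color changes
    T.RandomCrop(180),                                      # roughly 80% of a 224-pixel frame
    T.Resize((224, 224), antialias=True),                   # back to the model input size
])

clip = torch.rand(32, 3, 224, 224)       # T, C, H, W, pixel values in [0, 1]
clip = augment(clip)                     # the same flip/crop is applied to every frame

# temporal sampling with a step of 1 or 2, e.g. to equalize 30 and 60 FPS sources
step = 2
sampled = clip[::step]
print(clip.shape, sampled.shape)         # (32, 3, 224, 224) and (16, 3, 224, 224)
```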

So, the input to our model is a tensor of size channels × sequence length × height × width. For the baseline approach we used MViT, the Multiscale Vision Transformer.
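
As a concrete reference point, here is a sketch using torchvision's off-the-shelf `mvit_v2_s` as a stand-in for the authors' model. Note that torchvision's default configuration expects 16-frame clips, whereas the authors fed 32 frames; the 1,000 output classes here simply match the SLOVO dictionary size.

```python
import torch
from torchvision.models.video import mvit_v2_s

# randomly initialized MViT with a 1000-class head as a stand-in
model = mvit_v2_s(weights=None, num_classes=1000)

# batch x channels x sequence length x height x width
clip = torch.rand(2, 3, 16, 224, 224)

logits = model(clip)
print(logits.shape)          # torch.Size([2, 1000]): one score per gesture class
```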

In fact, the model is quite large: it consists of three scale blocks that step by step reduce the image size, and at the end there is one linear layer that solves the main task of sign language recognition.

In a more compact form, it looks like this: we have an encoder consisting of 16 consecutive blocks, which are then grouped into three scales, and one linear layer.

Our model is thus a huge encoder plus one linear layer, which accounts for only about 0.01% of all the model weights.

At the output we get the probabilities of the gestures predicted by the model.
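
Turning the raw class scores into probabilities is a standard softmax step; here is a tiny self-contained sketch with random logits standing in for real model outputs:

```python
import torch

logits = torch.randn(1, 1000)                 # stand-in for the model's output scores
probs = torch.softmax(logits, dim=-1)         # per-class gesture probabilities, sum to 1

top5 = torch.topk(probs, k=5, dim=-1)
print(top5.indices[0].tolist())               # the five most likely gesture classes
print(top5.values[0].tolist())                # and their probabilities
```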


We trained a good model on our dataset – good because we drew attention maps and looked at how the model watches a gesture video: the network looks mostly at the hands, the main part of the body conveying information about the gesture. The first video shows the gesture “Cup”. The next one is “List” – a fairly complex gesture, yet the model focuses on the hands and mainly tracks their movement as the gesture is performed. The last gesture, “Cake”, is very easy to show, and here too the model looks mostly at the hands.

Results

We wrote two papers: one was submitted to CVPR, a top computer vision conference held in Canada, and presented online. The second went to the ICVS 2023 conference in Vienna – it was accepted unanimously, and we presented in person what we had achieved.

After these conferences we thought: if we have a good encoder and the sign recognition task is, in essence, solved by a single linear layer, why not try to train our model to recognize the sign languages of other countries?

First, we froze our encoder and trained just one linear layer on the WLASL-100 dataset, and within 30 epochs we reached SOTA (state of the art).
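
A minimal sketch of this transfer step, again with torchvision's `mvit_v2_s` standing in for our encoder: freeze everything except a new 100-class head for WLASL-100. Parameter names and the learning rate are illustrative assumptions.

```python
import torch
from torchvision.models.video import mvit_v2_s

model = mvit_v2_s(weights=None, num_classes=100)     # WLASL-100: 100 gloss classes

# freeze the encoder; leave only the classification head trainable
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)    # illustrative learning rate
print(sum(p.numel() for p in trainable), "trainable parameters")
```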

This dataset also has a second variant, WLASL-2000, with a vocabulary of 2,000 words. That is a huge dictionary, so here we had to train the encoder as well. After about a week we beat all previously submitted solutions – it turned out we were the best in the world at recognizing American Sign Language.

A model built on Russian sign language shows excellent results on other sign languages. So we can proudly place our dataset alongside the most popular computer vision datasets: ImageNet for classification, COCO for detection, Kinetics-700 for action recognition, SA-1B – the dataset the Segment Anything model was trained on – and our SLOVO dataset for sign language recognition.

SignFlow Model Family

Let's take the well-trained encoder and remove everything else from the model. It is a very good feature extractor that understands what is happening in the frame and turns it into useful features. Instead of the linear layer, we attach a text decoder: the encoder feeds it sequence features, and the recurrent text decoder predicts a sequence of words.

With such a model we can translate full Russian sign language into whole sentences rather than isolated words.
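
Conceptually, the decoder could look like this minimal PyTorch sketch; all names and sizes are illustrative assumptions, not the actual SignFlow implementation.

```python
import torch
import torch.nn as nn

class SignToTextDecoder(nn.Module):
    """Recurrent decoder that maps encoder features to word-token logits."""

    def __init__(self, feat_dim: int = 768, hidden: int = 512, vocab_size: int = 30000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)   # consumes the feature sequence
        self.proj = nn.Linear(hidden, vocab_size)                # predicts word tokens

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(features)          # features: (batch, time, feat_dim)
        return self.proj(out)                # (batch, time, vocab_size)

features = torch.randn(1, 20, 768)           # e.g. 20 feature vectors for one signed sentence
logits = SignToTextDecoder()(features)
print(logits.shape)                          # torch.Size([1, 20, 30000])
```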

And if we can plug in a text decoder, we can also plug in a speech decoder, for example one based on the Tacotron model.

Thanks to this, we will be able to voice gestures shown by hearing-impaired or deaf people.

Conceptually the idea looks like this.

We have a strong pre-trained encoder in the form of the MViT network and three heads: one predicting the sequence of words, one producing the direct translation, and one generating a waveform to voice the gestures.

But for full translation from Russian sign language, and then from Russian back into sign language, this is not enough. We also need a 3D avatar that can show gestures and a model that can predict the sequence of its actions.

As a result, we can create a full-fledged system of two-way translation of sign language into Russian and vice versa.

There are many text translators between the world's languages, but, unfortunately, not a single translator from sign language. This is exactly the problem we are striving to solve, so that we can remove the barrier that deaf and hard of hearing people face in the simple need to communicate and exchange information.

What else we have written about the sign language translator:

GigaChat and Russian Sign Language

Recognition and translation of sign languages: a review of approaches
