How I made a neural network to evaluate images from simple webcams
With roughly this much “Kaggle experience” behind me, I tackled my first production machine learning model from scratch. It turned out to be an interesting ride. But first things first.
Prologue. “A system somewhat similar to Tinder”
Monitoring the quality of online lessons at Skyeng began in 2015 – that is when the quality control department, where I work as an analyst, appeared. Our department’s job is to make sure every lesson is comfortable for the student. To do this, we analyze a sample of the audio (this helps us understand how long the student and the teacher speak, and in which language), and we also check the visual side of the lesson – how the teacher looks on camera, their background, and so on.
In particular, at one point we realized that the lighting of the picture from the teacher’s camera affects conversion to the next payment.
The numbers are not huge, but they still affect LTV. Lighting standards exist in education for a reason: good lighting helps concentration. There is a second, less scientific but no less important reason – emotions. When you are talking to a person, it is important to see their reaction.
Which is why you would not want to be listened to by a face like this (photo by Malik Earnest)
In the early years, we managed to review most lessons by hand. We wrote a system that we dubbed “tinder”: three screenshots from a lesson were sent to three different assessors, who rated the image quality:
- swipe to the left – something is wrong with the picture,
- swipe to the right – a good image in the lesson.
But the school grew. By 2019, the number of lessons running concurrently was in the hundreds. Manual review covered only about 5% of lessons, and a “bad” grade did not tell us what exactly was wrong. We needed to quickly understand whether the teacher was distracted by gadgets or by conversations with other people, whether the room they sat in was tidy, whether their whole face was visible – and a dozen more factors that affect the perception and impression of the class.
We decided to automate the process by teaching the machine to immediately “say” what exactly is good or bad in the frame. And, among other parameters, determine the type of lighting on the teacher’s face.
Two types of people, three types of faces
It seems that in the age of Instagram everyone has grown used to rich, vivid images, but getting that kind of picture out of an inexpensive laptop camera is a challenge. Here is why.
Image quality depends, in particular, on how much light enters the lens. The cameras built into many laptops go by the overall exposure of the frame, and when trying to bring the exposure back to normal, the camera does not always pay “attention” to the person.
So if you sit with your back to a window, the camera decides the picture is too bright and compensates, turning you into a dark silhouette.
We call this underexposed.
This phenomenon has an antipode: you sit in the dark and the only thing illuminating your face is the monitor. You can guess what happens: the camera decides the picture is too dark and brightens the whole frame, turning your face into a white blob.
Meet – overexposed.
There is also a “normal face type” – that is, everything is fine with lighting.
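The auto-exposure effect described above can be illustrated with a toy simulation. This is my own sketch, assuming (simplistically) that the camera just rescales the frame so the global mean brightness hits a fixed target – real cameras are more sophisticated, but the failure mode is the same:

```python
import numpy as np

TARGET_MEAN = 128.0  # assumed exposure target for the whole frame

def auto_expose(frame: np.ndarray) -> np.ndarray:
    """Rescale the frame so its global mean brightness hits TARGET_MEAN."""
    gain = TARGET_MEAN / frame.mean()
    return np.clip(frame * gain, 0, 255)

# Backlit scene: a bright window (value 250) fills most of the frame,
# the face (value 120) occupies a small patch in the middle.
frame = np.full((100, 100), 250.0)
frame[40:60, 40:60] = 120.0  # the face region

corrected = auto_expose(frame)
# The global mean is now "correct", but the face was dragged down with
# the rest of the frame and becomes a dark silhouette.
print(corrected[40:60, 40:60].mean())
```

Swap the values (dark room, monitor-lit face) and the same logic produces the overexposed case: the camera boosts the gain and the face clips toward white.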
I needed to teach the machine to find problematic screenshots in the total mass of screenshots from lessons.
We cannot simply supply all the cameras ourselves. In Skyeng’s case, many instructors are contractors, not employees, and can pause or end their cooperation with us at any time. So even with the best intentions, it is unclear how to organize handing over equipment, ensuring its return, repairs, and so on. And, frankly, with the number of students (and then teachers) multiplying, as has already happened, this would put a noticeable strain on budgets and create enormous logistical difficulties.
How we looked for a needle in a stack of screenshots
At the time the project started, we had quite a lot of material – screenshots that had been sent to the quality department’s assessors. But it still had to be labeled.
Realizing that not all teachers have lighting problems, I wanted to avoid wasting the assessors’ time and discard normal screenshots automatically.
This is how the computer vision algorithm based on OpenCV appeared.
Our algorithm detected the person’s face in the picture and converted the image to black and white. Given the average brightness of the face and the average brightness of the rest of the image, you can spot the problem: if the face is brighter, the shot is overexposed – and vice versa.
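The heuristic above can be sketched in a few lines. In this sketch the face bounding box is hard-coded; in practice it would come from a detector such as OpenCV’s Haar cascade. The `margin` threshold is my own illustrative assumption:

```python
import numpy as np

def classify_exposure(gray: np.ndarray, face_box, margin: float = 30.0) -> str:
    """Compare mean face brightness against mean background brightness.

    gray     -- grayscale image as a 2D array
    face_box -- (x, y, w, h) box from a face detector
    """
    x, y, w, h = face_box
    face_mean = gray[y:y + h, x:x + w].mean()
    background = gray.astype(float)
    background[y:y + h, x:x + w] = np.nan  # exclude face pixels
    bg_mean = np.nanmean(background)
    if face_mean > bg_mean + margin:
        return "overexposed"
    if face_mean < bg_mean - margin:
        return "underexposed"
    return "normal"

# Dark silhouette (value 40) against a bright window (value 230)
gray = np.full((100, 100), 230.0)
gray[30:70, 30:70] = 40.0
print(classify_exposure(gray, (30, 30, 40, 40)))  # underexposed
```

Note how the wall problem described below falls straight out of this rule: the comparison only sees face versus background brightness, not actual exposure.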
But this method had its own problems.
One of the problems was … the walls. If a person with a light face sat against a dark wall, the algorithm automatically declared the face overexposed. And vice versa – only a very pale person could sit in front of a white wall without the image being flagged as problematic. For dark-skinned teachers, the algorithm also regularly detected non-existent brightness problems.
In short, OpenCV was not the way for us. Being a classical algorithm, it also did not always find the face in the picture. But it did narrow the search from millions of screenshots down to several tens of thousands. That was enough to form a training sample, and to finish the project we decided to:
- manually validate the labeled material,
- try classical convolutional neural networks, which promised to behave more stably at face detection.
You can’t just take and get 5K labeled screenshots
Before this I had always worked with ready-made datasets, so I had only a rough idea of the labeling process. It seemed simple: formulate the task, hand the images to the assessors, get the result, train the model – done.
I took about 50 thousand screenshots, described the task – and … within a few days we had received only 300 screenshots per category, a lot of feedback along the lines of “I don’t understand what your overexposed means”, and many, many contradictory assessments of the same screenshots.
But there was already enough material for something like an MVP. Despite its modest accuracy, the model picked up the right features; it just needed more data to improve. From that first round I learned that people are subjective when judging images, that labeling tasks should also be formulated according to SMART, and that I should keep my wording simpler.
For the second approach:
- I made a qualification test to screen assessors: those who failed it were not allowed to label the main bulk of screenshots. The test doubled as training – like a tutorial in a game.
- I added cross-checking: each screenshot could be assessed by up to five people, and the final class was decided by their combined votes.
The output was a good result: a CSV file with more than two thousand examples per category, each with a confidence percentage.
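The vote aggregation can be sketched as a simple majority rule with the winning share recorded as confidence. The function and label names here are illustrative, not the actual pipeline code:

```python
from collections import Counter

def aggregate(votes: list[str]) -> tuple[str, float]:
    """Collapse up to 5 assessor votes into a final label and a confidence share."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)

# 4 of 5 assessors called this screenshot overexposed
votes = ["overexposed", "overexposed", "normal", "overexposed", "overexposed"]
label, confidence = aggregate(votes)
print(label, confidence)  # overexposed 0.8
```

The confidence share is also useful downstream: low-agreement screenshots can be dropped from the training sample or sent back for another round of labeling.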
There was now enough material for a neural network – and I built a model on ResNet.
I trained it, tuned the parameters, experimented with loss functions and different regularization penalties – and the model performed well.
Everyone was very pleased, and besides the validation dataset people played with their own photos – uploading portraits of themselves at their laptops to check that the model worked as it should.
How to run screenshots through a model
An instantaneous result was not critical: we decided to accumulate screenshots over the day, spin up a service once a day, run the images through the model, and record the result as one of the parameters of the lesson assessment. Then, combining assessments from different sources, we build an overall monthly report on each teacher’s work.
We use AWS infrastructure, and one of the “obvious” solutions recommended both by Amazon and by the collective mind of the internet is SageMaker.
To be or not to be…
In practice, it turned out not to suit us. In our case we would have had to pack the model into a Docker container, create some kind of API, set up an extra server to feed screenshots to that API, and stand up an intermediate database. And with the API always on, we would be paying for idle capacity – so nothing would remain of the solution’s advantages.
We have material and we need to process it somehow. It shouldn’t be difficult.
I didn’t want to spend extra time, money and other resources, so we rented an EC2 virtual machine and wrote a simple pipeline:
- start the server on a schedule,
- connect to the database,
- get the list of screenshots since the last run,
- upload them to the server and process them in one pass,
- write the results to the analytics database,
- clean everything up.
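The steps above can be sketched as a small batch job. Everything here is a placeholder – the function names, the checkpoint scheme, and the stubbed DB/model calls are illustrative, not the actual Skyeng code:

```python
def fetch_new_screenshots(last_id: int) -> list[tuple[int, str]]:
    """Stub for the DB query: screenshots added since the last run."""
    return [(last_id + 1, "s3://bucket/shot1.png"),
            (last_id + 2, "s3://bucket/shot2.png")]

def score(url: str) -> str:
    """Stub for inference: would download the image and run the model."""
    return "normal"

def run_batch(last_id: int) -> int:
    """Process everything since the last run; return the new checkpoint."""
    results = [(sid, score(url)) for sid, url in fetch_new_screenshots(last_id)]
    for sid, label in results:
        # In the real pipeline: INSERT into the analytics database
        print(sid, label)
    # Local files would be cleaned up here before the server shuts down
    return max(sid for sid, _ in results)

new_checkpoint = run_batch(100)
```

The schedule itself can live outside the code entirely – a cron rule or an EventBridge trigger that starts the EC2 instance, runs the job, and stops it.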
Everything is simple and easy to maintain.
What is the result?
- I built my first production model; it works with 90% accuracy and covers all of our lessons.
- Several such models have completely replaced manual labor: they track many visually important things – the presence of a person in the frame, the absence of gadgets and strangers, and other criteria. Among other things, we gained a feedback tool for students.
- I also learned that the 80/20 rule really exists: most of my time went into communication, preparation, and organizational issues rather than the model itself. But the main thing is that the result is pleasing.