QoE metrics in the Yandex video platform

Hi, I'm Vasily Korovin, an analyst at Yandex Infrastructure. For three years now, I've been working on the video platform team and analyzing our player. This is the same web player that is used to play videos on various Yandex services (for example, Kinopoisk, Disk, Praktikum, and Pogoda). Since this year, it has also been available in Cloud Video, the cloud service for storing, processing, and broadcasting video.

We have talked in detail about the development of the player here and here. Today I want to tell you how we understand whether people like or dislike using it. For this, we will need the abbreviation QoE, which stands for "quality of experience": metrics of how users perceive the quality of our service.

Watch the video of the talk from VideoTech on YouTube

A little bit of theory and examples

There is one problem with QoE that is easier to explain with an example. Let's say you meet a colleague and ask: "How much has your boss grown over the last year?" He might answer something like five centimeters, even though he should have grown seven. And you both know perfectly well what you are talking about.

In the case of QoE, there is no single metric — it is rather a family of metrics, and many come up with something of their own. As is customary at Yandex, we also came up with our own, based on available knowledge.

QoE itself can be divided into two parts: Quality of Service, the objective metrics that we take from the logs, and User Experience, the subjective side: perceptions and surveys.

I would like to evaluate even the subjective happiness of users in numbers. The first thing many people arrive at is the Mean Opinion Score, or MOS. With it, we simply take the arithmetic mean of all the scores by which users rate the quality of the system's operation. Those scores can be about almost anything.
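
Before the real-life examples, here is what this averaging looks like in code; the five-point scale is just an assumption for illustration:

```python
def mean_opinion_score(ratings):
    """MOS: the arithmetic mean of the individual scores users gave."""
    return sum(ratings) / len(ratings)

# For example, five-star ratings like the taxi example below:
print(mean_opinion_score([5, 5, 4, 5, 3]))  # 4.4
```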

There are a couple of examples of how this works for subjective ratings. Here, Yandex Go taxi drivers rate me.

As you can see, I am the ideal passenger according to the last 40 taxi drivers.

What are the problems with surveys? First, they need to be conducted, which is an art in itself and one we do not always master. Second, as Dr. House said, people tend to lie. If you ask, "What do you think of Dom-2?", the rating will most likely come out much lower than the real one, because not everyone is ready to admit to such a love.

We've figured out what MOS is. Can we now express the MOS results we've obtained with some formula? In online games, there is the so-called G-model.

On the left side we see the results of those same surveys, and on the right a cubic polynomial, where x is a value calculated from ping and jitter. This model is said to give an accuracy of about 98%, and games like Quake 4 were evaluated with its help.
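
The general shape of such a model might look like the sketch below. The coefficients and the way ping and jitter are combined are placeholders, not the published model:

```python
def g_model_mos(ping_ms, jitter_ms,
                coeffs=(-1.0e-5, 1.0e-3, -0.1, 4.4),  # placeholder coefficients
                jitter_weight=1.0):                    # placeholder weighting
    """Cubic polynomial in x, where x is computed from ping and jitter,
    as described in the talk. All numbers here are illustrative."""
    a3, a2, a1, a0 = coeffs
    x = ping_ms + jitter_weight * jitter_ms
    return a3 * x**3 + a2 * x**2 + a1 * x + a0
```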

And there is another example that is closer to our subject. This is the Huawei metric used to measure video quality. It is based on a five-point, or rather a six-point scale, since it includes values from 0 to 5. Three factors are taken into account here.

  • Content – everything related to the content itself, such as lighting and resolution.

  • Download – the network side: download speed and even how content is selected when you search for it.

  • Playback – an assessment of the quality of playback itself.

This is the formula we get.

U\text{-}vMOS = (sQuality - 1)\left(1 + \frac{\alpha\,(sInteraction - 1) + \beta\,(sView - 1)}{4(\alpha + \beta)}\right)
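
Transcribed directly into code, the formula might look like this; the values of the α and β weights are placeholders, since the talk does not specify them:

```python
def u_vmos(s_quality, s_interaction, s_view, alpha=1.0, beta=1.0):
    """U-vMOS as transcribed above: sQuality, sInteraction, and sView are the
    three sub-scores on the 0-5 scale; alpha and beta are weights (placeholder
    values here)."""
    return (s_quality - 1) * (
        1 + (alpha * (s_interaction - 1) + beta * (s_view - 1)) / (4 * (alpha + beta))
    )
```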

These were interesting examples, but why are they needed and what do we want from our metrics?

The origin and evolution of proprietary metrics

Yandex has a lot of dashboards containing a lot of data; I myself have created more than ten of them in a short time. We wanted a single dashboard that would give us three things:

  • Clear information about the current state of the service.

  • Information that some problems have occurred, preferably with the ability to quickly understand the cause of these problems.

  • Information on how developers' code changes affect the operation of the service.

Here it is worth recalling how the video platform is structured. We have a client application that interacts with the Player SDK, which in turn interacts with two parts: the CDN and the video platform itself, where the content storage, transcoding, and everything else live.

Now let's turn to the stages of platform development to see the evolution of these metrics.

  1. The player appears. At that time we had almost no logs, and all we could look at was the status of CDN responses. The metric back then was the percentage of successful requests. The pros are obvious – we know what is going on with the CDN. The cons – we don't understand anything else. Strictly speaking, this stage is barely relevant to our topic, because there is almost no player here yet.

  2. Error logging appears. At this stage, the metric is the total number of errors. The pros – we already notice some causes of problems. The cons – essentially, all we see is whether the player works or not, with no intermediate states. In complex cases, we often don't know what's going on.

  3. We start collecting Heartbeat messages. They indicate that the player has worked successfully for 30 seconds or some other interval. The new metric is the total number of Heartbeats. It can already serve as a signal of negative or positive changes. Cons – we do not know the reasons for those changes. Besides, people obviously watch less at night than during the day, so a drop by itself often tells us nothing.

  4. We pay attention to "stalls". This is a calque of the English word stalled, which we use to denote playback delays. A new metric appears – the ratio of stall time to the total time the player runs. Clearly, the larger it is, the worse. We do not have any breakdown yet, we just see one big number. We already understand that something is good and something is bad, but where exactly is still unclear.

  5. Finally, logs took on their modern form, and the first important step in this process was made. We now have what is called a video viewing session: a combination of VSID (the identifier of the created player) and videoContentID (the identifier of the content being watched). If you watched two episodes of your favorite series in one player, those were two sessions. And we can now treat and evaluate each such session as a separate unit.

The second important step: we introduced the concept of events, and there are many of them: Start, SetSource, FatalError, PlayerAlive, and others. This gives us a new approach – per-session metrics, where we say: "Let's call every event good or bad and look at the last two events."

It is clear that Start is a good event and Error is a bad one. Now we look at the last two events of a session. If both are bad, the session is bad; if both are good, the session is good. Pros – we are already pretty good at identifying bad sessions, and some clarity is emerging. Cons – we need to constantly keep the event classification up to date, because new events appear; if we don't watch this, something bad can slip through. And there is still a problem with "good" sessions, because something bad could have happened in the middle.
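
A minimal sketch of this rule could look as follows; the split of events into good and bad here is only an assumption for illustration (in reality it lives in our event database):

```python
# Illustrative split only: the real classification is maintained separately.
GOOD_EVENTS = {"Start", "SetSource", "PlayerAlive"}
BAD_EVENTS = {"FatalError"}

def classify_session(events):
    """Per-session verdict based on the last two events of the session."""
    last_two = events[-2:]
    if last_two and all(e in BAD_EVENTS for e in last_two):
        return "bad"
    if last_two and all(e in GOOD_EVENTS for e in last_two):
        return "good"
    return "unknown"  # mixed or unclassified endings: the weak spot of this approach
```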

All this evolution has led us to formulate a new goal in the QoE matter.

So what should the metric be:

  • Simple: preferably something like “it works or it doesn’t work”, without hundredths of a percent.

  • Clear: when we look at it, it is desirable to understand within three seconds what happened.

  • Change-sensitive: this is about caring for fellow developers who make changes and want to see their effect immediately; otherwise the value of their work is unclear.

  • Correlating with our internal sense of beauty: in other words, would you want to show such a video session to your mother? Let's say we computed some metric and, based on it, we say: "Here is a good session," and it is, for example, four minutes long. We cannot take such a sin on our souls.

New understanding of QoE metrics

We brainstormed to understand what shapes our users' negative experience. At that point, we already had quite a lot of data; I wrote down only part of it, this is not everything.

From this list, we selected the five or six parameters that influence perception the most.

  • Initialization is the player startup: the loading that happens before the video starts.

  • Interruption is when pauses happen in the middle of a video, which is irritating.

  • A bounce is a session that had some traffic but no viewing time. It's over before it even started.

  • Presence of fatal errors – here are all FatalError events.

  • Video quality is a multidimensional concept, which we will discuss in more detail below.

For each of these criteria, we believe we can establish boundaries so that we can say: with these values it is good, with these it is average, and with these it is bad. At the same time, we introduce a color gradation: red, yellow, and green. Let's consider each criterion.

Initialization. This example shows the distribution of initialization time across video sessions. The picture looks roughly exponential, and it is on this distribution that we choose the boundaries of the criteria.

To define the boundaries, we start from the consideration that all three categories should be present in the picture: if we declare everything green, such a metric is not needed. Common-sense adequacy also matters: you cannot say that a session with a four-minute initialization is good. And we take into account responsiveness to our changes.

To determine the boundary between good and bad initialization, we conducted several A/B experiments: we artificially lengthened it and watched how users dropped off. In the end, we decided that initialization up to the 70th percentile is considered good, and everything above the 85th percentile makes the session bad.
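
A sketch of how such a classification could look; the thresholds come from the observed distribution as described above, while the millisecond units are an assumption:

```python
import numpy as np

def init_boundaries(init_times_ms):
    """Thresholds from the distribution of initialization times:
    good up to the 70th percentile, bad above the 85th."""
    p70, p85 = np.percentile(init_times_ms, [70, 85])
    return p70, p85

def init_color(init_time_ms, p70, p85):
    if init_time_ms <= p70:
        return "green"
    if init_time_ms <= p85:
        return "yellow"
    return "red"
```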

Buffering interruptions. On the chart, stalls are shown as bars: first, the longer the bars, the worse; second, the more bars, the worse. These two observations can be turned into something mathematical: what matters to us is the number of stalls (the user gets annoyed by many small pauses) and the length of the longest one (the user gets annoyed by one huge pause). And when there are many pauses and they are huge, we do not even consider that case.

Here we also built the distribution and adopted boundaries at the 70th and 85th percentiles. You may ask how that came about: the curves here simply turned out to be similar.
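
A possible sketch of these two per-session statistics, with the same percentile-based coloring applied to each; exactly how the two colors are combined is an assumption here:

```python
import numpy as np

def stall_stats(stall_durations_s):
    """Per-session statistics: how many stalls there were and how long the worst one was."""
    if not stall_durations_s:
        return 0, 0.0
    return len(stall_durations_s), max(stall_durations_s)

def color_by_percentiles(value, distribution):
    """Same idea as for initialization: green up to p70, yellow up to p85, red above."""
    p70, p85 = np.percentile(distribution, [70, 85])
    if value <= p70:
        return "green"
    if value <= p85:
        return "yellow"
    return "red"

def interruption_color(count_color, max_len_color):
    """Assumed combination: take the worse of the two colors."""
    order = {"green": 0, "yellow": 1, "red": 2}
    return max(count_color, max_len_color, key=order.get)
```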

Fatal error. It's simple here: if it exists, we think it's bad. If there is no fatal error, we think it's good.

Bounces. Here we again look at the sessions that had traffic but zero watch time, which we consider a bad result. In fact, we don't even dive into the nature of these bounces; we simply register that they exist. We don't know why they happen, but we consider it bad: it should not be like that.

Video quality. There are many ways to evaluate video quality. This is a task with existing solutions (such as VMAF), but most of them rely on frame-by-frame comparison, and checking every session that way is quite expensive and resource-intensive.

We wanted something that might be less precise but much cheaper to compute. So, to evaluate video quality, we looked at the metric from Mux. It scores the picture we show on a scale from 100 down to 0. To understand how it works, we need the concept of Upscale.

Let's say you're watching a movie in a container, and the video inside has a certain quality. If the container is smaller than or equal to the video resolution, we consider everything fine. If the container is larger, the picture gets stretched, and Upscale measures that stretch.

If your picture is stretched to twice its size, Upscale comes out as 1 – that is, the ratio of the container size to the video size, minus one. Quite a simple formula, I think.

But there is a question. Let's say our Kinopoisk collection has movies that are not of very high quality simply because they are old (say, you are looking for some French New Wave). For this case, we have the concept of MaxQuality – the highest quality available for that content. Let's complicate the formula a little:

We added MaxQuality to the formula alongside the container size and now take the minimum of the two. If the container is larger than the best available picture, we do not penalize ourselves for that and use the best available picture instead.

It would seem like a sound idea, but there is another possible case: some people deliberately set a lower quality. For example, I came home and switched to a lower quality to save traffic. Or maybe I forgot my glasses and don't need high quality if I can't see the difference anyway.

For such users, we have the concept of User Capping: the quality cap the user has chosen. If the user has set a low quality for himself, we also do not want to penalize ourselves, because it is his conscious choice. We add a third component to our minimum and get the final Upscale formula:

We compute this value for each moment of playback, and then take the average Upscale over the whole session and the maximum Upscale.
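
Putting the three components together, a sketch of the per-moment Upscale and the per-session aggregates could look like this; measuring everything as vertical resolution in pixels and clamping at zero are assumptions:

```python
def upscale_at(container_px, shown_px, max_quality_px, user_cap_px):
    """Upscale at one moment of playback: how much the shown rendition is stretched.
    We never penalize beyond the best rendition that exists for this content
    (MaxQuality) or the quality the user capped themselves to (User Capping)."""
    target = min(container_px, max_quality_px, user_cap_px)
    return max(0.0, target / shown_px - 1.0)  # twice the size -> Upscale = 1

def session_upscale(samples):
    """samples: (container, shown, max_quality, user_cap) for each moment of playback."""
    values = [upscale_at(*s) for s in samples]
    return sum(values) / len(values), max(values)  # average and maximum Upscale
```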

Now comes the Mux formula for calculating VQS, the video quality score. If Upscale was zero, the score is 100; the higher the Upscale, the lower the score.

We ran A/B experiments and watched users drop off. It turned out that when we show poor quality, people are actually willing to tolerate it for a little while, but we are not. That's why we set the boundaries at 95 VQS points for a yellow session and 87.5 or below for a red session; the latter corresponds to roughly a fifty percent stretch.
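
As an illustration only (the actual Mux formula is not reproduced here), a linear mapping consistent with the two reference points above, 100 at zero Upscale and roughly 87.5 at a 50% stretch, could look like this:

```python
def video_quality_score(avg_upscale):
    """Illustrative mapping, not the real Mux formula: a penalty of 25 points per
    unit of Upscale reproduces the reference points mentioned above
    (Upscale 0 -> 100, Upscale 0.5 -> 87.5)."""
    return max(0.0, 100.0 - 25.0 * avg_upscale)

def vqs_color(vqs):
    if vqs > 95.0:
        return "green"
    if vqs > 87.5:
        return "yellow"
    return "red"
```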

How is the final metric calculated?

We calculated six parameters for each session and came up with an ingenious formula for the overall session score.

If at least one parameter is red, we say that the entire session is red. If everything is green, the session is green. Everything in between is yellow. From this we build a daily graph of the distribution of red, yellow, and green sessions.
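
The aggregation rule itself, written out as a small sketch:

```python
def session_color(parameter_colors):
    """parameter_colors: the colors of the per-session parameters
    (initialization, interruptions, bounce, fatal errors, video quality, ...)."""
    if "red" in parameter_colors:
        return "red"
    if all(c == "green" for c in parameter_colors):
        return "green"
    return "yellow"  # no red, but not all green either
```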

We get a picture where we can see what is happening in any section that concerns us.

All developers can open this dashboard and see what exactly in their player has become better today than it was yesterday.

The advantages of this approach are simplicity and flexibility. Flexibility means that the boundaries we have chosen can be moved. Let's say we reach a point where all sessions are green because we managed to make initialization very short. Then we can move the green boundary and compete with ourselves again. What matters is not running faster than someone else, but being better than we were yesterday. Such is the Eastern philosophy.

If something goes wrong, we can open a page where we have all the evaluation parameters listed separately and find which one failed.

A small example from real life. One day this happened to our team: as we can see from the general dashboard, everything went bad.

Literally three seconds later we open the next page and find that it is the video quality metric that has deteriorated.

Within a few minutes it turned out that short videos were starting playback at very low quality. For a short video, low quality at the start produces a very large Upscale, which hits the metric hard. That is exactly what happened.

In the future we want to add some more things here: sound, network quality, user experience, dropped frames.
