In and out, a 20-minute adventure: a Yandex report

My name is Olya, I am a developer at Yandex Infrastructure, and I work on a web player – a library for playing video across various Yandex services (for example, Kinopoisk, Disk, Praktikum, and Pogoda).

This is the story of how we abandoned an open source solution for streaming video and reinvented the wheel. I will describe the architecture of our solution and what we ran into while implementing it. I will also show which experiments we conducted and which metrics we focused on.

Point A

To deliver content to the user, we use streaming technology. There are several generally accepted streaming standards that determine, for example, how the stream is divided into segments and distributed over the network. The most popular of them are HLS and DASH. Their main difference is that HLS needs two critical network round trips before the first frame (first the master playlist, then a media playlist), while DASH needs only one (the MPD manifest). For us, this was the key point in the choice, but there were also several additional considerations.

There is also a difference in who develops them. HLS is developed by Apple and is a closed standard. DASH is developed by the DASH-IF consortium, and in theory any of us can become part of this consortium and influence the standard.

But HLS has a huge, huge advantage: it is natively supported, including by Safari on iPhones and by some Android devices. DASH has a small but interesting advantage of its own: because it is a consortium-developed standard, the consortium has to get together now and then to discuss things and hang out, and they do this about every quarter.

So, we chose MPEG-DASH. We still love MPEG-DASH and keep trying to increase the share of DASH across our services. But we needed some way to actually play DASH: a library that implements the standard.

If you search for open source libraries that implement the standard, you will see a huge list. If you ask YandexGPT and ChatGPT, they will also give a long list.

But if you look at the list, you'll see that many libraries play DASH by means of other libraries. For example, Video.js has dash.js running under the hood. JW Player runs on top of Shaka Player. MediaElement.js, Flowplayer, and Kaltura Media Player also run on top of some other library. Cineast, Plex, and MSE are just odd things that show up in a search engine.

So we can type in different queries and get huge lists with some new names each time, but under the hood they come down to three main libraries: Shaka Player, dash.js, and the rarely remembered rx-player.

Dash.js vs Shaka Player vs rx-player

Okay, we have three libraries to choose from. How are they different?

  • Developer.

    • Dash.js is being developed by the same DASH IF consortium as an ideal reference client for the MPEG‑DASH standard.

• Shaka Player is developed by Google. And no, YouTube does not run on Shaka; it is a separate project. I think they keep this open source project to collect bug reports from the community, so they can see problems more broadly.

    • Rx-player is being developed by the French telecom Canal+.

  • Implementation of the standard.

    • Dash.js is a reference library, so it implements the entire MPEG‑DASH standard.

    • Shaka Player and rx‑player can afford to implement only part of the standard.

• Because of this, the dash.js build size differs significantly from the other libraries'. In the table, I compared production minified builds after gzip.

  • Support for other standards. All three support MSS. Shaka Player stands out here, it also supports HLS.

A few years ago, we really had two options to choose from: rx-player wasn't seriously considered, as it wasn't a very popular library and probably just hadn't caught our eye at the time. In the end, we chose Shaka Player, as it weighed less than dash.js. Here it's worth saying a little about Shaka before moving on to the consequences of this choice.

So, Shaka Player. Let me remind you that Shaka can play DASH, HLS, and MSS, though MSS only for VOD. It can play both VOD and Live (including low-latency live) and supports a variety of DRM systems: PlayReady, Widevine.

The library has its peculiarities. The first: its architecture is designed so that a decent share of its blocks can be swapped out via dependency injection. It is compiled and minified with Closure Compiler, which has quirks of its own: the library uses Closure's module system, so instead of the familiar imports and exports there are goog.provide and goog.require.

In addition, for a long time Shaka Player described its types with special JSDoc annotations, so TypeScript types had to be written by hand.

I will dwell on the architecture separately. It is quite simple, the main blocks are shown in the diagram taken from the Shaka website itself.

What to pay attention to:

  • Player is our entry point;

  • the associated AbrManager is responsible for ABR (Adaptive BitRate – dynamic quality selection during playback). It is separate from everything and is practically not connected to other blocks;

  • StreamingEngine is an engine that works with MSE;

  • NetworkingEngine — network layer;

  • DrmEngine for DRM implementation;

• ManifestParser, which parses the manifest;

  • Manifest is a block that manages tracks, monitors audio, video and subtitles.

That covers point A, where we started several years ago. So what didn't suit us in the library we chose?

What's wrong with Shaka Player and what we had to improve

Own network layer

We took advantage of the ability to replace Shaka's blocks and wrote our own custom network layer for fetching segments.

Why does this matter? Part of the DASH standard is a scheme with multiple BaseURLs, which exist so the player can switch to another edge server.

This is needed when the player tries to download a segment from a specific CDN and the CDN fails to deliver the data within a certain time, say 10 seconds. Perhaps something went wrong, the server is overloaded, or maintenance is underway. Fine – we go to another CDN for the data.

But Shaka had a quirk here. Say we go to the first CDN for the first segment and it doesn't respond, so we go to the second. When the time comes for the second segment, it would be nice to remember that something was wrong with the first CDN and go straight to the second. But no: Shaka doesn't remember that, this module has no state. It goes to the first CDN again, times out again while the user sits through a stall, and only then switches to the second. Essentially, we just made this stateless implementation stateful.
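To illustrate the idea – a minimal sketch under our own naming, not Shaka's code and not our production code – keep a short-lived ban list of failed BaseURLs and try healthy hosts first:

class BaseUrlSelector {
  // host -> timestamp until which the host is considered banned
  private bannedUntil = new Map<string, number>();

  constructor(private banMs = 60_000) {}

  reportFailure(baseUrl: string): void {
    this.bannedUntil.set(baseUrl, Date.now() + this.banMs);
  }

  // Order candidates so that recently failed CDNs go last.
  order(baseUrls: string[]): string[] {
    const now = Date.now();
    const healthy = baseUrls.filter(u => (this.bannedUntil.get(u) ?? 0) <= now);
    const banned = baseUrls.filter(u => (this.bannedUntil.get(u) ?? 0) > now);
    return [...healthy, ...banned]; // still fall back to banned hosts if all failed
  }
}

The ban expires on its own, so a recovered CDN starts getting traffic again after a minute.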

In recent years, a standard called Content Steering has emerged for managing CDN switching from the server side. It might seem to contradict the previous scheme, but it does not.

What does it add? In the previous scheme, switching was controlled by the client (that is, the player); in the new scheme a server-side part appears, and the backend can steer the player's switching. The player periodically polls the Steering Server and asks whether the list of current BaseURLs has changed. If the situation has changed, the player switches to the CDN recommended by the server.

We plan to take the best of both worlds: keep the state with banned hosts ourselves, and also ask the backend which hosts are currently preferable.
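Roughly, the polling loop looks like this. This is a sketch: the field names loosely follow the DASH-IF Content Steering draft and are assumptions, not our production protocol:

interface SteeringManifest {
  ttlSeconds: number;        // how long until the next poll
  pathwayPriority: string[]; // CDN identifiers in preferred order
}

async function pollSteering(
  steeringUrl: string,
  apply: (priority: string[]) => void,
): Promise<void> {
  const res = await fetch(steeringUrl);
  const manifest: SteeringManifest = await res.json();
  apply(manifest.pathwayPriority); // reorder BaseURLs on the client
  setTimeout(() => pollSteering(steeringUrl, apply), manifest.ttlSeconds * 1000);
}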

Location tag

For live streaming to work, the player must periodically re-request the manifest to learn about new parts of the video. According to the standard, the Location tag lets the server change the playlist URL. The problem was that Shaka read the value from the Location tag once and then ignored it on subsequent refreshes. We fixed that.
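The fix boils down to re-resolving the manifest URL on every refresh. A minimal sketch (the URL is hypothetical, and the namespace-less querySelector is a simplification):

const initialUrl = 'https://example.com/stream.mpd';
let manifestUrl = initialUrl;

async function refreshManifest(): Promise<Document> {
  const res = await fetch(manifestUrl);
  const mpd = new DOMParser().parseFromString(await res.text(), 'application/xml');
  // If the MPD carries a Location tag, every later refresh must use it.
  const loc = mpd.querySelector('Location')?.textContent?.trim();
  if (loc) {
    manifestUrl = new URL(loc, manifestUrl).toString();
  }
  return mpd;
}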

Preload

Our biggest pain point is preloading. We want some content to load before the user hits play.

A product example is binge watching: when a user is watching a TV series and the current episode is about to end, we preload the beginning of the next one. It's the technology that makes binge watching smooth.

But the Shaka Player architecture did not allow several streams to be handled at once. We had to build a rather hacky scheme of our own:

  • To play the current episode, a controller and the first Shaka Player instance are created.

• To preload the next episode, a PreloadController is created with a second Shaka Player instance. It is the one that fetches the second episode's data over the network and puts it in the cache.

• When switching to the next episode, we throw away everything that came before and create a new controller with a new Shaka Player instance and a new network layer. The new network layer takes the preloaded data from the cache; the cache itself is a singleton on the page.

The thing is that Shaka Player puts downloaded data straight into MSE. As a result, the page uses three video tags for one player:

• The first is for the current episode.

• The second is only for preloading. It is not needed as such and is never inserted into the DOM, but due to the Shaka architecture it is created and connected to MSE anyway. MSE gets loaded with data we are not yet going to show the user, the decoders spin up, and the device does useless work.

• The third is for the next episode.

At the moment of switching to the preloaded content, the player throws out the old Shaka Player instances and creates a new Controller, a new Shaka, and a new video tag.

Compare the Shaka Player approach with the hls.js one: there you can create an instance and load data separately, and at the right moment attach it to the video tag and hand the data over to MSE.

The result is a kind of virtual buffer into which we can pump data and only later pass it on. This is very cool and exactly what we wanted to get, just without HLS.
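In hls.js terms it looks roughly like this. A sketch: the URL is hypothetical, and how much media gets fetched before attach depends on the hls.js version and config:

import Hls from 'hls.js';

// Start loading the next episode before it is attached to any video tag.
const hls = new Hls();
hls.loadSource('https://example.com/next-episode.m3u8');

// Later, when the user actually switches episodes:
function play(video: HTMLVideoElement): void {
  hls.attachMedia(video); // only now does hls.js create MSE and feed it
  video.play();
}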

ABR

Another thing that didn't suit us was ABR. I talked about ABR algorithms in more detail in another report.

We took the opportunity to replace Shaka's ABR with our own, but there were nuances. The thing is that Shaka operates on variants – combinations of a video quality and an audio quality.

This is because Shaka works with two standards, DASH and HLS, and uses variants to unify work with them. In our custom AbrManager module, we had to decompose the variants back into separate arrays of audio and video qualities.
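The decomposition itself is straightforward. A sketch with simplified stand-ins for shaka.extern.Variant and Stream (the real types carry far more fields):

interface Stream { id: number; bandwidth: number; }
interface Variant { audio: Stream | null; video: Stream | null; }

function splitVariants(variants: Variant[]): { audio: Stream[]; video: Stream[] } {
  const audio = new Map<number, Stream>();
  const video = new Map<number, Stream>();
  for (const v of variants) {
    // The same stream appears in many variants, so dedupe by id.
    if (v.audio) audio.set(v.audio.id, v.audio);
    if (v.video) video.set(v.video.id, v.video);
  }
  return { audio: [...audio.values()], video: [...video.values()] };
}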

In addition, we noticed that changing the Shaka config resets ABR completely. There are comments in our code warning not to call certain methods too often, otherwise ABR and capping break. In other words, we had to spend human attention making sure we weren't doing anything strange.

Besides that, we were not satisfied with the BandwidthEstimator. It is not bad – it is the standard implementation generally accepted in open source libraries – it just wasn't enough for us.

And one more point that bothered us: ABR is invoked at moments that are arbitrary from our point of view. Shaka calls it when it sees fit, and we wanted to control that moment ourselves.

Shaka calls ABR:

  • at the time of initialization;

  • before loading the next segment;

• when the Shaka config changes – the point that did not suit us.

And that is not even the entire list – ABR is also called on:

  • change audio track;

  • select video track;

  • change DRM config;

• download timeout. If the server does not respond within a certain time when fetching the next segment, Shaka cancels the request and recalculates the quality. This happens when the user's network bandwidth has dropped and the previous choice no longer fits into it. We wanted to move away from this scheme in favor of the one described below.

We wanted to be able to poke ABR at any point during a download: throughout the entire segment download, keep asking ABR whether the quality currently downloading still suits us, or whether it is more profitable to switch to a lower quality and finish downloading faster. That is, we wanted to abandon the timeout in principle and keep the request for the segment of the chosen quality for as long as we consider necessary.
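As a sketch of what "asking ABR during the download" can look like (our real signatures differ; stillBestQuality is a hypothetical callback into the ABR):

async function downloadSegment(
  url: string,
  stillBestQuality: () => boolean, // consults ABR on every received chunk
): Promise<Uint8Array | null> {
  const controller = new AbortController();
  const res = await fetch(url, { signal: controller.signal });
  const reader = res.body!.getReader();
  const chunks: Uint8Array[] = [];
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    if (!stillBestQuality()) {
      controller.abort(); // give up mid-flight; the caller switches down
      return null;
    }
  }
  // Concatenate the chunks into one segment.
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) { out.set(c, offset); offset += c.length; }
  return out;
}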


Low latency

Another point that was unsatisfying in Shaka was low latency. When we needed broadcasts with low latency, Shaka only had raw, MVP-grade groundwork for it. My colleagues have already told the story of how we built Low Latency, so I'll briefly cover what we had to do in the player to make this scheme work.

Our Low Latency works on top of CMAF. In a typical setup, a segment contains one moof, an MP4 atom with metadata, and one mdat, an MP4 atom with frames.

A CMAF segment is a chunked segment: it contains several moof and several mdat atoms, that is, several MP4 atoms carrying frames. This is done so that frames can be handed to MSE before the entire segment finishes downloading. To do that, we need to read data from the stream as it arrives – and our by-then-forked version of Shaka Player could not do that yet, so we had to bring it in ourselves.

The data does not arrive over the network as complete MP4 atoms, and you can't just push part of an MP4 atom into MSE – if the atom is incomplete, MSE simply cannot process it. So we had to implement additional MP4 parsing logic in the network layer to handle streaming data correctly.
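The core of that logic is splitting the byte stream on top-level box boundaries. A simplified sketch (64-bit "largesize" boxes are ignored for brevity):

function createBoxSplitter(onBox: (box: Uint8Array) => void) {
  let pending = new Uint8Array(0);
  return (chunk: Uint8Array): void => {
    const merged = new Uint8Array(pending.length + chunk.length);
    merged.set(pending);
    merged.set(chunk, pending.length);
    let offset = 0;
    while (merged.length - offset >= 8) {
      const view = new DataView(merged.buffer, offset);
      const size = view.getUint32(0); // big-endian box size, first 4 bytes
      if (size < 8 || merged.length - offset < size) break; // box incomplete
      onBox(merged.subarray(offset, offset + size)); // e.g. a moof or an mdat
      offset += size;
    }
    pending = merged.slice(offset); // keep the unfinished tail for next time
  };
}

Only complete boxes are then forwarded to MSE.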

Then it turned out that if we start playing content before the entire segment is downloaded, our BandwidthEstimator breaks. We had to account for segment preparation time: we may be waiting for data not because the user's network is bad, but because the data is not ready yet. For this, my colleagues devised a scheme where one more MP4 atom, a UUID atom, is added, carrying the server time. We taught the player to read this atom and pass the value to the BandwidthEstimator.

And I had to tweak ABR a little, adjusting the smoothing coefficient α in the EWMA formulas used to estimate network throughput.
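For context, the estimator's core is the standard exponentially weighted moving average; a smaller α makes it react more calmly to the noisier samples that chunked transfer produces. A sketch, not our exact formula set:

// estimate_new = α · sample + (1 − α) · estimate_old
function ewma(prevEstimate: number, sample: number, alpha: number): number {
  return alpha * sample + (1 - alpha) * prevEstimate;
}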

But that's not all. While implementing Low Latency, we introduced several bugs of our own.

Bugs We Created

The first bug: our backend sometimes returns empty mdat atoms, that is, MP4 atoms that should contain frames but don't. Although this is acceptable by the standards, Safari cannot cope with it. As a result, we had to rework our MP4 atom parsing.

The second bug was more unpleasant. At some point we let in a race condition that occasionally appended the init segment of the wrong track to MSE. For example, we planned to play 1080p, but the 720p init segment ended up in MSE. The picture fell apart, and the user saw a green screen.

To solve this problem, we had to change a lot: we added a lot of logs to Shaka Player, built a custom version, integrated it into our player, and ran a series of experiments. This process turned out to be extremely labor-intensive and complex.

One more point. We implemented server capping – a mechanism through which the backend, via the manifest, tells us the quality above which we must not play.

When many users are watching, say, the World Cup, our backend is under heavy load. In such situations it tells us that auto quality must not exceed 720p. The user can still manually select a higher quality, but the player's ABR will not pick it automatically. The catch is that playlist parsing then happens both in Shaka Player and in our AbrManager implementation, so we do the work twice.
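Applied to the quality list, the cap itself is a simple filter. A sketch (the types and the maxAutoHeight parameter are illustrative):

interface VideoQuality { height: number; bandwidth: number; }

// Restrict what ABR may choose automatically; manual selection stays free.
function applyServerCap(qualities: VideoQuality[], maxAutoHeight: number): VideoQuality[] {
  const allowed = qualities.filter(q => q.height <= maxAutoHeight);
  // If the cap excludes everything, fall back to the lowest available quality.
  return allowed.length > 0
    ? allowed
    : [qualities.reduce((a, b) => (a.height < b.height ? a : b))];
}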

But that's not all.

What We Couldn't Do in Shaka

The first is MSE in Web Worker.

The player is a fairly heavy library that noticeably loads the user's page. It would be great to move tasks such as requesting segments, downloading and parsing the manifest, parsing segments, and feeding data to MSE into a parallel thread via a Web Worker.

This became possible only recently. In 2020–2021, demo pages with MSE in a Web Worker appeared. In Chrome 105 the feature shipped behind an experimental flag, and starting with Chrome 108 it is available by default.
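The handshake between the page and the worker is small. A sketch of the standard pattern (the file name is hypothetical):

// main.ts: feature-detect, then let the worker own the MediaSource.
if ((MediaSource as any).canConstructInDedicatedWorker) {
  const worker = new Worker('mse-worker.js');
  worker.onmessage = (e: MessageEvent) => {
    const video = document.querySelector('video')!;
    video.srcObject = e.data.handle; // MediaSourceHandle sent by the worker
  };
}

// mse-worker.js, running off the main thread:
//   const ms = new MediaSource();
//   postMessage({ handle: ms.handle }, [ms.handle]); // transfer the handle
//   ms.addEventListener('sourceopen', () => {
//     /* addSourceBuffer() and appendBuffer() happen here, off the page */
//   });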

The demo is available via QR code and link: it simulates a high-load page

Using MSE in a Web Worker looks very promising, but there are no plans to support it in Shaka Player yet. Earlier this year, a developer asked on GitHub about plans for MSE in Web Worker and got the answer: if you need this feature, open a pull request.

Another thing we wanted to implement was a virtual buffer and caching: the ability to download two, four, ten minutes, or even a whole movie for offline viewing. This isn't possible, because Shaka appends data straight into MSE and so inherits all the limitations of that API. I think many are familiar with the table of per-browser MSE buffer quotas:

The browser limits how much data we can put into MSE. Yes, that is enough for five minutes of 720p video with good sound, but it is not enough for us. For example, this became one of the obstacles to implementing BBA (Buffer-Based Approach) – an adaptive bitrate algorithm driven by buffer fullness rather than network bandwidth. We would like a scheme with a virtual buffer into which we can download data in any volume at our discretion, and transfer it to MSE later.
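The idea of such a virtual buffer, very schematically (a real one also needs eviction, per-track bookkeeping, and byte accounting): segments live in our own memory, and only a small window around the playhead ever reaches MSE.

interface BufferedSegment { startTime: number; endTime: number; data: Uint8Array; }

class VirtualBuffer {
  private segments: BufferedSegment[] = [];

  push(seg: BufferedSegment): void {
    this.segments.push(seg); // no MSE quota here, only heap size limits us
  }

  // Feed MSE only segments within windowSec of the playhead.
  // Call again on each 'updateend': appendBuffer sets sb.updating to true.
  drainInto(sb: SourceBuffer, currentTime: number, windowSec = 30): void {
    while (
      this.segments.length > 0 &&
      this.segments[0].startTime < currentTime + windowSec &&
      !sb.updating
    ) {
      sb.appendBuffer(this.segments.shift()!.data);
    }
  }
}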

Another point is buffer swapping. The case is rare, but worth handling too. Imagine a user watching the player in a small container who, thanks to a good Internet connection, has managed to buffer a significant amount of data. Then the user switches to full-screen mode.

It would be great if, at the moment of switching to full screen on a good connection, we could re-download the buffer in a quality matching the new screen size, throwing the current buffer away. However, this is difficult to implement while Shaka Player loads data directly into MSE: it would have meant patching the network layer yet again, with unknown side effects.

We would also like more flexible ABR rules that take into account not only the network state but other parameters as well – for example, TTFB (Time To First Byte) or the degree of buffer fullness. In other words, we want ABR to offer more room for customization and more input data.

In the end, Shaka Player is a great library, but we outgrew it.

We were not happy that Shaka operated on variants. We were not happy that it loaded content straight into MSE, which exposed us to all the limitations of that API and kept us from swapping the buffer or having a virtual buffer and caching. Shaka Player has no MSE-in-WebWorker, and we really wanted it. Our network layer was already patched on top of patches and would have to be developed further. And since we had accumulated so much of our own, updating to a new version was never painless – it was extremely painful. You can guess that when Yandexoids are unhappy with an open solution, they build their own bicycle.

YaSP

We made Yet another StreamPlayer, or YaSP for short.

Requirements for the new engine

When we designed the engine, we had a few key features that we wanted to implement:

• We wanted our video ads to be able to work with the player as easily as with a video tag. That is, we wanted to extend the video tag so that it plays the DASH standard – the way Safari's video tag natively plays HLS.

  • Preloading.

  • We wanted to move most of the code into a separate thread using MSE-in-WebWorker.

  • Virtual buffer.

• Buffer swapping.

A little about the video tag extension. Our ads run on the VPAID specification, which is supposed to work with a bare video tag. So we wanted to mimic a video tag that simply supports DASH playback, to get streaming video into the ads. And of course, we wanted the code to be universal across components, staying compatible with both native HLS in Safari and MSE in other browsers.

Architecture

How did we turn all these wishes into an architecture? We needed a week: we sat down in another city, studied the materials, and drew various diagrams. Here is what we ended up with.

Here is the front end – the part that lives on the main browser page. It includes the video tag, from which the YaSP video element inherits, and the front end of the RPC bus, which connects the player's front end with the component running inside the Web Worker. All the interesting stuff is inside the Web Worker.

The entry point there is Worker. Inside is a VideoState block that synchronizes the state of the video tag with the components inside the Web Worker. The key elements are SourceManager and Source: our engine was designed from the start to work with multiple sources and support preloading, and SourceManager controls the number of sources, preventing memory leaks.

MSEEngine is also here and interacts with MSE. The virtual buffer lets us download the required amount of data and clear it in a timely manner. Fetcher is the network layer, the Manifesto block handles manifest parsing, Timeline manages audio and video tracks and subtitles, and AbrManager drives the adaptive bitrate. The diagram shows the ownership relationships between the main blocks.
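The RPC bus mentioned above can be pictured as a promise-based request/response layer over postMessage. A minimal sketch (the message shape and method names are illustrative, not our actual protocol):

interface MessagePortLike {
  postMessage(msg: unknown): void;
  addEventListener(type: 'message', cb: (e: MessageEvent) => void): void;
}

type Pending = { resolve: (v: unknown) => void; reject: (e: unknown) => void };

class RpcBus {
  private nextId = 0;
  private pending = new Map<number, Pending>();

  constructor(private port: MessagePortLike) {
    port.addEventListener('message', (e: MessageEvent) => {
      const { id, result, error } = e.data ?? {};
      const p = this.pending.get(id);
      if (!p) return; // not a response to one of our calls
      this.pending.delete(id);
      error ? p.reject(error) : p.resolve(result);
    });
  }

  call(method: string, params?: unknown): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.port.postMessage({ id, method, params });
    });
  }
}

// e.g. on the page: await bus.call('source.load', { url: manifestUrl });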

I wanted to compare our engine architecture with Shaka's, but it didn't fit on the slide. So in the illustration below, Shaka is shown in the upper left corner, and the part of YaSP running inside Web Worker is shown in the lower right corner. These two systems have similar modules, such as AbrManager and TextEngine.

Shaka's ManifestParser and our Manifesto both parse playlists, and Fetcher and NetworkingEngine are the network layers. The main difference is that Shaka has no Source component: it is designed to work with a single source – roughly, a single episode. YaSP, in contrast, supports multiple sources and can preload content.

Pitfalls

Holes. What difficulties did we run into while implementing all this beauty? First, it turned out that streams have holes. Shaka handled this phenomenon well, but in our own engine we had to solve the problem ourselves.

What is a “hole” in the stream? It is a situation where the end of the current segment does not coincide with the beginning of the next one. Sometimes these gaps are small, and the browser itself skips them and continues playback – the favorable scenario. But sometimes the holes are large enough that the browser cannot get over them, despite having data ahead. Playback then hangs in the buffering state, and we have to seek over the gap manually.

Reasons for the appearance of “holes”:

  1. Misalignment of segments at period boundaries in a playlist.

  2. Mismatch between the length of the content declared in the manifest and the actual length of the segment. This problem is much more complex, as it can arise due to rounding errors when transcoding or repackaging the content.

It is impossible to get rid of “holes” in the stream completely. It is easier to learn to jump over them with a GapJumper.

We artificially seek over content when all of the following conditions hold (a sketch follows the list):

• The video tag is in the buffering state – that is how we detect it is stuck and cannot continue playing.

• There is data ahead to seek to.

• And, crucially, no user seek is in progress. GapJumper and seeking interact in a tricky way: when the user seeks back a little, from GapJumper's point of view it looks like a hole – there is no data at the current position, we are in the buffering state, and there is data ahead to jump to. The solution we found: on a seek, we clear the buffer and thus eliminate the phantom hole.
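A minimal sketch of the jumping part (the threshold is illustrative; a real GapJumper also checks the conditions above):

const video = document.querySelector('video')!;

function tryJumpGap(maxGapSec = 0.5): void {
  const { buffered, currentTime } = video;
  for (let i = 0; i < buffered.length; i++) {
    const start = buffered.start(i);
    // Data strictly ahead, and the hole is small enough to hop over.
    if (start > currentTime && start - currentTime <= maxGapSec) {
      video.currentTime = start + 0.01; // land just inside the next range
      return;
    }
  }
}

// Typically invoked when the tag reports a stall:
video.addEventListener('waiting', () => tryJumpGap());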

We also had a problem with looped videos. It turned out that when the player finishes a video and loops back to the beginning, GapJumper perceives that as a gap too. The solution was quite simple: we forbade jumping further than one segment.

Another problem we encountered was the desynchronization of audio and video downloads. Video is usually much heavier than audio, so the audio track can run far ahead of the video track. For us, audio without video is useless and just takes up bandwidth that could be spent downloading video faster. The solution here was also simple: we limited each track's download so that it is never more than one segment ahead of the other. This way, audio is not downloaded further than one segment past the current video segment, and vice versa.
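The scheduling rule fits in a few lines. A sketch (the amounts are seconds of buffered content ahead of the playhead; the names are illustrative):

function nextTrackToLoad(
  audioAheadSec: number,
  videoAheadSec: number,
  segmentDurationSec: number,
): 'audio' | 'video' {
  // Never let one track run more than a segment ahead of the other.
  if (audioAheadSec - videoAheadSec > segmentDurationSec) return 'video';
  if (videoAheadSec - audioAheadSec > segmentDurationSec) return 'audio';
  // Otherwise top up whichever buffer is emptier.
  return audioAheadSec <= videoAheadSec ? 'audio' : 'video';
}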

DRM

Another thing that worked well in Shaka and that we had to implement ourselves was DRM.

It turned out that the EME (Encrypted Media Extensions) specification is not complete and clear enough: there were questions about how to properly destroy the CDM (Content Decryption Module) and how to switch between different DRM systems. We tried to make DRM more convenient for our customers by offering a synchronous API, but we ran into problems on TVs. For example, on some operating systems, such as VIDAA, we had difficulties with the code below.

if (buffered.length > 0) {
  // The check passes, but on VIDAA the ranges can vanish right after it,
  // and this access throws: buffered length === 0
  buffered.end(0);
}

We check for buffered ranges – they seem to be there. However, when we try to access them, an exception is thrown: it turns out they no longer exist. We still haven't figured out how to fix this, so for now we've abandoned the synchronous API and kept asynchronous calls.

In the end, our confidence wavered a little. We had thought that over four years we had accumulated enough knowledge to build our own player. As it turned out, we understood well what wasn't working, but we had overlooked many things that were already working perfectly. We had to implement those things ourselves – and we ran into the same problems the Shaka developers had encountered before us.

Experiments and metrics

To make our assessment of all the experiments we conducted clearer, I will first tell you what metrics we look at:

1. Errors: fatal and non-fatal. Because our engine kept changing, the way we log errors also changed periodically, so there were technical glitches: sometimes we lost errors, sometimes we got a technical spike in errors, and all of that had to be filtered out.

  2. The number and length of buffering. These parameters allowed us to understand that we have big problems with holes and with GapJumper.

  3. Launch speed. It was this that showed us that MSE in Web Worker also makes sense.

  4. TVT: This is the only product metric we look at here.

5. QoE. My colleague covered what goes into our Quality of Experience metric in a separate report.

6. Traffic. I talked about how we save traffic at the previous VideoTech. When switching to the new engine, our task was not to lose the gains we had already made, so we watched traffic as well.

Now a little internal terminology. At Yandex, an experiment is considered green if the metrics that should fall do fall and those that should grow do grow; gray if nothing really changes and there are no statistically significant shifts; and red if something grows or falls that shouldn't.

So how did we land our project? We were aiming for gray relative to the patched Shaka. You might have hoped to hear that we launched one experiment and it showed incredible profit: 20% faster startup, 50% TVT growth, all that cool stuff, hair soft and silky. But no. Our goal was to at least not worsen the user experience.

And one more thing. There was no single experiment, we assessed the results in several stages.

First, we experimented with regular VOD content. Then we added DRM (Digital Rights Management) support to ensure video protection. Since Kinopoisk is a paid service, we started experimenting with it only after debugging the work with regular content without DRM. Then we connected live broadcasts, and thus we had VOD, DRM VOD, Live, and DRM Live. Finally, we implemented support for Low Latency mode.

What else helped us? The dictatorship of unit test coverage. Our manager introduced a mandatory Jest coverage check before merging a pull request, which forced developers to read the code more carefully, looking for places where they could add unit tests.
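Such a gate is usually just a coverage threshold in the Jest config. A sketch (the numbers are illustrative, not our real thresholds):

// jest.config.ts
import type { Config } from 'jest';

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    // The run (and therefore the merge check) fails below these floors.
    global: { branches: 80, functions: 80, lines: 80, statements: 80 },
  },
};

export default config;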

We were also lucky: our Shaka-based player had good integration test coverage. When we switched these tests to the new YaSP engine, many bugs were discovered. Integration tests helped to identify small cases that might have gone unnoticed during experiments.

Conclusions

Some key points:

• Improved Quality of Experience by moving MSE into a Web Worker: the share of “green” (good) sessions grew by 0.75%.

• Fewer “red” sessions: users who previously experienced serious problems no longer do.

• We laid the groundwork for many optimizations: we can now create various ABR rules tied to the network and the buffer.

• We implemented a virtual buffer, which lets us buffer large amounts of data.

However, there were difficulties with the timeframe: we expected to complete the project in six months, but in the end it took two years.

As part of the retrospective, I asked the executives if they would have taken on this project if they knew it would take two years. And the answer was no, they wouldn't.

What I can say to developers: if you face a similar problem and the business needs super-custom things, you most likely have no choice. You will have to walk this path and face the same difficulties. Later, perhaps, you too will speak at a conference and share your experience. But if the thought “why don't we write our own engine?” has merely crossed your mind, I can only warn you.

That's all, I'm waiting for your questions. Subscribe to our Telegram channel, where my colleagues and I write about streaming and our research.

*Illustrations feature characters from the Rick and Morty series by Justin Roiland & Dan Harmon, The Cartoon Network, Inc.
