Netflix Engineer’s Life Is A Case Of The Extra 40ms

6 min


The Netflix app works on hundreds of smart TVs, streaming remotes and Pay TV boxes. A partner engineer helps device manufacturers launch the Netflix app on their devices. In this article, I will address one particularly tricky issue that has blocked device startup in Europe.


How the weirdness began

Towards the end of 2017, I attended a communications conference where they discussed the issue with the Netflix app on a new set-top box. This box was a new Android device with 4K playback based on the Android Open Source Project (AOSP) 5.0 – “Lollipop”. I’ve dealt with multiple devices on Netflix, but this was my first Android TV.

All four market players participated in this conference: a large European pay TV company (operator), this company launched the device; the contractor who integrated the set-top box firmware (integrator); chip manufacturer; I represented Netflix.

The integrator and Netflix have already completed strict Netflix certification, but during an internal test of the operator of the TV channels, one of the company’s executives spoke about a serious problem: Netflix playback on his device was interrupted, that is, the video was played for a very short time, then it stopped, played and again interrupted. This did not always happen, but, as a rule, a few days after turning on the device. The bug looked terrible in the video shown.

The integrator found a way to reproduce the problem: you had to launch Netflix several times, start playing, then return to the device’s user interface. We were given a script to repeat the situation automatically. Sometimes it took five minutes to reproduce the situation with a script, but the bug always showed up.

Meanwhile, a field engineer at the chipmaker diagnosed the cause of the problem: Netflix’s Ninja Android-TV app was slowly feeding audio data. Video stopped due to depletion of a buffer in the audio unit pipeline. The video froze while the decoder waited for data from Ninja. When new data came in, the player came to life.

The integrator, chipmaker and operator decided that the problem had been identified and their message was clear to me: Netflix, this is a bug in your application, fix it. I heard tension in the voices. The set-top box was too late to hit the market, the device had exceeded its budget, and people were waiting for results.

Investigation

I was skeptical. The same Ninja app works on millions of Android TV devices, including smart TVs and other set-top boxes. If the bug is in Ninja, why does it only appear on Android 5.0?

With the help of a script from the integrator, I reproduced the problem myself. I contacted my colleague at the microcircuit seller, asked if he had seen something like this before (he had not seen it). Then I started reading the Ninja source code because I wanted to find the chunk that was passing the data. I did a lot of research, but got confused about the playback code and needed help.

Heading up the stairs, I found the engineer who wrote the audio and video pipeline in Ninja, and he took a tour of the code for me. I devoted some time to the source to understand its working parts; Also, to solidify my understanding, I added my own logging. The Netflix app is complex, but in the simplest case, it transmits data from a Netflix server, buffers a few seconds of video and audio data on the device, and then delivers video frames and audio samples to the playback hardware one at a time.

Figure 1 – Simplified playback pipeline

Let’s talk about the audio / video pipeline. Everything, up to the decoder buffer, is the same on all set-top boxes and smart TVs, but moving audio and video data to the decoder buffer is a device-dependent procedure and runs on its own stream. The job of the move routine is to keep the decoder buffer full through a Netflix API call, this call provides the next frame of audio or video data.

In Ninja, this work was done using Android Thread… There is a simple state machine and logic to handle different playback states, but on normal playback, the stream copies one frame of data to the Android playback API and then tells the thread scheduler to wait 15ms and call the handler again. When you create an Android thread, you can request that the thread be restarted as if in a loop; but this is the android thread scheduler and it calls the handler, not your own application.

To play a video at 60 frames per second (Netflix’s highest frame purity), the device must render a new frame every 16.66ms, so it checks for a new sample every 15ms. That’s enough time to keep ahead of any Netflix video stream.

The integrator determined that the problem was in the audio stream, so I focused on the specific stream handler that was delivering the audio samples to the Android audio service. Where are the extra milliseconds?

I assumed that some function called by the handler was to blame, so I wrote logging throughout the handler, assuming that the culprit would become apparent. It soon became clear that there was nothing wrong with the handler: even when playback was interrupted, it completed in a few milliseconds.

Insight

In the end, I noticed three numbers: the data transfer rate, the time the handler was called, and the time when it transfers control back to Android. I wrote a script to analyze the logging output and built a graph that answered my question.

Figure 2 – Visualization of audio bandwidth and synchronization of a stream processor

The orange line shows how quickly data moves from the streaming buffer to the Android audio system, in bytes per millisecond. Three behaviors can be seen in this diagram:

  1. Two high spiked parts where the baud rate reaches 500 bytes / ms. This is the buffering phase before starting playback. The handler copies the data as fast as it can.
  2. The area in the middle is normal playback. Audio data travels at about 45 bytes / ms.
  3. The stutter area is on the right, where audio data is moving at closer than 10 bytes / ms. It’s not fast enough for playback to continue.

The inevitable conclusion – the orange line confirms what the engineer of the chip manufacturer said: Ninja is slowly transferring data. To understand the reason, let’s see what the yellow and gray lines indicate. The yellow line shows the time spent in the handler subroutine itself, this time was calculated from the marks recorded at the top and bottom of the handler.
In both normal playback and stuttering playback, the time in the handler was the same: about 2ms. Peaks show cases where execution has slowed down due to the cost of other device tasks.

The root of the problem

The gray line, the time between calls to the handler, indicates otherwise. When the video is playing normally, you can see that the handler is called every 15ms. When the video is interrupted (right), the handler is called approximately every 55ms. There is an extra 40 ms between calls, which means that it is impossible to keep up with the playback. But why?

I told the integrator and chipmaker about my discovery (look, the Android thread scheduler is to blame!). But they continued to resist: why don’t you just copy the new data when the handler is called? The criticism was fair, but rewriting the code in this way would entail more changes than I was willing to make, so I decided to continue looking for the root cause.

I dived into the Android source code and learned that Android threads are a user-space construct, and the thread scheduler uses the epoll () system call to determine the time. The performance of epoll () is not guaranteed, so I suspected that something systemic was affecting this function.
And here I was saved by another engineer of a chip supplier who discovered a mistake; this bug was fixed in the next version of Android – Marshmallow. The Android Thread Scheduler changes the behavior of threads depending on whether the application is running in the background or in the foreground. Background threads have an additional latency of 40,000,000 ns. A bug deep in Android itself meant that extra time occurs when the thread is brought to the fore.

Usually the audio processor thread was created when the application was running in the foreground, but sometimes the thread was created a little earlier. This happened when the Ninja app was running in the background and then the player would stop.

Lessons learned

This was not the last bug fixed on this platform, but it was the one that was the most difficult to track down. The bug was hiding outside of Netflix, not in the playback pipeline; however, all raw data indicated an error in the Netflix app.

This story shows one of the sides of my favorite job: I cannot predict all the problems that our partners will throw at me, in addition, I know that in order to solve problems I need to understand several systems, work with wonderful colleagues and constantly force myself to find out what something new. My work directly affects people, their pleasure from a great product. I understand that when people enjoy Netflix in their living room, I am the person on the team who makes it possible.



0 Comments

Leave a Reply