Migrating mission-critical traffic at scale without downtime

Hundreds of millions of users log in to Netflix every day expecting a seamless, immersive experience. Behind the scenes, many systems and services are involved in delivering that experience, and these backend systems are constantly being improved and optimized to meet and exceed customer expectations.

When migrating systems, one of the main tasks is to ensure a smooth transition of traffic to the updated architecture without negatively impacting the user experience. In this post, we'll look at the tools, techniques, and strategies we used to achieve this goal.

The backend for the streaming product uses a highly distributed microservices architecture, so migrations can occur at different points in the service call graph. They can happen at the edge API system serving client devices, between edge and mid-tier services, or from mid-tier services to data stores. Another important factor is whether the migration involves stateful APIs or stateless, idempotent APIs that do not persist request data.

We have divided the tools and techniques used to facilitate these migrations into two high-level phases. The first phase involves validating functional correctness, scalability, and performance, and ensuring the robustness of the new system before migration. The second phase involves migrating traffic to the new system in a manner that reduces the risk of incidents, with ongoing monitoring to confirm that we are meeting critical metrics tracked at multiple levels: Quality-of-Experience (QoE) measurements at the client device level, Service-Level Agreements (SLAs), and business-level Key Performance Indicators (KPIs).

In this article, we'll take a closer look at replay traffic testing, a technique we have used during the first phase of several migration initiatives. In the second part of this series, we will focus on the second phase and look at some of the tactical steps we use to migrate traffic in a controlled manner.

Testing with replay traffic

By replay traffic, we mean production traffic that is cloned and forked onto a different path in the service call graph. This allows us to exercise new or updated systems under conditions that closely simulate production. With this testing strategy, we execute a copy (replay) of production traffic against both the existing and the new version of the system and run the appropriate validations. This approach has a number of advantages, chief among them that it exercises the new system with real production traffic patterns and at production scale, without putting the user experience at risk.

Traffic replay solution

A replay traffic testing solution consists of two main components.

  1. Traffic duplication and correlation: first, we need a mechanism to clone and fork production traffic onto the newly established path, together with a process for correlating the responses returned by the original and alternative paths (a minimal sketch of the kind of record being correlated follows this list).
  2. Comparative analysis and reporting: after duplicating and correlating the traffic, we need a framework to compare and analyze the responses from the two paths, along with a comprehensive report of the analysis results.
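
As an illustration, here is a minimal sketch, in Java, of the kind of record such a framework correlates. The field names are illustrative assumptions rather than an actual schema.

```java
// Minimal, illustrative record correlated by the replay framework: one entry is
// written per path, and entries are later joined on the correlation identifier.
public record ReplayRecord(
        String correlationId,   // same value on the production and replay paths
        String path,            // "production" or "replay"
        String requestUri,
        int statusCode,
        String responseBody,
        long capturedAtEpochMs) {
}
```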


Replay traffic testing framework

Over the course of our migrations, we have tried and refined several approaches to duplicating and recording traffic. These include generating replay traffic on the device, on the server, and via a dedicated service. We'll look at each of these approaches in the following sections.

Device-driven

In this option, the device makes a request on both the production path and the replay path, then ignores the response from the replay path. The two requests are made in parallel to minimize any potential latency on the production path. The choice of replay path on the backend can be driven by the URL the device uses for the request, or by specific request parameters in the routing logic at the appropriate level of the service call graph. The device also attaches a unique identifier with identical values on both paths, which is used to correlate the responses from the two paths. The responses can be recorded at the most convenient point in the service call graph or by the device itself, depending on the specific migration.
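
The sketch below illustrates the idea on the device side, assuming a hypothetical production endpoint, replay endpoint, and correlation header name; it is a sketch of the pattern, not actual client code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

// Hypothetical device-side fork: the same request is sent to the production and
// replay endpoints in parallel; only the production response is used.
public class DeviceReplayClient {
    private final HttpClient client = HttpClient.newHttpClient();

    public HttpResponse<String> fetch(String pathAndQuery) throws Exception {
        String correlationId = UUID.randomUUID().toString();

        HttpRequest productionRequest = HttpRequest.newBuilder(
                        URI.create("https://api.example.com" + pathAndQuery))
                .header("X-Replay-Correlation-Id", correlationId)   // assumed header name
                .GET().build();
        HttpRequest replayRequest = HttpRequest.newBuilder(
                        URI.create("https://replay.example.com" + pathAndQuery))
                .header("X-Replay-Correlation-Id", correlationId)
                .GET().build();

        // Fire the replay request asynchronously and ignore its outcome so it
        // cannot add latency or errors to the production path.
        client.sendAsync(replayRequest, HttpResponse.BodyHandlers.ofString())
                .exceptionally(ex -> null);

        // Only the production response drives the user experience.
        return client.send(productionRequest, HttpResponse.BodyHandlers.ofString());
    }
}
```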



Device-driven replay

The obvious disadvantage of the device-driven approach is that it consumes device resources. There is also a risk of impacting device QoE, especially on low-resource devices. Adding extensive logic and complexity to device code can create dependencies on application release cycles, which are typically slower than service release cycles, leading to migration bottlenecks. Moreover, allowing a device to exercise untested server-side code paths can potentially expose an additional attack surface.

Server-driven

To address the downsides of the device-driven approach, we have used another option in which replay concerns are handled entirely on the backend. Replay traffic is cloned and forked at the top-level service. The top-level service calls the existing worker services and the new replacement services concurrently to minimize any increase in latency on the production path. The top-level service records the responses from both paths along with an identifier with a common value, which is used to correlate the responses. This recording is also done asynchronously to minimize any impact on production-path latency.
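
A minimal sketch of this forking pattern follows; the service and logger interfaces are illustrative assumptions, not production interfaces.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical top-level-service fork: the production dependency is called on the
// request path, the replacement service is called in parallel, and both responses
// are logged asynchronously under one correlation id.
public class ReplayForkingService {
    interface BackendService { String handle(String request); }
    interface ResponseLogger { void log(String correlationId, String path, String response); }

    private final BackendService existingService;
    private final BackendService newService;
    private final ResponseLogger logger;
    private final Executor replayExecutor;   // isolated pool so replay work cannot starve production

    public ReplayForkingService(BackendService existing, BackendService replacement,
                                ResponseLogger logger, Executor replayExecutor) {
        this.existingService = existing;
        this.newService = replacement;
        this.logger = logger;
        this.replayExecutor = replayExecutor;
    }

    public String handle(String correlationId, String request) {
        // The replay call runs in parallel and never blocks the production response.
        CompletableFuture
                .supplyAsync(() -> newService.handle(request), replayExecutor)
                .thenAcceptAsync(r -> logger.log(correlationId, "replay", r), replayExecutor)
                .exceptionally(ex -> null);

        String productionResponse = existingService.handle(request);
        // The production response is logged asynchronously as well.
        CompletableFuture.runAsync(
                () -> logger.log(correlationId, "production", productionResponse), replayExecutor);
        return productionResponse;
    }
}
```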



Server-driven replay

The advantage of the server-driven approach is that all the complexity of the replay logic is encapsulated in the backend, and no device resources are consumed. Additionally, since this logic lives server-side, we can iterate on any required changes faster. However, we are still inserting replay-related logic alongside the production code that handles business logic, which can add unnecessary complexity. There is also an increased risk that bugs in the replay logic may affect production code and metrics.

Dedicated service

The final approach we have used is to isolate all replay traffic components into a separate dedicated service. With this approach, requests and responses for the service that is to be updated or replaced are recorded asynchronously into a self-contained event stream; quite often, such request/response logging is already in place for operational insight. We then use Mantis, a distributed stream processor, to capture these requests and responses and replay the requests against the new service or cluster, making any necessary adjustments to the requests along the way. After replaying the requests, this dedicated service also records the responses from the production path and the replay path for comparative analysis.
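
The sketch below captures the shape of this approach with generic stand-ins for the event stream and the record sink; it deliberately does not use the actual Mantis APIs, and the base URL and types are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Generic sketch of a dedicated replayer: consume logged request/response events,
// replay each request against the new cluster, and record both responses.
public class DedicatedReplayer {
    record LoggedEvent(String correlationId, String requestPath, String productionResponse) {}
    interface EventStream { Iterable<LoggedEvent> events(); }
    interface ReplaySink { void record(String correlationId, String productionResponse, String replayResponse); }

    private final HttpClient client = HttpClient.newHttpClient();
    private final String replayBaseUrl;   // the new cluster under test (assumed)

    public DedicatedReplayer(String replayBaseUrl) { this.replayBaseUrl = replayBaseUrl; }

    public void run(EventStream stream, ReplaySink sink) {
        for (LoggedEvent event : stream.events()) {
            try {
                // Rebuild the logged request (adjusting it if needed) and replay it
                // against the new service, then record both responses for analysis.
                HttpRequest request = HttpRequest.newBuilder(
                        URI.create(replayBaseUrl + event.requestPath())).GET().build();
                HttpResponse<String> replayResponse =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                sink.record(event.correlationId(), event.productionResponse(), replayResponse.body());
            } catch (Exception e) {
                // Replay failures are recorded but never affect production traffic.
                sink.record(event.correlationId(), event.productionResponse(), null);
            }
        }
    }
}
```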



Dedicated Replay Service

With this approach, the replay logic is concentrated in an isolated, dedicated codebase. Besides not consuming device resources or affecting device QoE, this approach also reduces the coupling between production business logic and replay traffic logic on the backend. It also decouples updates to the replay framework from device and service release cycles.

Analyzing replay traffic

Once we have run the replay traffic and recorded a statistically significant volume of responses, we are ready for comparative analysis and reporting. Given the scale of data generated by replay traffic, we record the responses from both paths in cost-effective cold storage using Apache Iceberg tables. We can then run offline, distributed batch processing jobs to compare the responses from the production path and the replay path and generate detailed reports of the analysis results.

Normalization

Depending on the nature of the system being migrated, responses may need to be pre-processed before they are compared. For example, if some response fields are timestamps, they will inevitably differ between the two paths and should be excluded. Likewise, if the responses contain unsorted lists, it is better to sort them before comparison. In some migration scenarios, the updated service or component intentionally changes the response; for example, a field that was a list on the production path may be represented as key-value pairs on the new path. In such cases, we apply specific transformations to the replay-path response to model the expected change. Depending on the system and its responses, other custom transformations may be applied before comparison.
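
As an illustration, here is a minimal normalization pass over a response modeled as a plain map; the field names ("timestamp", "streams") and the transformations are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative normalization applied to both the production and replay responses
// before they are compared.
public class ResponseNormalizer {
    @SuppressWarnings("unchecked")
    public Map<String, Object> normalize(Map<String, Object> response) {
        Map<String, Object> normalized = new LinkedHashMap<>(response);

        // Timestamps always differ between the two paths, so drop them.
        normalized.remove("timestamp");

        // Sort unordered lists so ordering differences do not count as mismatches.
        Object streams = normalized.get("streams");
        if (streams instanceof List<?> list) {
            List<String> sorted = new ArrayList<>((List<String>) list);
            sorted.sort(String::compareTo);
            normalized.put("streams", sorted);
        }
        return normalized;
    }
}
```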

Comparison

After normalization, we compare the responses from both paths and check whether they match. The batch job produces a high-level summary that captures key comparison metrics: the total number of responses recorded on each path, the number of responses joined by correlation identifier, and the counts of matches and mismatches. The summary also records the number of successful and failed responses on each path, giving a good picture of the analysis and of the overall match rate between the production and replay paths. Additionally, for mismatches, we record the normalized and non-normalized responses from both paths in a separate big data table, along with other relevant attributes such as the diff. We use this additional logging to debug and identify the root causes of the mismatches. Once we have identified and fixed those issues, we can iterate through the replay testing process to bring the mismatch rate down to an acceptable level.
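
The sketch below shows the kind of summary such a comparison job might produce; the record shapes and the mismatch log are illustrative assumptions, and the real jobs run as distributed batch processing over the stored responses.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Illustrative comparison over response pairs already joined by correlation id.
public class ReplayComparator {
    record Pair(String correlationId, Map<String, Object> production, Map<String, Object> replay) {}
    record Summary(long productionTotal, long replayTotal, long joined, long matches, long mismatches) {}
    interface MismatchLog { void record(Pair pair); }

    public Summary compare(long productionTotal, long replayTotal,
                           List<Pair> joinedPairs, MismatchLog mismatchLog) {
        long matches = 0;
        long mismatches = 0;
        for (Pair pair : joinedPairs) {
            if (Objects.equals(pair.production(), pair.replay())) {
                matches++;
            } else {
                mismatches++;
                // Mismatched pairs are written out with their details for debugging.
                mismatchLog.record(pair);
            }
        }
        return new Summary(productionTotal, replayTotal, joinedPairs.size(), matches, mismatches);
    }
}
```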

Lineage

When comparing responses, a common source of noise arises from the use of non-deterministic or non-idempotent dependency data when generating responses on the production and replay paths. For example, imagine a response payload that delivers media streams for a playback session. The service responsible for generating this payload consults a metadata service that provides all available streams for a given title. Streams may be added or removed for a variety of reasons, such as identifying problems with a particular stream, enabling support for a new language, or introducing new encodes. Consequently, the set of streams used to determine the payload may differ between the production and replay paths, which leads to mismatches in the responses.

To address this problem, we compile a complete set of data versions or checksums for all dependencies involved in generating a response, which we call a lineage. Mismatches caused by differing dependency data can then be identified and discarded by comparing the lineage of the production response and the replay response in the automated analysis jobs. This approach reduces the impact of noise and provides an accurate, reliable comparison between production and replay responses.
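
A minimal sketch of computing and checking such a lineage checksum follows; the inputs and hashing scheme are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.HexFormat;
import java.util.TreeMap;

// Illustrative lineage checksum: hash the versions of every dependency used to
// build a response; pairs with differing lineage are excluded from comparison.
public class Lineage {
    public static String checksum(Map<String, String> dependencyVersions) throws Exception {
        // Sort keys so the same set of versions always hashes to the same value.
        StringBuilder canonical = new StringBuilder();
        new TreeMap<>(dependencyVersions)
                .forEach((dep, version) -> canonical.append(dep).append('=').append(version).append(';'));
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }

    public static boolean comparable(String productionLineage, String replayLineage) {
        // Only response pairs generated from identical dependency data are compared.
        return productionLineage.equals(replayLineage);
    }
}
```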

Real-time traffic comparison

An alternative to recording responses and comparing them offline is real-time comparison. With this approach, we fork the replay traffic at the top-level service, as described in the “Server-driven” section. The service that forks and clones the replay traffic compares the responses from the production and replay paths directly and records the corresponding metrics. This option is feasible when the response payload is simple enough that the comparison does not significantly increase latency, or when the services being migrated are not on a critical path. Logging is done selectively, only for cases where the old and new responses do not match.
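
A minimal sketch of this inline comparison, with illustrative counters and a hypothetical mismatch logger, might look like this:

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative inline comparison inside the forking service: responses are
// compared on the spot, counters are updated, and only mismatches are logged.
public class LiveComparator {
    interface MismatchLogger { void log(String correlationId, String production, String replay); }

    private final AtomicLong matches = new AtomicLong();
    private final AtomicLong mismatches = new AtomicLong();
    private final MismatchLogger mismatchLogger;

    public LiveComparator(MismatchLogger mismatchLogger) { this.mismatchLogger = mismatchLogger; }

    public void compare(String correlationId, String productionResponse, String replayResponse) {
        if (Objects.equals(productionResponse, replayResponse)) {
            matches.incrementAndGet();
        } else {
            mismatches.incrementAndGet();
            // Selective logging: only mismatching pairs are persisted.
            mismatchLogger.log(correlationId, productionResponse, replayResponse);
        }
    }
}
```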



Replay traffic analysis

Stress testing

In addition to functional testing, replay traffic allows us to stress test the updated system components. We can regulate the load on the replay path by controlling the volume of traffic being replayed and the horizontal and vertical scaling of the new service. This lets us evaluate the performance of the new services under different traffic conditions: we can observe how availability, latency, and other system performance metrics, such as CPU utilization, memory consumption, and garbage collection rate, change as the load factor changes. Load testing with this technique lets us identify performance hotspots using real production traffic profiles. It helps expose memory leaks, deadlocks, caching issues, and other system problems, and lets us tune thread pools, connection pools, connection timeouts, and other configuration parameters. It also helps us define reasonable scaling policies, estimate the associated costs, and reason about the broader trade-offs between cost and risk.
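
As a simple illustration, the sketch below throttles a hypothetical replay driver to a target request rate; in practice the replay volume and the new service's scaling are adjusted together through the replay tooling rather than a loop like this.

```java
// Illustrative replay-rate throttle used to step load on the new cluster.
public class ReplayLoadDriver {
    interface ReplayCall { void replayOne(); }

    public void runAt(double requestsPerSecond, long totalRequests, ReplayCall call)
            throws InterruptedException {
        long pauseNanos = (long) (1_000_000_000L / requestsPerSecond);
        for (long i = 0; i < totalRequests; i++) {
            long start = System.nanoTime();
            call.replayOne();
            // Sleep off whatever time remains in this request's budget.
            long remaining = pauseNanos - (System.nanoTime() - start);
            if (remaining > 0) {
                Thread.sleep(remaining / 1_000_000, (int) (remaining % 1_000_000));
            }
        }
    }
}
```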

Stateful systems

We have used replay testing extensively to build confidence in migrations involving stateless and idempotent systems. Replay testing can also validate migrations involving stateful systems, but additional measures must be taken.

The production path and the replay path must have separate, isolated data stores that are in identical states before the traffic is replayed. In addition, all the different request types that drive the state machine must be replayed.

During the recording phase, in addition to the responses, we also want to capture the state associated with each response. Accordingly, in the analysis phase we compare both the response and the associated state of the state machine. Given the overall complexity of using replay testing with stateful systems, we employ other techniques in such scenarios; we will cover one of them in the next article in this series.
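
As a minimal illustration of comparing both the response and the associated state, the sketch below pairs each recorded response with a state snapshot and requires both to match; the snapshot shape is an assumption for the example.

```java
import java.util.Map;
import java.util.Objects;

// Illustrative check for stateful migrations: a response only counts as a match
// if the resulting state machine state also matches.
public class StatefulComparison {
    record Observation(Map<String, Object> response, Map<String, Object> stateSnapshot) {}

    public boolean matches(Observation production, Observation replay) {
        return Objects.equals(production.response(), replay.response())
                && Objects.equals(production.stateSnapshot(), replay.stateSnapshot());
    }
}
```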

Conclusion

At Netflix, we have used replay traffic testing for numerous migration projects. One recent case study was testing an extensive re-architecture of the edge APIs that drive the playback component of our product. Another example was migrating a mid-tier service from REST to gRPC. In both cases, replay traffic testing enabled comprehensive functional testing, load testing, and system tuning at scale using real production traffic. This approach allowed us to surface tricky issues and quickly build confidence in these substantial changes.

Once replay traffic testing is complete, we are ready to start rolling these changes out to production. In an upcoming article, we'll look at some of the techniques we use to roll out significant system changes to production incrementally, while managing risk and building confidence through metrics at multiple levels.
