Server-side rendering at scale

To improve the performance of its React-based frontend pages, Yelp uses server-side rendering. After a number of production incidents in early 2021, when many pages were being ported from Python-based templates to React, it became clear that the existing server-side rendering system did not scale.

Over the rest of the year, we redesigned this system to increase stability, reduce costs, and improve observability for feature teams.


What is “server rendering”?

Server-side rendering is a technique for improving the performance of JavaScript templating systems (such as React). Instead of waiting for the JavaScript bundle to load and then rendering the page on the client side, we render the page’s HTML on the server side and, once the HTML has loaded, attach the dynamic behavior on the client side (hydration).

This approach trades a larger data transfer size for faster rendering, because our servers are usually faster than the client’s computer. In practice, we have found that this significantly improves the time it takes for a page’s main content to appear (Largest Contentful Paint).

Status quo

We prepare components for server rendering by bundling them, together with an entry point function and any other dependencies, into a separate JS file. The entry point then uses ReactDOMServer, which accepts component props and produces rendered HTML. These server rendering packages (bundles) are uploaded to S3 as part of our continuous integration process.

In the old server rendering system, the service downloaded and initialized the latest version of every package at startup, so that it was ready to render any page without waiting on S3 in the critical path.

Then, depending on the incoming request, the appropriate entry point function was selected and called. This approach comes with a number of problems:

  • Loading and initializing every package significantly increased service startup time, making it difficult to respond quickly to scaling events.

  • Because the service held every package, it required a large amount of memory. Each time we scaled out and deployed a new instance of the service, we had to allocate memory equal to the sum of every package’s source code and runtime memory usage. In addition, all packages were served from the same instance, making it difficult to measure the performance characteristics of a single package.

  • If a new version of a package was uploaded between service restarts, the service would not have a copy of it. We solved this problem by dynamically downloading missing packages as needed.

    To avoid holding too many dynamic packages in memory at once, we used a cache with least recently used (LRU) eviction.
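As an illustration of the eviction policy in the last bullet, here is a minimal sketch of an LRU cache built on a JavaScript Map. The class name, its methods, and the package names are hypothetical and are not taken from the original service; it only shows the general technique.

```javascript
// Minimal LRU cache sketch. `LruCache` is a hypothetical name used for
// illustration; it is not the cache used by the original service.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map(); // a Map iterates in insertion order
  }

  has(key) {
    return this.entries.has(key);
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert the entry so it becomes the most recently used one.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}

// Usage: with room for two packages, loading a third evicts the one
// that was requested least recently.
const packages = new LruCache(2);
packages.set('search-page', 'initialized package A');
packages.set('biz-page', 'initialized package B');
packages.get('search-page');          // touch A, so B is now the oldest
packages.set('home-page', 'initialized package C');
console.log(packages.has('biz-page')); // false
```

A cache like this bounds memory, but every eviction means the package has to be downloaded and initialized again the next time it is requested.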

The old system was based on Airbnb’s Hypernova service. Airbnb has published its own blog post about the problems with Hypernova. The main problem is that rendering components blocks the event loop, which can cause several Node APIs to fail in unexpected ways.

We ran into a similar problem: blocking the event loop broke Node’s HTTP request timeout functionality, which greatly increased request latency when the system was already overloaded.

Any server rendering system should be able to minimize the effects of blocking the event loop when rendering components.
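To see why timeouts stop working, consider this small sketch. It is not code from the service, and `measureTimerDelay` is a hypothetical helper: it schedules a short timer and then keeps the thread busy with synchronous work, the way a long render would. The timer callback cannot run until the synchronous work ends.

```javascript
// Schedules a timer for `timerMs`, then blocks the event loop for
// `blockMs` with a busy loop (a stand-in for a CPU-bound render).
// Resolves with the number of milliseconds the timer actually took.
function measureTimerDelay(blockMs, timerMs = 10) {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), timerMs);

    const blockUntil = Date.now() + blockMs;
    while (Date.now() < blockUntil) {
      // Synchronous work: no timers or I/O callbacks can run here.
    }
  });
}

measureTimerDelay(200).then((elapsed) => {
  console.log(`timer set for 10 ms fired after ${elapsed} ms`);
});
```

A request timeout implemented with a timer is delayed in exactly the same way, so it can only fire after the render it was supposed to cut short.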

These issues escalated in early 2021 as the number of server rendering packages on Yelp continued to grow:

  • The service took so long to start up that Kubernetes began marking instances as unhealthy and restarting them automatically, so they could never become healthy.

  • The service’s huge heap size caused significant garbage collection problems. By the end of the old system’s life, we were allocating almost 12 GB of heap space to it.

    In one experiment, we determined that we could not serve more than 50 requests per second because of the time lost to garbage collection.

[Figure: Request response delay]

  • Thrashing of the dynamic package cache, with packages frequently evicted and re-initialized, caused heavy CPU load, which began to affect other services running on the same machine.

All of these issues led to degraded performance of the Yelp frontend and to several incidents.

System Rebuild Goals

Having dealt with these incidents, we began to rebuild the server rendering system. The goals were stability, observability, and simplicity. The new system should function and scale without much manual intervention.

It must provide seamless observability, both for the infrastructure maintenance teams and for the feature teams that own the packages. The design of the new system should be easy to understand for future developers.

In addition, we have chosen a number of specific functional goals:

  • Minimize the effects of blocking the event loop to ensure that mechanisms such as request response timeouts work correctly.

  • Sharding (segmentation): service instances are divided by package, so that each package has its own dedicated resources. This reduces the overall amount of resources required and makes it easier to observe the performance of a specific package.

  • Fast rejection of requests that cannot be processed quickly. If we know that a request will take a long time to complete, the system should fall back to client-side rendering immediately rather than wait for server-side rendering to time out. This gives the user the fastest possible experience.


Language selection

When it came time to implement the server-side rendering service, we had several languages to choose from, including Python and Rust. From an internal ecosystem point of view, Python would have been ideal.

But we found that the V8 bindings for Python were not production-ready and would have required significant investment before we could use them for server rendering.

We then evaluated Rust. It has high-quality V8 bindings, which are already used in popular, production-ready projects such as Deno.

However, all of our server rendering packages rely on Node runtime APIs, which are not part of plain V8, so we would have had to re-implement significant portions of them to support server rendering. This, together with the general lack of Rust support in the Yelp developer ecosystem, kept us from using it.

As a result, we decided to rewrite the server rendering service in Node: Node provides a V8 VM API that lets developers run JS in isolated V8 contexts, it is well supported in the Yelp developer ecosystem, and it lets us reuse code from other internal Node services, leaving less implementation work to do.
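The following is a minimal sketch of what running a package in an isolated V8 context can look like with Node’s built-in `vm` module. The package source here is a made-up stand-in (a real package would call ReactDOMServer), and `loadPackage` is a hypothetical helper name, not part of the service described in this article.

```javascript
const vm = require('node:vm');

// Hypothetical stand-in for a server rendering package. The script keeps
// its own global state and evaluates to its entry point function.
const packageSource = `
  var renderCount = 0;
  (props) => {
    renderCount += 1;
    return '<h1>Hello, ' + props.name + '</h1>';
  };
`;

// Runs the source in a fresh V8 context and returns the entry point
// together with the object that backs the context's globals.
function loadPackage(source) {
  const sandbox = {};
  vm.createContext(sandbox);
  const entryPoint = new vm.Script(source).runInContext(sandbox);
  return { entryPoint, sandbox };
}

const { entryPoint, sandbox } = loadPackage(packageSource);
console.log(entryPoint({ name: 'Yelp' })); // <h1>Hello, Yelp</h1>
console.log(sandbox.renderCount);          // 1
console.log(typeof renderCount);           // undefined
```

Globals defined by the package live in its own context, so they do not leak into the service’s global scope. Note that the `vm` module isolates scope only; it is not a security boundary for untrusted code.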


The server rendering service consists of a main thread and many worker threads. Node worker threads differ from OS threads in that each thread has its own event loop and memory cannot be simply shared between threads.

When an HTTP request arrives on the main thread, the following happens:

  1. The service checks whether the request should be rejected immediately based on the “timeout factor”. This factor currently includes the average render time and the current queue size, but it could include other metrics such as CPU usage and throughput.

  2. The request is added to the render worker thread pool queue.

When a request arrives on a worker thread, the following happens:

  1. Server rendering is performed. This blocks the event loop, but that is acceptable because each worker thread processes only one request at a time. Nothing else should need the event loop while this CPU-bound work is running.

  2. The rendered HTML is returned to the main thread.

When a response arrives from the worker thread to the main thread, the rendered HTML is returned to the client.
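The round trip described above can be sketched with Node’s built-in `worker_threads` module. This is an illustration only: it uses a single worker instead of a pool, the “render” is a trivial string template rather than ReactDOMServer, error handling is omitted, and `createRenderer` is a hypothetical name.

```javascript
const { Worker } = require('node:worker_threads');

// Code that runs inside the worker thread. It has its own event loop,
// so blocking it does not block the main thread.
const workerSource = `
  const { parentPort } = require('node:worker_threads');
  parentPort.on('message', ({ id, props }) => {
    // CPU-bound rendering would happen here.
    const html = '<p>' + props.text + '</p>';
    parentPort.postMessage({ id, html });
  });
`;

function createRenderer() {
  const worker = new Worker(workerSource, { eval: true });
  const pending = new Map(); // request id -> resolve callback
  let nextId = 0;

  // Rendered HTML comes back to the main thread as a message.
  worker.on('message', ({ id, html }) => {
    pending.get(id)(html);
    pending.delete(id);
  });

  return {
    render(props) {
      return new Promise((resolve) => {
        const id = nextId++;
        pending.set(id, resolve);
        worker.postMessage({ id, props });
      });
    },
    close() {
      return worker.terminate();
    },
  };
}

const renderer = createRenderer();
renderer.render({ text: 'hello' }).then((html) => {
  console.log(html); // <p>hello</p>
  return renderer.close();
});
```

The main thread only posts messages and awaits promises, so its event loop stays free to accept new requests and enforce timeouts while a worker is busy.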

[Figure: Server Rendering Service Architecture]

In this approach, two important guarantees are given to meet our requirements:

  • The event loop never blocks on the web server’s main thread.

  • Nothing else ever needs a worker thread’s event loop while it is blocked by rendering.

We get the thread pool functionality described above from a third-party library, Piscina. It manages worker thread pools and supports features such as task queuing, task cancellation, and more. For the main-thread web server, we chose Fastify for its high performance and developer friendliness.

Here is the server on Fastify:

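const workerPool = new Piscina({...});

// The route registration was lost from this snippet; `app.post` is an assumption.
app.post('/batch', opts, async (request, reply) => {
    if (
        Math.min(avgRunTime.movingAverage(), RENDER_TIMEOUT_MSECS) * (workerPool.queueSize + 1) >
        RENDER_TIMEOUT_MSECS // right-hand side reconstructed; assumed to be the render timeout
    ) {
        // Request is not expected to complete in time.
        throw app.httpErrors.tooManyRequests();
    }

    try {
        // The timer and pool calls below are reconstructed (assumed); the originals were lost.
        const start = performance.now();
        currentPendingTasks += 1;
        const resp = await workerPool.run(...); // task payload is not shown in the source
        const stop = performance.now();
        const runTime = resp.duration;
        const waitTime = stop - start - runTime; // time not spent rendering, mostly queueing
        avgRunTime.push(Date.now(), runTime); // first argument assumed to be a timestamp
        reply.send({
            results: resp,
        });
    } catch (e) {
        // Error handling code
    } finally {
        currentPendingTasks -= 1;
    }
});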
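The rejection check at the top of the handler can be isolated into a small, self-contained sketch. Everything here is an assumption made for illustration: the timeout value, the `MovingAverage` class (the source does not show how `avgRunTime` is implemented), and the comparison against the render timeout.

```javascript
const RENDER_TIMEOUT_MSECS = 1000; // assumed value, for illustration only

// Simple moving average over the last `size` samples. This is a
// hypothetical stand-in for the `avgRunTime` object in the handler.
class MovingAverage {
  constructor(size) {
    this.size = size;
    this.samples = [];
  }

  push(value) {
    this.samples.push(value);
    if (this.samples.length > this.size) this.samples.shift();
  }

  value() {
    if (this.samples.length === 0) return 0;
    return this.samples.reduce((sum, v) => sum + v, 0) / this.samples.length;
  }
}

// Estimates how long a new request would take if every queued request,
// plus this one, takes the average render time, and rejects the request
// if that estimate exceeds the timeout.
function shouldReject(avgRunTimeMs, queueSize, timeoutMs = RENDER_TIMEOUT_MSECS) {
  const estimatedMs = Math.min(avgRunTimeMs, timeoutMs) * (queueSize + 1);
  return estimatedMs > timeoutMs;
}

const avg = new MovingAverage(3);
[150, 200, 250].forEach((ms) => avg.push(ms));

console.log(shouldReject(avg.value(), 3)); // false: 200 ms * 4 = 800 ms
console.log(shouldReject(avg.value(), 5)); // true: 200 ms * 6 = 1200 ms
```

Capping the average at the timeout keeps one pathological render from inflating the estimate without bound, while the queue size makes the check stricter as the backlog grows.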


Autoscaling for horizontal scaling

The server rendering service runs on PaaSTA, which provides autoscaling mechanisms out of the box. We decided to build a custom autoscaling signal based on worker thread pool utilization:

Math.min(currentPendingTasks, WORKER_COUNT) / WORKER_COUNT;

This value is compared against our target utilization (the setpoint) over a moving time window to make horizontal scaling adjustments.

We have found that this signal keeps the load on each worker thread in a healthier, more accurately provisioned state than basic scaling on container CPU usage. This ensures that all requests are served in a reasonable amount of time without overloading worker threads or over-scaling the service.
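As a rough sketch of how such a signal could drive scaling decisions, the example below averages utilization samples over a window and scales the instance count proportionally to the ratio of observed to target utilization. The proportional rule and the `desiredInstances` helper are assumptions made for illustration; the source does not describe how PaaSTA turns the signal into an instance count.

```javascript
// Utilization signal from the text: the fraction of worker threads in use.
function utilization(currentPendingTasks, workerCount) {
  return Math.min(currentPendingTasks, workerCount) / workerCount;
}

// Assumed proportional rule (not confirmed by the source): scale the
// instance count by observed utilization divided by the setpoint.
function desiredInstances(currentInstances, samples, setpoint) {
  const average = samples.reduce((sum, v) => sum + v, 0) / samples.length;
  return Math.max(1, Math.ceil(currentInstances * (average / setpoint)));
}

// Samples collected over a moving window from instances with 8 workers each.
const samples = [utilization(6, 8), utilization(12, 8), utilization(4, 8)];
console.log(samples);                           // [ 0.75, 1, 0.5 ]
console.log(desiredInstances(4, samples, 0.5)); // 6
```

Because the signal saturates at 1 once every worker is busy, it measures how much of the pool is occupied rather than how much CPU the container is burning.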

Autotuning for vertical scaling

Yelp is made up of many pages with different traffic loads, so the server rendering service shards that support these pages have vastly different resource requirements.

Instead of defining resources statically for each shard of the server rendering service, we took advantage of dynamic resource autotuning, which automatically adjusts container resources such as CPU and memory for each shard over time.

These two scaling mechanisms guarantee each shard the instances and resources it needs, no matter how little or how much traffic it receives. The main benefit is the seamless yet cost-effective operation of the server-side rendering service on various pages.


By rewriting the server rendering service with Piscina and Fastify, we were able to avoid the blocking event loop problem that the previous implementation suffered from.

By combining the sharding approach with improved scaling signals, we were able to get more performance while reducing cloud computing costs. Here are the specific improvements:

  • An average reduction of 125 ms in the p99 time to server-render a package.

  • Faster service startup, from several minutes on the old system to several seconds on the new one, because far fewer packages are initialized at startup.

  • Lower cloud computing costs, down to a third of those of the previous system, thanks to a custom scaling factor and more efficient resource tuning for each shard.

  • Improved observability, because each shard now renders only one package. This allows teams to quickly understand exactly where something is going wrong.

  • A more extensible system that allows for further improvements such as CPU profiling and source map support for packages.
