How to Configure a Web Application for High Loads
Hello, my name is Alexander Adadurov. I am the project manager of the Federal State Budgetary Institution “Center for Information and Technical Support”. In this article, I will describe the experience of setting up a website with educational content for a peak load of up to 15,000 requests per second or up to several million users per day.
The educational content of the site consisted of illustrated HTML pages, video tutorials, and various interactive tasks, mostly in JavaScript, which checked the correctness of the tasks by making requests to the backend. The site lived a quiet life and developed sluggishly until the introduction of lockdowns due to the spread of COVID-19. The first months of quarantine significantly changed the application code, its architecture, and even the server infrastructure on which it was located.
Original architecture
The development team consisted of 3–5 people at different times, and the project was written over several years, during which views on the architecture and the overall concept changed. Individual parts were rewritten, and the team itself changed. As a result, by the beginning of the pandemic the project code was rather uneven and not always checked for efficiency. By the time the load increased, the code contained classes, methods, and even whole bundles whose purpose the team did not fully understand.
The site was written in PHP on the Symfony 3 framework, without a clear separation between frontend and backend. Web interfaces were rendered with the Twig template engine, and jQuery was used mainly for interactivity. PostgreSQL 9.6 served as the DBMS, and some of the data was cached in Redis, a NoSQL store, at the initiative of individual developers. The site had an API for uploading and multi-stage processing of new content, for which a queue system was built on two RabbitMQ brokers.
The project ran on 16 physical servers: the frontends and backends had 24 cores and 128 GB of RAM each, and the DBMS nodes had 56 cores and 512 GB of RAM. Each server had four 10-gigabit network interfaces, which provided an aggregated channel of 40 Gbit/s. The nodes had 2 TB hard drives with the OS installed, and the backend nodes additionally hosted the PHP/Symfony code. Shared resources such as images, videos, and downloads that were required on all nodes were kept in the storage system and mounted on each node as NFS network shares.
The original architecture already included some ideas for handling high loads.
For example, the project was divided into two segments based on the type of content processing and consisted of a “video service” and an “engine”.
The video service was located on a separate video subdomain. All video materials were uploaded to it, processed separately, and embedded into the content.
The engine was a classic HTML content management system: authorization, favorites, action history, and the catalog.
At the entry point there were Nginx balancers (frontends), two per segment; incoming requests were distributed between the balancers by the DNS server using the Round Robin method. Requests were distributed between the backends using the Least Connections algorithm, in which the next request goes to the backend with the fewest active connections:
upstream backend {
    least_conn;
    server 192.168.1.100:80 weight=10 max_fails=10 fail_timeout=2s;
    ...
    server 192.168.1.104:80 weight=10 max_fails=10 fail_timeout=2s;
}
For long-running, resource-intensive operations, such as downloading, unpacking, and processing new content, preparing videos, and cutting chunks for the video service, RabbitMQ queues and additional system software were used: ffmpeg, zip, wkhtmltopdf.
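To give an idea of how such tasks reached the queue, here is a minimal sketch of publishing a processing job to RabbitMQ with the php-amqplib library; the host, credentials, queue name, and payload are illustrative, and the project may well have used a Symfony bundle wrapper instead.

<?php
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('192.168.1.60', 5672, 'guest', 'guest');
$channel = $connection->channel();

// Durable queue so pending tasks survive a broker restart.
$channel->queue_declare('content_processing', false, true, false, false);

$task = json_encode(['action' => 'transcode_video', 'file' => '/uploads/lesson42.mp4']);
$message = new AMQPMessage($task, ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]);

// The default exchange routes the message to the queue with the matching name.
$channel->basic_publish($message, '', 'content_processing');

$channel->close();
$connection->close();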
A 20-gigabit Internet channel with the possibility of expansion to 40 Gbit/s was connected to the server room. As network engineers say, we were “sitting on the nine” (MMTS-9, the major Moscow traffic exchange point).
Increase in load
With everyone moving to remote work in April 2020, the load on the portal increased sharply. Monitoring and visualization tools played a major role in detecting problems and finding solutions: Zabbix, the Symfony Profiler, Cockpit, DBeaver, and Nginx Amplify.
At certain moments, Zabbix and the other monitoring tools showed a total load of up to 15,000 requests per second. This was largely driven by advertising campaigns run by colleagues: each campaign brought another surge. We quickly found out that the site could not cope with such loads: users saw a 502 Bad Gateway error, or the site did not respond at all, as if under a DDoS attack. Something had to be done urgently.
Below are the metrics for that period: total number of requests and traffic per week.
The monitoring systems made it clear that the problem was not in any one place; everything was overloaded: frontends, backends, and the DBMS. A comprehensive solution was required, so optimization went on in parallel in several directions with continuous interaction between all colleagues. I will describe in order what was done.
Load balancing
The bottleneck on the front balancers, as the graphs showed, was the Nginx logs. To reduce disk operations, we enabled Nginx log buffering (the buffer and flush parameters in the access_log settings of the http block in nginx.conf). With this configuration, Nginx flushed request logs to the local disk at certain intervals, and these moments showed up as sharp spikes on the graphs. The result was a comb-shaped graph, and sometimes one of these spikes hit the ceiling and stayed there, that is, the balancer hung.
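For reference, an access_log directive with buffering enabled looks roughly like this; the path, log format, and values are illustrative, not our production settings:

# Write log entries when the 512 KB buffer fills, or at least once per minute
access_log /var/log/nginx/access.log combined buffer=512k flush=1m;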
To fix the problem, as an emergency measure, we moved the logs to virtual RAM disks created with standard OS tools:
mount -t tmpfs -o size=25G tmpfs /mnt/ramdisk
The size was chosen empirically. This helped at the first stage; later we reconfigured logging and disabled log buffering.
Redis and caching
The next task was to reduce the load on the backends. We solved it by caching everything that could be cached. In the original architecture, Redis mainly stored session keys, and some developers also kept part of the Symfony cache there. This was not regulated in any way; individual developers did it on their own initiative.
Since the main page carried the heaviest load, we started with it and made sure the page was served 100% from the Redis cache. After that, it began to open instantly. Then we reviewed the code of the most frequently used functions and adapted it to cache results in Redis. This also paid off: the speed increased and the load on the backends dropped.
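A simplified sketch of the idea, assuming the Symfony Cache component with a Redis backend; the cache key, TTL, template name, and Redis address are illustrative:

<?php
use Symfony\Component\Cache\Adapter\RedisAdapter;

// Connection and namespace are illustrative.
$redis = RedisAdapter::createConnection('redis://192.168.1.70:6379');
$cache = new RedisAdapter($redis, 'pages');

// Serve the main page straight from Redis; render with Twig only on a cache miss.
function renderMainPage(\Twig\Environment $twig, RedisAdapter $cache): string
{
    $item = $cache->getItem('main_page_html');
    if (!$item->isHit()) {
        $item->set($twig->render('main/index.html.twig'));
        $item->expiresAfter(600); // keep the rendered HTML for 10 minutes
        $cache->save($item);
    }

    return $item->get();
}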
Later, the speed was increased further when a separate server was allocated for Redis and a Redis cluster of 10 nodes was created, in which each node does not hold all the data but knows which node does.
/server/redis/redis-cli --cluster create \
    192.168.1.70:7000 192.168.1.70:7001 192.168.1.70:7002 192.168.1.70:7003 192.168.1.70:7004 \
    192.168.1.70:7005 192.168.1.70:7006 192.168.1.70:7007 192.168.1.70:7008 192.168.1.70:7009 \
    --cluster-replicas 1 --cluster-yes
In this case, we also used Symfony's ability to work directly with Redis clusters.
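As an illustration, connecting the Symfony Cache component directly to the cluster can look roughly like this; the node addresses and namespace are illustrative, and the redis_cluster DSN option appeared in later versions of the Cache component than the one shipped with Symfony 3, so the exact syntax depends on the version:

<?php
use Symfony\Component\Cache\Adapter\RedisAdapter;

// The DSN lists several cluster nodes; the client discovers the rest itself.
$client = RedisAdapter::createConnection(
    'redis:?host[192.168.1.70:7000]&host[192.168.1.70:7001]&host[192.168.1.70:7002]&redis_cluster=1'
);

$cache = new RedisAdapter($client, 'app_cache', 3600);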
But however pleased we were with the speed gain from the Redis cluster and the transfer of the Symfony cache to it, this implementation still had overhead. Comparative tests showed that talking to a cluster somewhere on the network, even in the same server room, is still slower than reading data directly from local RAM.
In addition, a few days later our network colleagues brought graphs showing that the network was overloaded with requests to the Redis cluster. We then decided to split the entire cache into two levels (a sketch follows the list):
the first level, containing data relevant only to each individual backend;
the second level, shared by all backends, holding user sessions and other data created after user authorization.
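A minimal sketch of the two-level idea, assuming the Symfony Cache component: ApcuAdapter stands in for the per-backend level and a plain Redis connection for the shared level; the names, addresses, and lifetimes are illustrative:

<?php
use Symfony\Component\Cache\Adapter\ApcuAdapter;
use Symfony\Component\Cache\Adapter\ChainAdapter;
use Symfony\Component\Cache\Adapter\RedisAdapter;

// Level 1: local to this backend, no network round trip.
$local = new ApcuAdapter('local_cache', 300);

// Level 2: shared storage visible to all backends (sessions, post-login data).
$redis  = RedisAdapter::createConnection('redis://192.168.1.70:7000');
$shared = new RedisAdapter($redis, 'shared_cache', 3600);

// Reads try the local level first and fall back to the shared one.
$cache = new ChainAdapter([$local, $shared]);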
DBMS
The single node dedicated to the engine's DBMS, although powerful (56 cores and 512 GB of RAM), also could not cope with the number of queries. We divided the database optimization into two parts: working with the code and organizing a DBMS cluster.
Working with the code. Using PostgreSQL and, above all, the built-in Symfony profiler, we identified redundant and unjustifiably complex database queries and adjusted the code.
DBMS cluster. It turned out that the database also hung because of avalanche-like growth in the number of connections. To manage the connection pool, we placed PgBouncer in front of it in session pooling mode (pool_mode = session).
PgBouncer is an application from the PostgreSQL ecosystem that manages a pool of database connections; for the client this is transparent, as if it were connecting to the PostgreSQL server itself. PgBouncer accepts connections and passes them to the DBMS server, or queues them when all connections in the pool (default_pool_size) are busy. As pooled connections are released, the queue is processed.
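A minimal pgbouncer.ini sketch of this setup; the addresses, database name, and pool sizes are illustrative values, not the actual production configuration:

[databases]
appdb = host=192.168.1.50 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; one server connection is held for the whole client session
pool_mode = session
; clients wait in a queue when all pooled connections are busy
default_pool_size = 100
max_client_conn = 5000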
Four more servers were added for the DBMS, and a cluster of five nodes was created. Queries were distributed across the nodes with PgPool, another useful application from the PostgreSQL ecosystem. PgPool was configured for load balancing and so that write queries (INSERT, UPDATE, DELETE) were sent only to the master node:
# Enabling load balancing for read queries
load_balance_mode = on
# Enabling master-slave mode with streaming replication
master_slave_mode = on
master_slave_sub_mode = 'stream'
The number of incoming connections on the PostgreSQL nodes themselves was also limited (max_connections).
Increasing the number of backends and frontends
The measures above allowed us to optimize the site's operation and use the existing server capacity effectively: the cores and memory were fully loaded. Everything would have been fine, but after a month of such intense work various hardware components began to fail: memory on one server, a network interface or disk on another. It became clear that the optimal level of server utilization was not 70-80% of capacity but roughly 40-50%.
In addition, the site still occasionally slowed down. We therefore decided to increase the number of frontends to five and the number of backends to seven.
In conclusion
The optimization story did not end there. The next bottleneck was the 20-gigabit channel, which periodically filled up to 100%. The most suitable solution in every respect seemed to be adapting the site to use a CDN, but that is another story.
In addition to designing the system for high loads from the start, it is also important to follow certain rules at all stages of development, keeping in mind that the project will operate under heavy load. Here are some of these rules:
use parameterized, filtered queries instead of loading a whole collection of objects to check a single property (see the sketch after this list);
minimize loops that issue a database query on every iteration;
understand the internals of third-party functions and methods, bundles, and plugins, use them deliberately and with their peculiarities in mind, and prefer built-in language functions whenever possible;
where possible, use queues to process resource-intensive and long-running operations asynchronously, such as sending an email or downloading a file;
cache everything you can;
transfer parts of the functionality from the backend to the user's browser;
allocate static content into a separate segment with the ability to connect to a CDN.
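As an illustration of the first two rules, here is a minimal sketch of a parameterized Doctrine query that filters on the database side instead of loading every object and checking a property in a PHP loop; the entity and field names are made up for the example:

<?php
use Doctrine\ORM\EntityManagerInterface;

// Fetch only the lessons for one subject instead of iterating over all of them.
function findLessonsBySubject(EntityManagerInterface $em, string $subject): array
{
    return $em->createQuery(
        'SELECT l FROM AppBundle\Entity\Lesson l WHERE l.subject = :subject'
    )
        ->setParameter('subject', $subject)
        ->setMaxResults(100)
        ->getResult();
}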