How we survived a 10x load increase while working remotely, and what conclusions we drew

Hello, Habr! For the last couple of months we have been living through a very interesting situation, and I would like to share our story of scaling the infrastructure. During this time, SberMarket grew 4x in orders and launched the service in 17 new cities. The explosive growth in demand for grocery delivery required us to scale our infrastructure. The most interesting and useful findings are under the cut.


My name is Dima Bobylev, and I am the Technical Director of SberMarket. Since this is the first post on our blog, I will say a few words about myself and the company. Last fall I took part in a contest for young leaders of the Runet. For the contest I wrote a short essay on how we at SberMarket see our internal culture and our approach to developing the service. And although I did not win the contest, I did formulate for myself the basic principles for developing our IT ecosystem.

When managing a team, it is important to understand and balance what the business needs against the needs of each individual developer. SberMarket is growing 13-fold or more year over year, and this affects the product, creating a constant shortage of development capacity. At the same time, we have to find time for preliminary analysis and writing high-quality code. The approach we have formed helps not only to build a working product but also to scale and evolve it further. As a result of this growth, SberMarket has become a leader among grocery delivery services: we now deliver about 18,000 orders every day, whereas at the beginning of February there were about 3,500.


Once a customer asked a SberMarket courier to deliver the groceries contactlessly: straight to the balcony

But let's get to the specifics. Over the past few months we have been actively scaling our company's infrastructure. The need for this was driven by both external and internal factors. Along with the expanding customer base, the number of connected stores grew from 90 at the beginning of the year to more than 200 by mid-May. Of course, we prepared: we reserved capacity for the core infrastructure and counted on being able to scale all the virtual machines hosted in the Yandex cloud both vertically and horizontally. However, practice showed that "everything that can go wrong will go wrong." So today I want to share the most interesting situations that happened over these weeks. I hope our experience will be useful to you.

Slaves on full alert

Even before the pandemic began, we were facing a growing number of requests to our backend servers. The trend of ordering groceries with home delivery was gaining momentum, and with the introduction of the first self-isolation measures due to COVID-19, the load grew dramatically before our eyes, all day long. We needed to quickly offload the master servers of the main database and move part of the read queries to the replica servers (slaves).

We had been preparing for this step in advance, and two slave servers had already been launched for such a maneuver. They mostly handled batch jobs that generate information feeds for exchanging data with partners. These processes created extra load and had quite rightly been taken "outside the brackets" a couple of months earlier.

Since the slaves ran replication, we stuck to the principle that applications may work with them only in read-only mode. Our Disaster Recovery Plan assumed that in the event of a disaster we could simply promote a slave in place of the master and switch all write and read requests to it. However, we also wanted to use the replicas for the needs of the analytics department, so the servers were not switched to fully read-only status: each host had its own set of users, and some of them had write permissions for saving intermediate calculation results.

Up to a certain load level, the master was enough for both writes and reads when handling HTTP requests. In mid-March, just as SberMarket decided to switch completely to remote work, our RPS began to multiply. More and more of our customers went into self-isolation or started working from home, and this showed up in our load metrics.

The master's performance was no longer enough, so we started moving some of the heaviest read queries to a replica. To transparently route writes to the master and reads to the slave, we used the Ruby gem "Octopus". We created a dedicated user with the _readonly postfix and no write permissions, but due to a configuration error on one of the hosts, some write requests went to the slave server on behalf of a user who did have those rights.
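
Schematically, the read/write split with Octopus looks roughly like this (a minimal sketch; the host names, shard label and model are placeholders rather than our actual configuration):

    # config/shards.yml (schematically):
    #   octopus:
    #     environments:
    #       - production
    #     production:
    #       slave1:
    #         adapter: mysql2
    #         host: db-slave-1.internal      # placeholder host
    #         username: app_readonly         # the _readonly user without write grants
    #         database: app_production

    # app/models/product.rb
    # With replicated_model, Octopus sends SELECTs to the configured slaves
    # and keeps INSERT/UPDATE/DELETE on the master.
    class Product < ActiveRecord::Base
      replicated_model
    end

    # Pinning a particularly heavy read to a specific replica:
    Octopus.using(:slave1) do
      Product.where(store_id: 42).count
    end

The incident above came down to a mistake at exactly this configuration level: it is enough for one host's connection settings to reference a user with write grants, and writes quietly start landing on a replica.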

The problem did not show up immediately, because the increased load also increased the slave lag. The data inconsistency was discovered in the morning, when after the nightly imports the slaves had not "caught up" with the master. We attributed this to the high load on the service itself and to the imports associated with launching new stores. But serving data with a many-hour delay was unacceptable, so we switched the processes to the second, analytical slave, because it had more resources and was not loaded with read queries (which is how we explained to ourselves its lack of replication lag).
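
In hindsight, a trivial lag check per replica would have surfaced the problem long before the morning. A minimal sketch (not our production monitoring; hosts and the threshold are made up):

    # replication_lag_check.rb -- poll each replica and report Seconds_Behind_Master
    # so that lag is noticed long before the morning imports.
    require "mysql2"

    REPLICAS = {
      "slave-main"      => "db-slave-1.internal",
      "slave-analytics" => "db-slave-2.internal",
    }.freeze

    MAX_LAG_SECONDS = 300

    REPLICAS.each do |name, host|
      client = Mysql2::Client.new(host: host, username: "monitor",
                                  password: ENV["DB_MONITOR_PASSWORD"])
      status = client.query("SHOW SLAVE STATUS").first
      lag    = status && status["Seconds_Behind_Master"]

      if lag.nil?
        puts "#{name}: replication is NOT running"
      elsif lag > MAX_LAG_SECONDS
        puts "#{name}: lag #{lag}s exceeds #{MAX_LAG_SECONDS}s threshold"
      else
        puts "#{name}: OK, lag #{lag}s"
      end
      client.close
    end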

By the time we figured out why the main slave had "drifted," the analytical one had already failed for the same reason. Despite having two additional servers to which we planned to shift the load in the event of a master crash, an unfortunate mistake meant that at the critical moment there were none.

But since we kept not only database dumps (restoring from one would have taken about 5 hours at that point) but also snapshots of the master server, we managed to bring up a replica within 2 hours. True, after that we were in for a long replay of the replication log (because the process runs in single-threaded mode, but that is a completely different story).

Conclusion: After this incident it became clear that we had to abandon the practice of restricting writes per user and declare the entire server read-only. With this approach there is no doubt that the replicas will be available at the critical moment.
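
On MySQL this conclusion boils down to keeping read_only = 1 (and, on 5.7+, super_read_only = 1) on every replica and verifying it continuously. A sketch of such a check, with placeholder hosts:

    # readonly_audit.rb -- every replica must report read_only = 1, so that no
    # user can write to it regardless of their grants.
    require "mysql2"

    %w[db-slave-1.internal db-slave-2.internal].each do |host|
      client = Mysql2::Client.new(host: host, username: "monitor",
                                  password: ENV["DB_MONITOR_PASSWORD"])
      read_only = client.query("SELECT @@global.read_only AS ro").first["ro"]
      super_ro  = begin
        client.query("SELECT @@global.super_read_only AS sro").first["sro"]
      rescue Mysql2::Error
        nil # variable does not exist before MySQL 5.7
      end

      warn "#{host}: read_only is OFF!" if read_only != 1
      warn "#{host}: super_read_only is OFF" if super_ro == 0
      client.close
    end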

Optimizing even one heavy query can bring the database "back to life"

Although we constantly update the catalog on the site, the queries we sent to the slave servers could tolerate a slight lag behind the master. But the time it took us to discover and fix the problem of the slaves suddenly falling far behind exceeded the "psychological barrier" (in that time prices could have been updated and customers would have seen stale data), and we had to switch all queries to the main database server. As a result, the site worked slowly... but at least it worked. And while the slaves were recovering, we had no choice but to optimize.

While the slave servers were recovering, the minutes dragged on slowly, the master remained overloaded, and we threw all our effort into optimizing the active workload by the Pareto rule: we picked the TOP queries that generated most of the load and started tuning them, right "on the fly."

An interesting effect was that a MySQL server loaded to the limit responds even to a small improvement. Optimizing a couple of queries that accounted for only 5% of the total load already produced a noticeable drop in CPU usage. As a result, we were able to secure an acceptable reserve of resources for the master and gain the time needed to restore the replicas.
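
For reference, on MySQL 5.6+ the "pick the TOP queries" step does not have to be guesswork: performance_schema already aggregates statements by digest, so the handful of query shapes responsible for most of the time is one SELECT away. A sketch (host and credentials are placeholders):

    # top_queries.rb -- pull the heaviest query shapes from performance_schema
    # (timer columns are in picoseconds).
    require "mysql2"

    client = Mysql2::Client.new(host: "db-master.internal", username: "monitor",
                                password: ENV["DB_MONITOR_PASSWORD"])

    rows = client.query(<<~SQL)
      SELECT DIGEST_TEXT                     AS query,
             COUNT_STAR                      AS calls,
             ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_sec,
             ROUND(AVG_TIMER_WAIT / 1e9, 2)  AS avg_ms
      FROM performance_schema.events_statements_summary_by_digest
      ORDER BY SUM_TIMER_WAIT DESC
      LIMIT 10
    SQL

    rows.each do |r|
      puts format("%10.1f s total | %8d calls | %8.2f ms avg | %s",
                  r["total_sec"], r["calls"], r["avg_ms"], r["query"].to_s[0, 80])
    end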

Conclusion: Even a small optimization lets you "survive" an overload for several hours. That was just enough for us while the replica servers were being restored. By the way, we will discuss the technical side of query optimization in one of the upcoming posts, so subscribe to our blog if that might be useful to you.

Organize performance monitoring of partner services

We process customer orders, so our services constantly interact with third-party APIs: gateways for sending SMS, payment platforms, routing systems, a geocoder, the Federal Tax Service and many other systems. And when the load began to grow rapidly, we started running into the API limits of our partners, which we had not even thought about before.

Unexpectedly exceeding the quotas of partner services can lead to downtime of your own. Many APIs block clients that exceed their limits, and in some cases a flood of requests can overload the partner's production environment.
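
The minimum defensive behaviour towards a partner API is to notice HTTP 429, honour Retry-After if it is sent, and back off instead of retrying immediately. A minimal sketch; the endpoint and parameters are invented for illustration:

    # partner_api_client.rb -- back off on rate limiting instead of hammering
    # the partner's production when our own traffic spikes.
    require "net/http"
    require "json"
    require "uri"

    def call_partner_api(path, params, max_attempts: 5)
      uri = URI("https://api.partner.example#{path}")   # hypothetical endpoint
      uri.query = URI.encode_www_form(params)

      attempt = 0
      loop do
        attempt += 1
        response = Net::HTTP.get_response(uri)

        if response.is_a?(Net::HTTPSuccess)
          return JSON.parse(response.body)
        elsif response.code == "429"
          raise "rate limited #{max_attempts} times in a row" if attempt >= max_attempts
          # Respect Retry-After when present, otherwise back off exponentially.
          sleep((response["Retry-After"] || 2**attempt).to_i)
        else
          raise "partner API error: HTTP #{response.code}"
        end
      end
    end

    # route = call_partner_api("/v1/route", from: "55.75,37.61", to: "55.76,37.64")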

For example, as the number of deliveries grew, the supporting services could not cope with distributing them and calculating routes. The result: orders were being placed, but the route-building service was down. I must say that our logistics team did the nearly impossible under those conditions, and well-coordinated teamwork helped compensate for the temporary service failures. But such a volume of orders cannot be processed by hand for long, and after a while we would have faced an unacceptable gap between orders and their fulfillment.

A number of organizational measures and the team's coordinated work bought us time while we negotiated new terms and waited for some partners to modernize their services. There are also APIs that delight you with high resilience and ungodly pricing under heavy traffic. For example, at first we used a well-known mapping API to determine delivery addresses. But at the end of the month we received a tidy bill for almost 2 million rubles, after which we decided to replace it quickly. I won't do any advertising here, but I will say that our expenses have dropped significantly.

Conclusion: You need to track the terms of use of all your partner services and keep them in mind. Even if today they seem to offer plenty of headroom, that does not mean they won't become an obstacle to growth tomorrow. And, of course, it is better to agree on the financial terms of an increased request volume in advance.

Sometimes it turns out that "Need more gold" (c) does not help

We are used to bottlenecks in the main database or the application servers, but when scaling, trouble can appear where you least expect it. For full-text search on the site we use the Apache Solr engine. As the load grew, we noticed response times degrading and the server's CPU usage hitting 100%. What could be simpler: just give the Solr container more resources.

Instead of the expected performance gain, the server simply "died." It immediately pegged at 100% and responded even more slowly. Initially we had 2 cores and 2 GB of RAM. We decided to do what usually helps and gave the server 8 cores and 32 GB. Everything got much worse (exactly how and why, we will tell in a separate post).

Over several days we untangled the intricacies of this issue and achieved optimal performance with the same 8 cores and 32 GB. That configuration allows us to keep increasing the load today, which is very important, because growth is coming not only from customers but also from the number of connected stores: over 2 months their number has doubled.

Conclusion: Standard approaches like "just add more hardware" do not always work. When scaling any service, you need to understand well how it uses resources and test its behavior under the new conditions in advance.

Stateless – the key to easy horizontal scaling

In general, our team follows a well-known approach: services should be stateless and independent of the runtime environment. This allowed us to ride out the load growth through simple horizontal scaling. But we had one exception: the handler for long-running background tasks. It sent emails and SMS, processed events, generated feeds, imported prices and stock levels, and processed images. As it happened, it depended on local file storage and ran as a single instance.

When the number of tasks in the handler's queue grew (which naturally happened as the number of orders increased), the performance of the host running both the handler and the file storage became the limiting factor. As a result, assortment and price updates, user notifications and many other critical functions got stuck in the queue. The Ops team quickly migrated the file storage to S3-compatible network storage, which allowed us to spin up several powerful machines to scale out the background task handler.
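
The essence of that migration, in a simplified sketch (the class, bucket and keys are illustrative, not our actual code): generated feeds and images go to S3-compatible storage instead of the local disk, so any number of worker machines can share them.

    # feed_storage.rb -- store background-job artifacts in S3-compatible storage
    # rather than on the local disk of a single host.
    require "aws-sdk-s3"
    require "tempfile"

    class FeedStorage
      def initialize(bucket:)
        # Credentials and, for an S3-compatible provider, the endpoint are taken
        # from the environment / client options.
        @bucket = Aws::S3::Resource.new(region: ENV.fetch("AWS_REGION", "us-east-1"))
                                   .bucket(bucket)
      end

      # Upload a generated partner feed and return its object key.
      def store(partner_id, local_path)
        key = "feeds/#{partner_id}/#{File.basename(local_path)}"
        @bucket.object(key).upload_file(local_path)
        key
      end

      # Download a feed into a temp file for processing on any worker.
      def fetch(key)
        tmp = Tempfile.new(File.basename(key))
        @bucket.object(key).download_file(tmp.path)
        tmp
      end
    end

    # storage = FeedStorage.new(bucket: "partner-feeds")   # hypothetical bucket
    # storage.store(42, "/tmp/partner_42.xml")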

Conclusion: The stateless rule must be observed for all components without exception, even when it seems that "we definitely won't hit a limit here." It is better to spend a little time organizing all the systems properly than to rewrite code in a hurry while repairing a service that is buckling under overload.

7 principles for intense growth

Despite having extra capacity available, in the course of this growth we stepped on a few rakes. During this time the number of orders grew more than 4-fold. We now deliver more than 17,000 orders a day in 62 cities and plan to expand our geography even further: in the first half of 2020 the service is expected to launch throughout Russia. To cope with the growing load, and with the bumps we had already collected in mind, we formulated 7 basic principles for working under constant growth:

  1. Incident management. We created a Jira board where every incident is reflected in a ticket. This helps to actually prioritize and complete incident-related tasks. After all, it is not scary to make a mistake; it is scary to make the same mistake twice. For cases when an incident recurs before its root cause can be fixed, a runbook should be ready, because under heavy load it is important to react with lightning speed.
  2. Monitoring is required for all infrastructure elements without exception. Thanks to it, we could predict load growth and pick the right "bottlenecks" to eliminate first. Under high load, most likely everything you never thought about will break or start to slow down. So it is best to create new alerts right after the first incidents occur, in order to monitor and anticipate them.
  3. Correct alerts are simply essential during a sharp increase in load. First, they must report exactly what is broken. Second, there should not be too many of them, because an abundance of non-critical alerts leads to all alerts being ignored.
  4. Applications must be stateless. We made sure there are no exceptions to this rule. You need complete independence from the runtime environment; to achieve it, you can store shared data in the database or, for example, directly in S3. Better yet, follow the rules of https://12factor.net. During a sharp increase in load there is simply no time to optimize the code, and you will have to cope with the load by directly adding computing resources and scaling horizontally.
  5. Quotas and performance of external services. With rapid growth, a problem can arise not only in your own infrastructure but also in an external service. The most annoying case is when it happens not because of a failure but because you hit quotas or limits. So external services must be able to scale just as well as you do.
  6. Separate processes and queues. This helps a lot when a jam forms at one of the gateways. We would not have run into data-transfer delays if the clogged SMS-sending queues had not interfered with the exchange of notifications between our information systems. And the number of workers would be easier to increase if they ran separately (see the sketch after this list).
  7. Financial realities. When data flows grow explosively, there is no time to think about tariffs and subscriptions. But you need to keep them in mind, especially if you are a small company. The owner of any API, just like your hosting provider, can present you with a hefty bill, so read the contracts carefully.
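
To illustrate principle 6: assuming a Sidekiq-style background worker setup (the post does not name the queueing library, and the worker classes below are purely illustrative), putting SMS and inter-system notifications in separate queues means a clogged SMS gateway cannot hold up notifications, and each queue gets its own independently scalable pool of workers.

    # Separate queues, separate worker pools -- each can be scaled on its own:
    #   bundle exec sidekiq -q sms
    #   bundle exec sidekiq -q notifications
    require "sidekiq"

    class SmsWorker
      include Sidekiq::Worker
      sidekiq_options queue: "sms", retry: 5

      def perform(phone, text)
        # call the SMS gateway here
      end
    end

    class SystemNotificationWorker
      include Sidekiq::Worker
      sidekiq_options queue: "notifications", retry: 10

      def perform(event_payload)
        # exchange notifications between information systems here
      end
    end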

Conclusion

Not without losses, but we survived this stage, and today we try to stick to all of the principles we arrived at, and every machine can easily have its capacity scaled up 4x to cope with any surprises.

In upcoming posts we will share our experience of investigating the performance degradation in Apache Solr, talk about query optimization, and explain how interacting with the Federal Tax Service helps the company save money. Subscribe to our blog so you don't miss anything, and tell us in the comments whether you ran into similar troubles as your traffic grew.
