How to prepare a marketplace for heavy loads?

Stabilizing the technical side of an e-commerce product and verifying that all processes work correctly should start well in advance and be finished several weeks before a period of increased customer activity (for example, Black Friday).

Yet mass sales often destabilize the marketplace's entire IT system because nothing was done beforehand to update the configuration or code. Together with the Scallium team, we tried to work out what needs to be done in advance so that an e-commerce platform can withstand the influx of buyers.

The technical team is always on standby

Even if you are confident in your IT system, something can always go wrong. You need a monitoring and notification system that promptly informs you about abnormal platform behavior so that you can fix problems with minimal disruption.

To respond quickly to issues, your development team needs to be readily available. The goal is to resolve problems before they become critical.

Fault tolerance and reliability as part of the architecture

Your marketplace or ecosystem should be designed and developed from the outset to be able to handle rapidly increasing workloads.

We recommend building the product as a set of separate services, each an independent component that can scale out automatically (for example, on Kubernetes).

Services should communicate by exchanging messages through asynchronous queues, with retry logic in case of an error. This adds reliability when a service is overloaded: the message can be processed a little later, once the load drops or the service has scaled up.
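The retry idea can be illustrated with a minimal sketch. It uses Python's in-process `queue.Queue` as a stand-in for a real broker; the `Message` class, the `consume` function, and the limit of three attempts are assumptions made for illustration, not part of the Scallium product. A production consumer would normally also wait before redelivery and route exhausted messages to a dead-letter queue on the broker.

```python
import queue
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    attempts: int = 0  # how many times processing has already failed


def consume(q, handler, max_attempts=3):
    """Drain the queue, requeueing failed messages until max_attempts is reached.

    Returns two lists: bodies processed successfully and bodies given up on.
    """
    processed, dead_letter = [], []
    while True:
        try:
            msg = q.get_nowait()
        except queue.Empty:
            break  # nothing left to process
        try:
            handler(msg)
            processed.append(msg.body)
        except Exception:
            msg.attempts += 1
            if msg.attempts < max_attempts:
                q.put(msg)  # try again later, after the other messages
            else:
                dead_letter.append(msg.body)  # keep for manual inspection
    return processed, dead_letter


def flaky_handler(msg):
    # Hypothetical service that is overloaded for the first two deliveries
    # of "order-2" and then recovers.
    if msg.body == "order-2" and msg.attempts < 2:
        raise RuntimeError("service overloaded")


if __name__ == "__main__":
    q = queue.Queue()
    for body in ("order-1", "order-2"):
        q.put(Message(body))
    print(consume(q, flaky_handler))
```

Because a failed message goes back to the end of the queue, the other messages are not blocked while one service recovers.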

For example, if you are building a marketplace on the Scallium product, our service-oriented architecture (SOA), combined with a Kubernetes cluster, allows horizontal scaling. Scaling of a service is triggered by rising CPU load, growing RAM usage, and a growing number of messages on the bus. In a service architecture, communication between services is usually handled by a message bus, which we have implemented on top of RabbitMQ.

We have implemented auto-scaling at the Kubernetes level, deployed on Google Cloud Platform (GCP). This makes it possible to scale all resources up seamlessly and then scale them back down automatically when the load drops. GCP also provides automatic scaling of the nodes in the Kubernetes cluster out of the box, so as the load on the application grows, more resources become available for auto-scaling the application itself.
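To make the scaling triggers concrete, here is a small sketch of how a replica count could be derived from several metrics at once. It follows the proportional rule described in the Kubernetes Horizontal Pod Autoscaler documentation (desired replicas = ceil(current replicas × current value / target value), taking the largest proposal across metrics). The metric names, target values, and replica limits are assumptions chosen for illustration, and the real autoscaler adds tolerances and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current, metrics, targets, min_replicas=1, max_replicas=10):
    """Return the replica count suggested by the most demanding metric.

    metrics and targets map a metric name to its current and target value,
    both expressed per replica (for example, average CPU percent).
    """
    proposals = [
        math.ceil(current * metrics[name] / targets[name])
        for name in targets
    ]
    # Scale to satisfy the most loaded metric, within the allowed bounds.
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical per-replica targets: CPU %, RAM %, queued messages.
TARGETS = {"cpu_percent": 60, "ram_percent": 70, "queued_messages": 100}

if __name__ == "__main__":
    peak = {"cpu_percent": 90, "ram_percent": 50, "queued_messages": 400}
    quiet = {"cpu_percent": 20, "ram_percent": 30, "queued_messages": 10}
    print(desired_replicas(3, peak, TARGETS, max_replicas=20))
    print(desired_replicas(3, quiet, TARGETS, max_replicas=20))
```

In the peak example the queue backlog, not CPU, drives the decision, which is why a message count is a useful trigger alongside CPU and RAM.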

Health monitoring system for the SaaS platform

We recommend developing a monitoring tool that analyzes the health and performance of the system automatically and around the clock, records logs, generates reports on events and emergencies, and alerts the on-duty staff.

The data is analyzed by a dedicated service that predicts points of failure, giving technical experts time to respond before an emergency occurs. The monitoring system in Scallium, for example, is linked to a corporate chat to notify the relevant support lines.

The monitoring system should operate at two levels:

The infrastructure level covers the availability of system services, the network, and cluster resources (Prometheus). At the application level, we use Sentry (configured through its DSN) to track how the application behaves at the level of interaction between services.

These levels are complementary. Analyzed together, they let you warn about abnormal situations before they occur, or act promptly on problems that have already arisen.
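As a rough illustration of how the two levels can feed one alerting step, the sketch below checks a set of samples against warning and critical thresholds. It does not call the Prometheus or Sentry APIs; the `Rule` class, the metric names, and the threshold values are all hypothetical, and in practice the samples would come from those tools and the resulting alerts would be posted to the support chat.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    level: str       # "infrastructure" or "application"
    metric: str      # name of the sample to check
    warn: float      # value at which to warn ahead of a problem
    critical: float  # value at which immediate action is needed


def evaluate(rules, samples):
    """Return (level, metric, severity) tuples for every rule that fires."""
    alerts = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is None:
            # A missing sample is itself a sign that monitoring is broken.
            alerts.append((rule.level, rule.metric, "no-data"))
        elif value >= rule.critical:
            alerts.append((rule.level, rule.metric, "critical"))
        elif value >= rule.warn:
            alerts.append((rule.level, rule.metric, "warning"))
    return alerts


# Hypothetical rules, one per monitoring level.
RULES = [
    Rule("infrastructure", "node_cpu_percent", warn=70, critical=90),
    Rule("application", "errors_per_minute", warn=1, critical=5),
]

if __name__ == "__main__":
    samples = {"node_cpu_percent": 75, "errors_per_minute": 7}
    for level, metric, severity in evaluate(RULES, samples):
        print(f"[{severity}] {level}: {metric}")
```

The warning threshold corresponds to warning ahead of an abnormal situation, while the critical threshold corresponds to acting on a problem that has already arisen.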

Marketplace platform backup system

One aspect of the reliability of any platform (a marketplace, for example) is its backup strategy. The Scallium architecture follows the service architecture paradigm, which makes it possible to move key data storage points outside the Kubernetes cluster and use them as managed services from the provider (GCP).

This simplifies not only the administration of such an application, but also the creation and maintenance of the backup system. Since all the key points (in Scallium these are sets of services of different types: PostgreSQL, MongoDB, RabbitMQ, an S3 bucket, and dozens of Scallium services) are stand-alone provider services, backup is implemented with native GCP tools, which brings ease of management and reliability.