RBKmoney Payments under the hood – the infrastructure of the payment platform

Hi, Habr! After describing the internals of a large payment platform, the logical next step is to describe how these components run in the real world, on physical hardware. In this post I'll talk about how and where the platform's applications are hosted, how traffic from the outside world reaches them, and what our standard equipment rack looks like in any of our data centers.

Approaches and limitations

One of the first requirements we formulated before developing the platform was "the ability to scale computing resources close to linearly, in order to process any number of transactions."

The classic approach of the payment systems used by market participants implies a ceiling, albeit a rather high one according to their claims. It usually sounds like this: "our processing can handle 1000 transactions per second."

This approach does not fit our business objectives or our architecture. We do not want to have any kind of limit. In fact, it would be strange to hear Yandex or Google say "we can process 1 million search queries per second." The platform should process as many requests as the business needs at the moment, thanks to an architecture that, to put it simply, lets us send an IT engineer to the data center with a cart of servers, which he installs in the racks, connects to the switch, and leaves. The platform orchestrator then rolls out copies of the business applications onto the new capacity, and we get the increase in RPS that we need.

The second important requirement is high availability of the services we provide. It would be fun, but not very useful, to build a payment platform that can accept an infinite number of payments into /dev/null.

Perhaps the most effective way to achieve high availability is to duplicate the entities that provide a service many times over, so that the failure of any reasonable number of applications, machines, or data centers does not affect the overall availability of the platform.
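To put rough numbers on this idea, here is a minimal Python sketch. It assumes that copies fail independently of each other, which real hardware only approximates, and the 0.99 availability figure is purely illustrative rather than a measurement from our platform.

```python
def combined_availability(single, replicas):
    """Probability that at least one of `replicas` independent copies is up."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("at least one replica is required")
    # The service is down only if every copy is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


# Illustrative figure: each copy is available 99% of the time.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two copies of a 99% available component already give about 99.99%, which is why many cheap duplicates can beat one expensive machine.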

Duplicating applications many times over requires a large number of physical servers and the network equipment that goes with them. This hardware costs money, and our funds are finite: we cannot afford to buy a lot of expensive hardware. So the platform is designed to be easy to deploy and to run well on a large number of inexpensive, not particularly powerful machines, or even in a public cloud.

Using servers that are not the most powerful has its advantages: the failure of one of them does not critically affect the state of the system as a whole. Consider which is better: losing an expensive, large, super-reliable branded server that runs a master-slave DBMS (and by Murphy's law it will certainly burn out, and on the evening of December 31 at that), or losing a couple of servers in a 30-node cluster that runs a masterless scheme?

Based on this logic, we decided not to create yet another massive point of failure in the form of a centralized disk array. Shared block devices are provided by a Ceph cluster deployed in a hyperconverged manner on the same servers, but with a separate network infrastructure.

Thus we logically arrived at the general design of a universal rack with computing resources in the form of inexpensive, not very powerful servers in a data center. If we need more resources, we either fill up any free rack with servers or install another rack, preferably close by.

And in the end, it is simply beautiful. When a known number of identical machines is installed in the racks, the cabling can be done properly, and we get rid of swallow's nests of wires and the danger of getting tangled in them and bringing processing down. A system that is good from an engineering point of view should be beautiful everywhere: on the inside in the form of code, and on the outside in the form of servers and network hardware. A beautiful system works better and more reliably; I have seen enough examples of this in my own experience.

Please do not think that we are penny-pinchers or that the business is tight with funding. Developing and maintaining a distributed platform is actually very expensive. In fact, it is even more expensive than owning a classical system built, roughly speaking, on powerful branded hardware with Oracle/MSSQL, application servers, and the rest of the stack.

Our approach pays off with high reliability, very flexible horizontal scaling, no ceiling on the number of payments per second and, strange as it may sound, a great deal of fun for the IT team. For me, how much the developers and DevOps engineers enjoy the system they are building is no less important than predictable development timelines or the quantity and quality of the features being rolled out.

Server infrastructure

Logically, our server capacity can be divided into two main classes: hypervisor servers, for which the density of CPU cores and RAM per rack unit matters, and storage servers, where the emphasis is on the amount of disk space per rack unit, with CPU and RAM chosen to match the number of disks.

Currently our typical compute server looks like this:

  • 2xXeon E5-2630 CPU;
  • 128G RAM;
  • 3xSATA SSD (Ceph SSD pool);
  • 1xNVMe SSD (dm-cache).

A server for storing state:

  • 1xXeon E5-2630 CPU;
  • 12-16 HDD;
  • 2 SSD for block.db;
  • 32G RAM.

Network infrastructure

In the choice of network hardware, our approach is somewhat different. For switching and routing between VLANs we still use branded switches, currently Cisco SG500X-48 and, in the SAN, Cisco Nexus C5020.

Each server is connected to the network through 4 physical ports:

  • 2x1GbE – management network and RPC between applications;
  • 2x10GbE – network for storage.

The interfaces inside the machines are bonded, and the tagged traffic is then split across the appropriate VLANs.

Perhaps this is the only place in our infrastructure where you can see the label of a well-known vendor, because for routing, network filtering, and traffic inspection we use Linux hosts. We do not buy specialized routers. Everything we need is configured on servers running Gentoo (iptables for filtering, BIRD for dynamic routing, Suricata as IDS/IPS, Wallarm as WAF).

Typical rack in DC

Schematically, as we scale out, the racks in a DC are practically identical to one another, except for the uplink routers, which are installed in one of them.

The exact proportions of servers of different classes can vary, but the general logic holds: there are more compute servers than storage servers.

Block devices and resource sharing

Let's try to put it all together. Imagine that we need to place several of our microservices in the infrastructure. For clarity, let these be microservices that need to talk to each other via RPC, one of which is Machinegun, which stores its state in a Riak cluster, plus some auxiliary services such as ES and Consul.

A typical layout will look like this:

VMs with applications that require the maximum speed of a block device, like Riak and Elasticsearch hot nodes, use partitions on local NVMe disks. Such VMs are tightly bound to their hypervisor, and applications themselves are responsible for the availability and integrity of their data.

For shared block devices, we use Ceph RBD, usually with a write-through dm-cache on a local NVMe disk. The OSDs behind a device can be either all-flash or HDD with an SSD journal, depending on the desired response time.

Delivery of traffic to applications

To balance requests coming from outside, we use the standard OSPFv3 ECMP scheme. Small virtual machines running nginx, BIRD, and Consul announce shared anycast addresses from the lo interface into the OSPF cloud. On the routers, BIRD creates routes with multiple next hops for these addresses, which gives per-flow balancing, where a flow is "src-ip src-port dst-ip dst-port". BFD is used to quickly withdraw a balancer that has disappeared.

When a balancer is added or fails, the corresponding route appears on or disappears from the upstream routers, and network traffic is delivered to the balancers according to the equal-cost multi-path approach. Unless we deliberately intervene, all network traffic is distributed evenly across the available balancers, with each IP flow going to one of them.
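The per-flow part of this scheme can be illustrated with a minimal Python sketch. It is not the hashing algorithm that any particular router actually uses: the SHA-256 hash, the function name pick_next_hop, and the documentation-range addresses are all assumptions made for illustration. The point is only that the same "src-ip src-port dst-ip dst-port" tuple always lands on the same balancer while different flows spread across all of them.

```python
import hashlib


def pick_next_hop(src_ip, src_port, dst_ip, dst_port, next_hops):
    """Choose a balancer for a flow; the same 4-tuple always gets the same one."""
    if not next_hops:
        raise ValueError("no balancer is announcing the anycast address")
    flow = f"{src_ip} {src_port} {dst_ip} {dst_port}".encode()
    # A stable hash of the flow tuple (illustrative choice, not a router's real one).
    digest = hashlib.sha256(flow).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical balancer addresses behind one anycast address.
balancers = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
flow = ("198.51.100.7", 40312, "203.0.113.1", 443)

# Every packet of the same flow goes to the same balancer.
assert pick_next_hop(*flow, balancers) == pick_next_hop(*flow, balancers)

# If a balancer fails, its route is withdrawn and flows are spread over the rest.
remaining = [b for b in balancers if b != pick_next_hop(*flow, balancers)]
print(pick_next_hop(*flow, remaining) in remaining)
```

Note that with a plain modulo, removing one next hop can also move flows that were not on the failed balancer; how much reshuffling happens in practice depends on the router implementation.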

By the way, ECMP balancing has non-obvious pitfalls that can lead to equally non-obvious loss of part of the traffic, especially if there are other routers or oddly configured firewalls on the path between the systems.

To solve the problem, we use the PMTUD daemon in this part of the infrastructure.

From there, the traffic goes inside the platform to specific microservices according to the nginx configuration on the balancers.

And while balancing external traffic is more or less simple and clear, extending the same scheme further inward would be difficult: we need something more than just checking at the network level that a container with a microservice is reachable.

For a microservice to start receiving and processing requests, it must register in Service Discovery (we use Consul), pass health checks every second, and have a reasonable RTT.

If the microservice feels and behaves well, Consul starts resolving the address of its container when its DNS is queried by the service name. We use the internal zone service.consul, so, for example, version 2 of the Common API microservice is named capi-v2.service.consul.
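The rule "registered, passing health checks, reasonable RTT" can be modeled with a small in-memory Python sketch. It does not use the Consul API at all: the ServiceRegistry class, its methods, the 50 ms RTT threshold, and the instance addresses are hypothetical and serve only to show which instances would end up in the answer for a service name.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    address: str
    healthy: bool = False  # set by the latest health check
    rtt_ms: float = 0.0    # round-trip time seen by the latest check


class ServiceRegistry:
    """Toy model of service discovery: only healthy, fast instances resolve."""

    def __init__(self, max_rtt_ms=50.0):
        self.max_rtt_ms = max_rtt_ms  # assumed threshold for a "reasonable" RTT
        self._services = {}

    def register(self, name, address):
        self._services.setdefault(name, {})[address] = Instance(address)

    def report_check(self, name, address, passed, rtt_ms):
        instance = self._services[name][address]
        instance.healthy = passed
        instance.rtt_ms = rtt_ms

    def resolve(self, name):
        """Return the addresses that a lookup of `name` would yield."""
        return sorted(
            instance.address
            for instance in self._services.get(name, {}).values()
            if instance.healthy and instance.rtt_ms <= self.max_rtt_ms
        )


registry = ServiceRegistry()
registry.register("capi-v2", "10.0.0.5")
registry.register("capi-v2", "10.0.0.6")
registry.report_check("capi-v2", "10.0.0.5", passed=True, rtt_ms=2.0)
registry.report_check("capi-v2", "10.0.0.6", passed=False, rtt_ms=2.0)
print(registry.resolve("capi-v2"))  # only the instance that passed its check
```

A freshly registered instance is not resolvable in this model until its first check passes, which matches the order of steps described above.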

As a result, the balancing part of our nginx config looks like this:

location = /v2/ {
    # Using a variable makes nginx resolve the service name at request time
    set $service_hostname "${staging_pass}capi-v2.service.consul";
    proxy_pass http://$service_hostname:8022;
}

Thus, again unless we deliberately intervene, traffic from the balancers is distributed evenly among all instances of a microservice registered in Service Discovery, and adding or removing instances of the required microservices is fully automated.

If a request has gone from the balancer to an upstream and the upstream died along the way, we return 502 to the outside: the balancer at its level cannot determine whether the request was idempotent, so we hand the handling of such errors to a higher level of logic.

Idempotency and deadlines

In general, we are not afraid or ashamed to return 5xx errors from the API; this is a normal part of how the system works, provided such errors are handled correctly at the level of RPC business logic. The principles of this handling are described in a short manual called the Errors Retry Policy, which we distribute to our merchant clients and implement within our own services.

To simplify this processing, we have implemented several approaches.

First, for any state-changing request to our API, you can specify an idempotency key that is unique within the account, lives forever, and guarantees that a repeated call with the same data set returns the same response.
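The behavior of such a key can be sketched in a few lines of Python. This is only a model under stated assumptions, not the platform's implementation: the IdempotentApi class, the create_payment method, and the response fields are hypothetical, and rejecting a reused key that arrives with different data is an assumption, since the text above only covers repeats with the same data set.

```python
class IdempotentApi:
    """Toy model: a repeated call with the same key returns the stored response."""

    def __init__(self):
        # (account_id, idempotency_key) -> (request data, stored response)
        self._seen = {}
        self._next_payment_id = 1

    def create_payment(self, account_id, idempotency_key, amount):
        key = (account_id, idempotency_key)
        if key in self._seen:
            stored_amount, response = self._seen[key]
            if stored_amount != amount:
                # Assumed behavior for a reused key with different data.
                raise ValueError("idempotency key reused with different data")
            return response  # no second payment is created
        response = {"payment_id": self._next_payment_id, "amount": amount}
        self._next_payment_id += 1
        self._seen[key] = (amount, response)
        return response


api = IdempotentApi()
first = api.create_payment("merchant-1", "order-42", 1000)
retry = api.create_payment("merchant-1", "order-42", 1000)  # e.g. after a 502
print(first == retry)
```

Because the key is scoped to the account, the same key string used by a different account creates a separate payment in this model.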

Second, we implemented an additional mechanism in the form of a unique payment session identifier, which guarantees the idempotency of debit requests and protects against erroneous repeated debits even if you do not generate and pass a separate idempotency key.

Third, we decided to make the response time of any external call to our API predictable and controllable from the outside, through a deadline parameter that sets the maximum time to wait for the requested operation to complete. For example, it is enough to pass the HTTP header X-Request-Deadline: 10s to be sure that your request will either be completed within 10 seconds or be killed by the platform somewhere inside, after which you can contact us again, following the request retry policy.
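How a service might turn such a header into a cutoff can be sketched as follows. Only the value 10s comes from the text above; the ms and m suffixes, the function names, and the use of a monotonic clock are assumptions for illustration and are not a description of the platform's actual parser.

```python
import re
import time

# Assumed unit suffixes; only "s" appears in the example above.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}
_RELATIVE = re.compile(r"^(\d+(?:\.\d+)?)(ms|s|m)$")


def parse_relative_deadline(value):
    """Convert a value such as '10s' into a number of seconds."""
    match = _RELATIVE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported deadline value: {value!r}")
    amount, unit = match.groups()
    return float(amount) * _UNIT_SECONDS[unit]


def deadline_from_header(value, now=None):
    """Return the moment (on a monotonic clock) after which work must stop."""
    now = time.monotonic() if now is None else now
    return now + parse_relative_deadline(value)


def is_expired(deadline, now=None):
    """True once the deadline has been reached."""
    now = time.monotonic() if now is None else now
    return now >= deadline


deadline = deadline_from_header("10s")
if not is_expired(deadline):
    print("still within the time budget, keep processing")
```

Passing the computed deadline along with each internal RPC call, rather than a fresh timeout per hop, is one way to make the whole chain stop at the same moment.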

Platform Management and Ownership

We use SaltStack as the tool for managing both configuration and the infrastructure as a whole. Separate tools for automated management of the platform's computing capacity have not taken off yet, although we now understand that we will move in this direction. Given our love for HashiCorp products, it will most likely be Nomad.

The main infrastructure monitoring tools are checks in Nagios, while for business entities we mostly set up alerts in Grafana. It is a very convenient tool for defining conditions, and the platform's event-based model lets us write everything to Elasticsearch and configure the selection conditions.

The data centers are located in Moscow; we rent rack space in them and install and manage all the equipment ourselves. We do not use dark fiber anywhere; our only outside connectivity is Internet from local providers.

Otherwise, our approaches to monitoring, management, and related services are fairly standard for the industry; I am not sure that yet another description of how these services are integrated deserves a place in this post.

With this article I will probably conclude the series of overview posts about how our payment platform is built.

I think the series turned out quite candid; I have come across few articles that reveal the inner workings of large payment systems in such detail.

In general, in my opinion, a high level of openness and candor is very important for a payment system. This approach not only increases the trust of partners and payers, but also disciplines the team, both the creators and the operators of the service.

So, guided by this principle, we recently made the platform status and the uptime history of our services public. The entire subsequent history of our uptime, updates, and outages is now public and available at https://status.rbk.money/.

I hope you found this interesting, and perhaps our approaches and the mistakes described will be useful to someone. If you are interested in any of the topics covered in these posts and would like me to go into more detail, please do not hesitate to write in the comments or in a private message.

Thank you for being with us!

P.S. For your convenience, here are pointers to the previous articles in the series:

  • introduction and description of how it all began;

  • microservices and platform business configuration;

  • implementation of the platform's processing business logic.
