Updating frontal NSPK systems without service interruption

Front-office (FO) systems are among the main mission-critical systems operated at NSPK today. They are responsible for processing and routing authorization requests between the acquiring bank and the issuing bank: it is through them that banks exchange data while you make a payment with your card. Up to 60 million authorizations per day pass through the FO, and at peak it processes 1800 TPS (transactions per second).

My name is Vadim Pashin. At NSPK I am in charge of front-office solutions, and today I want to share my experience implementing a system for managing bank connections.

The FO has a rather complex architecture, with four-fold redundancy for each server.

We use two data centers for geo-redundancy. Each data center has nodes that accept connections and process traffic from banks, and each node serves a subset of the banks. Redundancy is arranged as follows: a node serving participant traffic (node A) has a copy within the same data center (node B), and copies of both nodes also exist in the other data center.

There are 3 types of participant connection:

  • The participant has one active connection to one data center (Active-Passive);

  • The participant has two active connections, one to each data center (Active-Active);

  • The participant has four active connections across the two data centers (4 Active).

Like any other IT system, the FO requires periodic updates. We categorize updates into the following types:

  • Release;

  • Hotfix.

A release is produced in two-week sprints and may contain the following changes:

  • Business features – introduction of new business functionality into the payment system, for example services such as “Purchase with cash advance” or support for new wallet providers (Mir Pay, Samsung Pay, etc.);

  • Technical features – technical changes that simplify system maintenance, improve performance, or move the system to new technical solutions;

  • Bug fixing – elimination of bugs that do not affect the company’s business.

Hotfixes can be installed between releases and are intended to correct situations where the company’s business is affected and some of the traffic cannot be served correctly. These are not always errors in our system – it happens that after a bank installs a new version of its own software, errors appear in processing its traffic because the participant fills in some fields of the authorization protocol incorrectly. If the participant cannot solve the problem quickly, then, where possible, we work around the errors on our side until the bank fixes the problem on its own side.

As a rule, any change delivered as a release or hotfix requires a complete shutdown of the applications that process traffic on a node: new libraries must be rolled out, applications restarted, and the logs and monitoring system checked to confirm that no errors were generated at startup and that all FO modules are running. But we cannot stop processing traffic from banks – their customers cannot wait at the checkout or ATM for us to finish an update before they can make a purchase or withdraw cash. We also aim for 99.999% availability of our service.

The update takes place as follows:

  1. Stopping the applications on the standby nodes B, which carry no participant traffic.

  2. Updating the FO software on nodes B.

  3. Transferring traffic from active nodes A to updated nodes B by stopping node A.

  4. Checking that traffic is processed correctly and that there is no increase in declines or errors in the logs.

  5. Updating nodes A.

  6. Nodes B are now active and Nodes A are standby.

Participants exchange authorization messages using an application protocol based on ISO 8583, which describes the format of financial messages and how they are transmitted between systems that process bank card data. The transport protocol is TCP/IP. A participant has only two IPs to connect to (one per data center) and does not know which node (A or B) its traffic goes to. Previously, we used a so-called balancer, which checked the availability of node A when a connection was being established from the bank’s side: if node A was available, the connection was established with it; if node A was unavailable, the connection was established with node B.
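
For illustration, here is a minimal sketch of what an ISO 8583 exchange over TCP typically looks like from the connecting side. ISO 8583 itself does not define transport framing, so the 2-byte length prefix, the address and the dummy payload below are assumptions made for this example, not a description of NSPK's actual wire format.

```go
// A minimal sketch of exchanging ISO 8583 messages over TCP.
// ISO 8583 does not define transport framing; a 2-byte big-endian length
// prefix is assumed here purely for illustration.
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
)

// readMessage reads one length-prefixed message from the connection.
func readMessage(conn net.Conn) ([]byte, error) {
	var lenBuf [2]byte
	if _, err := io.ReadFull(conn, lenBuf[:]); err != nil {
		return nil, err
	}
	msg := make([]byte, binary.BigEndian.Uint16(lenBuf[:]))
	if _, err := io.ReadFull(conn, msg); err != nil {
		return nil, err
	}
	return msg, nil
}

// writeMessage writes one length-prefixed message to the connection.
func writeMessage(conn net.Conn, msg []byte) error {
	var lenBuf [2]byte
	binary.BigEndian.PutUint16(lenBuf[:], uint16(len(msg)))
	if _, err := conn.Write(lenBuf[:]); err != nil {
		return err
	}
	_, err := conn.Write(msg)
	return err
}

func main() {
	// Hypothetical endpoint: one of the two IPs a participant connects to.
	conn, err := net.Dial("tcp", "203.0.113.10:8583")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// A dummy payload standing in for an ISO 8583 authorization request (MTI 0100).
	if err := writeMessage(conn, []byte("0100...")); err != nil {
		panic(err)
	}
	resp, err := readMessage(conn)
	if err != nil {
		panic(err)
	}
	fmt.Printf("received %d bytes\n", len(resp))
}
```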

The scheme with the balancer looked as follows:

The balancers were convenient and easy to maintain: when a node was shut down, sessions were re-established on the backup node. Operating experience, however, revealed the following drawbacks:

  • the balancer determines the availability of a node only at the moment a session is established from the bank’s side;

  • it is impossible to update the FO without breaking connections. To transfer traffic to the backup nodes B, all connections are broken and the entire market has to re-establish its sessions, since after the transport-layer session is established the application-layer session must be established as well. Most banks can restore their connections automatically, but banks with different software do it at different speeds, so some authorizations are inevitably lost during the switchover. This negatively affects our availability;

  • if traffic is processed incorrectly on nodes B during an update, switching back to nodes A takes time.

We strive for 99.999% availability of our FO systems, so the company launched a project to develop a new complex for managing participant traffic. The following requirements were imposed on it:

  • the ability to quickly manually or automatically switch traffic between nodes A and B;

  • switching between nodes should not break the existing TCP session with banks;

  • fault tolerance: the new module must itself be redundant, and its failure must not break TCP sessions with banks;

  • convenient graphical web-based management interface with access control.

As a result, we got a new subsystem for managing connections with participants – MUPS/PUPS.

The connection diagram has changed as follows:

The system takes its name from the two modules it consists of:

  • PUPS (ПУПС) – Application Connection Control Proxy;

  • MUPS (МУПС) – Application Connection Control Module.

We moved the termination points for bank traffic from the data centers to the M9 and M10 traffic exchange points, where our communication equipment is located. We also placed the equipment for the new smart balancer at these sites.

At each of the M9/M10 traffic exchange points we placed an active and a standby MUPS/PUPS pair. The servers hosting these pairs are combined into a VRRP cluster using keepalived and share one virtual IP. Let’s move on to a description of these components and how the new complex works.

The PUPS is responsible for TCP interaction between the balancing node and the banks’ processing software. It implements a mechanism for replicating and transparently restoring TCP connections with a participant in the event of a planned switchover (a minimal sketch of this proxy role follows the list below). The PUPS:

  • accepts TCP connections;

  • initiates the exchange of data between the MUPS, the PUPS and the bank;

  • sends and receives application messages;

  • handles control connections between the MUPS and the PUPS;

  • re-establishes TCP connections and provides switching between the primary and backup MUPS/PUPS.
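
As referenced above, a very rough sketch of the proxy idea is given below: accept the bank's TCP connection and relay traffic to and from the MUPS over a separate connection. The addresses, ports and the plain byte relay are illustrative assumptions; the real PUPS additionally replicates connection state so that sessions can be restored transparently.

```go
// A rough sketch of the PUPS role: terminate the bank's TCP connection and
// relay traffic to the MUPS. All addresses and ports are hypothetical, and
// the real PUPS additionally replicates TCP state for transparent recovery.
package main

import (
	"io"
	"log"
	"net"
)

// handleBank relays bytes between one bank connection and the MUPS.
func handleBank(bankConn net.Conn, mupsAddr string) {
	defer bankConn.Close()

	mupsConn, err := net.Dial("tcp", mupsAddr)
	if err != nil {
		log.Printf("cannot reach MUPS: %v", err)
		return
	}
	defer mupsConn.Close()

	// Copy in both directions until either side closes.
	go io.Copy(mupsConn, bankConn) // bank -> MUPS
	io.Copy(bankConn, mupsConn)    // MUPS -> bank
}

func main() {
	// Listener on the address the participant connects to.
	ln, err := net.Listen("tcp", ":8583")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handleBank(conn, "10.0.0.2:9000") // hypothetical MUPS data endpoint
	}
}
```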

MUPS, the second component of the system, is intended for:

  • maintaining connections with FO nodes;

  • management of bank connections (enable / disable, connect to node A or node B);

  • wrapping the ISO 8583 message (authorization information from the bank) into its own protocol of interaction between the MUPS and the FO nodes (a sketch of such an envelope follows this list);

  • receiving messages from the FO node, unwrapping the ISO 8583 message and sending it to the PUPS;

  • sending the PUPS the command to migrate to the backup server.
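
As mentioned in the list above, the MUPS wraps ISO 8583 payloads into its own internal protocol towards the FO nodes. That protocol is not published, so the sketch below only illustrates the idea of an envelope carrying routing metadata; the field names and the JSON encoding are assumptions made for this example.

```go
// A hypothetical envelope illustrating how an ISO 8583 payload could be
// wrapped with routing metadata on the way between the MUPS and an FO node.
// Field names and JSON encoding are assumptions, not the real protocol.
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical internal message exchanged between MUPS and FO.
type Envelope struct {
	Kind       string `json:"kind"`        // "data", "heartbeat", "switch", ...
	BankID     string `json:"bank_id"`     // participant identifier
	TargetNode string `json:"target_node"` // "A" or "B"
	Payload    []byte `json:"payload"`     // raw ISO 8583 message when Kind == "data"
}

// wrap puts a raw ISO 8583 message into the internal envelope.
func wrap(bankID, node string, iso8583 []byte) ([]byte, error) {
	return json.Marshal(Envelope{Kind: "data", BankID: bankID, TargetNode: node, Payload: iso8583})
}

// unwrap extracts the ISO 8583 payload from an envelope received from the FO node.
func unwrap(raw []byte) ([]byte, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, err
	}
	return env.Payload, nil
}

func main() {
	wrapped, _ := wrap("BANK001", "A", []byte("0100...")) // dummy authorization request
	fmt.Println(string(wrapped))

	payload, _ := unwrap(wrapped)
	fmt.Printf("payload back to PUPS: %s\n", payload)
}
```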

One of the most important functions of the MUPS – the reason it was created – is switching traffic processing to the standby FO node and back without breaking the connection with the participating bank. This works because the MUPS sits between the PUPS, which “holds” the connection to the bank, and the FO node, which processes the traffic. The MUPS controls exactly where this traffic is directed at any given moment and, on a command from the administrator, performs a switchover between servers that is invisible to the bank and safe for in-progress operations.

It happens as follows:

  • the FO modules, on a command from the MUPS, go into a synchronization state;

  • the active module, which is currently processing operations, unloads the contexts of in-flight operations (for which it expects, but has not yet received, response messages from the bank) from its memory into the shared in-memory data grid (a small sketch of this handover follows the list);

  • the standby module takes over these contexts

  • once the unloading is complete, the MUPS deactivates the active module and passes to the standby module its new status and a number of runtime parameters that the previously active module was working with;

  • from this moment on, the MUPS directs traffic from the participant to the newly active module.
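
Here is a small sketch of the context handover referenced above. The article does not name the in-memory data grid product, so a trivial in-process map stands in for it; the type and field names are illustrative.

```go
// A minimal sketch of the switchover step: the active FO module unloads the
// contexts of in-flight operations into a shared store, and the standby
// module adopts them before being activated. A plain in-process map stands
// in for the real in-memory data grid; everything here is illustrative.
package main

import (
	"fmt"
	"sync"
)

// OpContext is the state of an authorization still waiting for a response.
type OpContext struct {
	STAN   string // system trace audit number of the request
	BankID string
	Node   string // node that originally sent the request
}

// DataGrid is a stand-in for the shared in-memory data grid.
type DataGrid struct {
	mu    sync.Mutex
	items map[string]OpContext
}

func NewDataGrid() *DataGrid { return &DataGrid{items: map[string]OpContext{}} }

// Unload is called by the active module: it publishes its in-flight contexts.
func (g *DataGrid) Unload(ctxs []OpContext) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for _, c := range ctxs {
		g.items[c.STAN] = c
	}
}

// TakeOver is called by the standby module: it adopts all published contexts.
func (g *DataGrid) TakeOver() []OpContext {
	g.mu.Lock()
	defer g.mu.Unlock()
	out := make([]OpContext, 0, len(g.items))
	for _, c := range g.items {
		out = append(out, c)
	}
	g.items = map[string]OpContext{}
	return out
}

func main() {
	grid := NewDataGrid()

	// Active node A publishes the operations it is still waiting on.
	grid.Unload([]OpContext{{STAN: "000123", BankID: "BANK001", Node: "A"}})

	// Standby node B adopts them and becomes the active module.
	adopted := grid.TakeOver()
	fmt.Printf("node B adopted %d in-flight operations\n", len(adopted))
}
```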

Two connections are used for data transmission and for controlling the MUPS. The first is the data connection, used to carry authorization data from the bank (ISO 8583) to the FO and back. The second is the control connection, used for the exchange of control messages between the MUPS and the PUPS. The control connection carries the heartbeat command, which checks whether the active MUPS/PUPS pair is alive, as well as a number of commands for moving connections to the backup MUPS/PUPS pair within the site.

Within the balancing node, the active PUPS interacts only with the MUPS installed on the same server.

If one side of the MUPS/PUPS pair receives no heartbeats from its peer for a specified time, the active node starts the procedure for activating the second node in the cluster (if available) and then deactivates itself.
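
The sketch below illustrates the heartbeat idea just described: one side sends periodic heartbeats over the control connection, and the other declares the peer dead if none arrive within a timeout. The interval, timeout and message format are assumptions made for illustration.

```go
// A sketch of the control-connection heartbeat: the watcher declares the
// peer dead if no heartbeat arrives within the timeout. Interval, timeout
// and message format are illustrative assumptions.
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

const (
	heartbeatInterval = 2 * time.Second
	heartbeatTimeout  = 5 * time.Second
)

// watchHeartbeats reads heartbeat lines and calls onDead when none arrive in time.
func watchHeartbeats(conn net.Conn, onDead func()) {
	r := bufio.NewReader(conn)
	for {
		conn.SetReadDeadline(time.Now().Add(heartbeatTimeout))
		if _, err := r.ReadString('\n'); err != nil {
			onDead() // no heartbeat in time: start activating the other node
			return
		}
	}
}

func main() {
	// An in-process pipe stands in for the real control connection.
	a, b := net.Pipe()

	// The sender stops after three beats to simulate a failed peer.
	go func() {
		for i := 0; i < 3; i++ {
			fmt.Fprintln(a, "HEARTBEAT")
			time.Sleep(heartbeatInterval)
		}
	}()

	watchHeartbeats(b, func() {
		fmt.Println("peer considered dead, activating the backup pair")
	})
}
```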

The migration process from the main server to the backup server is as follows:

  • the PUPS on the main server sets the ready-for-migration flag;

  • a process dump is created on the main server, then the PUPS transfers its image to the backup server and sets the ready-for-migration-and-recovery flag on the backup server;

  • when the image is detected on the backup server, the iptables rules are migrated and the Keepalived priority of the node is increased, which starts the IP address transfer procedure;

  • after the Keepalived IP address has moved to the standby server, the running process is restored from the image, and the Keepalived priority is restored to its original value.

This provides fault tolerance for the MUPS/PUPS pair within a single site.
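
The article does not name the tooling used for the process dump and restore; CRIU is one well-known way to checkpoint a process together with its established TCP connections, so the sketch below simply shells out to it as an assumption. The pid, paths and the surrounding orchestration (copying the image, moving the iptables rules, adjusting the Keepalived priority) are illustrative, not NSPK's actual procedure.

```go
// A sketch of a CRIU-style dump/restore of the proxy process, keeping
// established TCP connections in the image. Tooling, pid and paths are
// assumptions; the article does not name the actual mechanism.
package main

import (
	"fmt"
	"os/exec"
)

// dumpProcess checkpoints the process with the given pid into imageDir.
func dumpProcess(pid int, imageDir string) error {
	cmd := exec.Command("criu", "dump",
		"-t", fmt.Sprint(pid),
		"-D", imageDir,
		"--tcp-established", "--shell-job")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("criu dump failed: %v: %s", err, out)
	}
	return nil
}

// restoreProcess restores the checkpointed process from imageDir.
func restoreProcess(imageDir string) error {
	cmd := exec.Command("criu", "restore",
		"-D", imageDir,
		"--tcp-established", "--shell-job", "-d")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("criu restore failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Hypothetical values: pid of the active proxy and a local image directory.
	if err := dumpProcess(12345, "/var/lib/pups/checkpoint"); err != nil {
		fmt.Println(err)
		return
	}
	// The image is then copied to the backup server, the iptables rules are
	// moved and the Keepalived priority is raised there; after that, on the backup:
	if err := restoreProcess("/var/lib/pups/checkpoint"); err != nil {
		fmt.Println(err)
	}
}
```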

The MUPS and the FO nodes interact using their own protocol. The protocol carries payment information, checks the availability of the FO nodes via heartbeats, and can also transmit a number of commands needed to move traffic to an inactive FO node. One point is very important: when moving off an active node, all payment information must be collected and transferred to the backup node B. The constant heartbeats between the MUPS and the FO nodes make it possible to diagnose a problem with a node automatically and instantly transfer the participant’s traffic to the backup node without breaking the connection with the participant.

System administrators mostly work through the MUPS web management console. It displays the list of banks connected to us and the status of their connections. In a convenient interface we can see whether a connection is established only at the transport level or also at the application level, and which node (A or B) the bank is connected to. With a mouse click we can move the connections of a selected bank, or of all banks at once, between nodes A and B. The participant sees no breaks and no loss of authorization traffic.

Conclusion

The MUPS/PUPS complex has solved a number of significant issues for the company in managing banks’ application-level connections to us:

  • all work on the FO goes unnoticed by the participants; there are no disconnections and no lost transactions;

  • if a problem occurs on an FO node, traffic is transferred to the standby node automatically and instantly, and again the bank does not see any disconnection;

  • on-duty teams and FO administrators have received a convenient, visual tool for managing connections. Taking a node out of service to update the OS or replace hardware components also goes unnoticed by the participant and does not lead to lost transactions.
