How the bank broke

A failed IT infrastructure migration corrupted 1.3 billion customer records at a bank. The causes were a lack of testing and a careless attitude toward complex IT systems. Cloud4Y tells how it happened.

By 2018, the English bank TSB had realized that its “divorce” two years earlier from the Lloyds banking group (the two companies merged in 1995) was costing too much. TSB was still tied to its former partner through hastily cloned Lloyds IT systems. Worst of all, the bank had to pay “child support” in the form of annual license fees of $127 million.

Few people like paying money to their exes, so on April 22, 2018 at 18:00, TSB began the final stage of an 18-month plan that was supposed to change everything. The plan was to transfer billions of customer records to the IT system of the Spanish company Banco Sabadell, which had bought TSB for $2.2 billion back in 2015.

Banco Sabadell chairman Josep Oliu announced the upcoming event two weeks before Christmas 2017, during a festive staff meeting in a prestigious conference hall in Barcelona. The key migration tool was to be a new version of Banco Sabadell's own system, Proteo. It was even renamed Proteo4UK specifically for the TSB migration project.

At the Proteo4UK presentation, Jaime Guardiola Romojaro, Banco Sabadell's chief executive, boasted that the new system was a large-scale project with no equivalent in Europe, one that more than 1,000 specialists had worked on, and that its rollout would give a significant boost to Banco Sabadell's growth in the UK.

The migration was scheduled for April 22, 2018, a quiet Sunday evening in the middle of spring. The bank's IT systems were taken offline while records were transferred from one system to the other. Once public access to accounts was restored late on Sunday evening, the bank was expected to return to normal operation slowly and smoothly.

But while Oliu and Guardiola Romojaro were cheerfully announcing the Proteo4UK project from the stage, the staff responsible for the migration were very nervous. The project, which had taken 18 months, was seriously behind schedule and over budget. There was no time for additional testing. Yet transferring all of a company's data (billions of records, remember) to another system is a titanic job.

As it turned out, the engineers had good reason to be nervous.

The placeholder page that customers saw on the site for far too long

Twenty minutes after TSB reopened access to accounts, fully confident that the migration had gone smoothly, the first reports of problems came in.

People's savings suddenly disappeared. Small purchases were mistakenly recorded as expenses in the thousands. Some customers logged in to their online banking and saw not their own accounts but those of complete strangers.

At 9 p.m., TSB representatives told the local financial regulator, the UK Financial Conduct Authority (FCA), that the bank had problems. But the FCA had already noticed: TSB had clearly made a serious mess, and its customers were the ones left to suffer for it. Naturally, they began complaining on social media (these days, writing a couple of lines on Twitter or Facebook takes no effort). At 23:30, the FCA was contacted by another financial regulator, the Prudential Regulation Authority (PRA), which also sensed that something was wrong.

Well after midnight, the regulators managed to get through to one of the bank's representatives and asked the only question that mattered: “What the hell is going on?”

It took time to grasp the scale of the disaster, but we now know that 1.3 billion records belonging to 5.4 million customers were corrupted during the migration. For at least a week, customers were unable to manage their money from a computer or mobile device. They could not make loan payments, and many of the bank's customers ended up with a black mark on their credit history as well as late fees.

This is what TSB's online banking looked like to customers

Almost as soon as the failures began to appear, bank representatives gave assurances that the problems were “intermittent”. Three days later, the bank issued a statement that all systems were working normally. But customers kept reporting problems. Only on April 26, 2018 did the bank's CEO, Paul Pester, admit that TSB was “on its knees”, because its IT infrastructure still had a “bandwidth problem” that was preventing about a million customers from accessing online banking services.

Two weeks after the start of the migration, crashes were still reported in the online banking application, which generated internal errors related to the SQL database.
Difficulties with payments, especially on business and mortgage accounts, lasted up to four weeks. Ever-present journalists also found out that TSB had turned down an offer of help from Lloyds Banking Group at the very start of the migration crisis. Overall, problems with logging in to online services and transferring money persisted until September 3.

A bit of history

The first ATM opened on June 27, 1967 at a Barclays branch in Enfield.

Banking IT systems keep getting more complex as customers' needs and their expectations of the bank grow. Some 40 to 60 years ago, we were happy to visit the local branch during working hours to deposit or withdraw cash at the counter.

The amount of money in an account corresponded directly to the notes and coins we handed over to the bank. Household finances could be tracked with pen and paper, and customers had no access to computer systems. Bank employees entered data from passbooks and other records into machines that did the counting.

But in 1967, in north London, an ATM was installed outside bank premises for the first time. That event changed banking. Customer convenience became the guiding principle for how financial institutions developed, and it pushed banks to become more sophisticated in how they handled clients and their money. After all, as long as computer systems were available only to bank employees, the old paper-based way of dealing with customers was good enough. Only with the arrival of ATMs, and later online banking, did the general public gain direct access to banks' IT systems.

ATMs were just the beginning. Soon, people could skip the queue at the counter simply by calling the bank on the phone. This required special cards inserted into a reader capable of decoding the dual-tone multi-frequency (DTMF) signals sent when the user pressed the “1” (withdraw money) or “2” (deposit money) key.

Internet and mobile banking have brought customers closer to the core systems that banks run on. Despite their different constraints and configurations, all of these systems must interact efficiently with each other and with the central mainframe to check account balances, make money transfers, and so on.

Few customers think about how complicated a path the information travels when, for example, they log in to online banking to view or update details about the money in their account. When you log in, the data passes through a chain of servers. When you make a transaction, the system replicates that data to the backend infrastructure, which then does the hard work: moving money from one account to another to pay bills, make payments, and renew subscriptions.
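To make the “hard work” of the backend a little more concrete, here is a minimal sketch of moving money between two accounts as a single atomic transaction, so that a failure halfway through leaves both balances untouched. It uses Python's built-in sqlite3 module purely for illustration. The table layout, the account names, and the `transfer` function are assumptions made for this example and say nothing about how TSB's or any other bank's systems are actually built.

```python
import sqlite3


def open_demo_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger with two hypothetical accounts (balances in pence)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany(
        "INSERT INTO accounts (id, balance) VALUES (?, ?)",
        [("alice", 10_000), ("bob", 5_000)],
    )
    conn.commit()
    return conn


def balance(conn: sqlite3.Connection, account_id: str) -> int:
    """Return the current balance of one account."""
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    if row is None:
        raise KeyError(account_id)
    return row[0]


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` from src to dst; either both updates happen or neither does."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if an exception escapes from it.
    with conn:
        debited = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if debited.rowcount != 1:
            raise ValueError("unknown source account or insufficient funds")
        credited = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        if credited.rowcount != 1:
            raise ValueError("unknown destination account")


if __name__ == "__main__":
    ledger = open_demo_ledger()
    transfer(ledger, "alice", "bob", 2_500)
    print(balance(ledger, "alice"), balance(ledger, "bob"))
```

The point of the sketch is the rollback: if the credit step fails, the debit that was already applied is undone, so money never disappears in the middle of a transfer.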

Now multiply this process by several billion. According to data compiled by the World Bank with the support of the Bill and Melinda Gates Foundation, 69 percent of adults worldwide have a bank account. Each of these people has bills to pay. Some pay a mortgage or transfer money for children's clubs, others pay for a Netflix subscription or rent a cloud server. And these people are not all customers of the same bank.

The many internal IT systems of a single bank (mobile banking, ATMs, and so on) must do more than interact with each other. They also need to interact with other banks' systems in Brazil, China, and Germany. A French ATM must be able to dispense money held on a bank card issued somewhere in Bolivia.

Money has always been global, but the system has never been this complex. The number of ways to use a bank's IT systems keeps growing, while the old ways remain in use. A bank's success depends largely on how “maintainable” its IT infrastructure is and on how effectively the bank can cope with a sudden failure that takes its systems down.

No tests – get ready for problems

Banco de Sabadell CEO Jaime Guardiola (left) was confident that everything would go smoothly. It did not work out.

TSB's computer systems were not well suited to resolving problems quickly. There were software failures, of course, but in reality the bank “broke” because its IT systems were excessively complex. According to a report prepared in the early days of the mass outage, “the combination of new applications, the expanded use of microservices, and the use of two active (Active/Active) data centers led to compounded risk in production.”

Some banks, such as HSBC, operate globally and therefore also have very complex, interconnected systems. But according to one HSBC IT executive in Lancaster, those systems are regularly tested, migrated, and updated. He sees HSBC as a model for how other banks should manage their IT systems, by dedicating staff and time to the job. At the same time, he admits that for a smaller bank, especially one with no migration experience, getting this right is a very difficult task.

The TSB migration was difficult. And, according to experts, the bank's staff simply lacked the qualifications to handle that level of complexity. On top of that, they did not even bother to verify their solution by testing the migration in advance.

Speaking before the British Parliament on banking issues, Andrew Bailey, chief executive of the FCA, confirmed this suspicion. Bad code probably caused the initial problems only within TSB, but because the systems of the global financial network are interconnected, its errors were propagated and became irreversible. The bank kept seeing unexpected errors elsewhere in its IT architecture. Customers received messages that were meaningless or had nothing to do with their problems.

Regression testing could have helped prevent the catastrophe by catching bad code before it ran in production and did damage, creating errors that could not be rolled back. Instead, the bank chose to walk across a minefield it did not even know was there. The consequences were predictable. Another match tossed into this powder keg was cost “optimization”. How did it show itself? In an earlier decision to get rid of the backups stored at Lloyds, because they “ate up” too much money.
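One simple kind of pre-migration check is a reconciliation pass on a trial run: compare what is in the source system with what landed in the target and report every record that is missing, unexpected, or changed. The sketch below is only an illustration of that idea. The record layout and the names `record_fingerprint` and `reconcile` are hypothetical, and nothing here describes the checks TSB actually ran or skipped.

```python
import hashlib


def record_fingerprint(record: dict) -> str:
    """Hash a record's fields in a fixed key order so equal records hash equally."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: dict, target: dict) -> dict:
    """Compare two {record_id: record} mappings and list the discrepancies."""
    missing = sorted(rid for rid in source if rid not in target)
    unexpected = sorted(rid for rid in target if rid not in source)
    changed = sorted(
        rid
        for rid in source
        if rid in target
        and record_fingerprint(source[rid]) != record_fingerprint(target[rid])
    )
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


if __name__ == "__main__":
    # Hypothetical sample data standing in for a trial migration.
    old_system = {
        "001": {"owner": "A. Smith", "balance": 120_00},
        "002": {"owner": "B. Jones", "balance": 15_50},
        "003": {"owner": "C. Patel", "balance": 980_00},
    }
    new_system = {
        "001": {"owner": "A. Smith", "balance": 120_00},
        "002": {"owner": "B. Jones", "balance": 1_550_00},  # amount corrupted
        "004": {"owner": "D. Brown", "balance": 40_00},     # record from nowhere
    }
    report = reconcile(old_system, new_system)
    if any(report.values()):
        print("Do not cut over:", report)
```

A check like this only proves that the data survived the move. It says nothing about whether the new applications behave correctly under load, which needs separate functional and performance testing.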

British banks (and others too) strive for “four nines” availability, that is, 99.99%. In practice, this means the IT system must be available almost all the time, with no more than about 52 minutes of downtime per year. A “three nines” system, at 99.9%, does not look very different at first glance. But in fact it allows almost nine hours of downtime a year. For a bank, “four nines” is good, but “three nines” is not.
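The arithmetic behind these figures is easy to check. The short sketch below converts an availability percentage into the downtime it permits per year, assuming a 365-day year. The function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes ({minutes / 60:.2f} hours) per year")
```

For 99.99% this gives roughly 52.6 minutes a year, and for 99.9% roughly 525.6 minutes, or about 8.8 hours: a tenfold difference hidden behind one extra nine.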

But every time a company changes its IT infrastructure, it takes a risk, because something may go wrong. Making fewer changes helps avoid problems, and the changes that are unavoidable need to be tested thoroughly. This is exactly what British regulators turned their attention to.

Perhaps the easiest way to avoid downtime is to make fewer changes. But every bank, like any other company, has to keep introducing useful new features for customers and for its own business in order to stay competitive. At the same time, banks are still obliged to look after their customers by protecting their savings and personal data and making services comfortable to use. As a result, organizations have to spend a great deal of time and money keeping their IT infrastructure healthy while also offering new services.

According to figures released by the UK Financial Conduct Authority, the number of recorded technology failures in the UK financial services sector grew by 187 percent from 2017 to 2018. The most common cause of failure is a problem with newly introduced functionality. At the same time, it is critically important for banks to keep all services running without interruption and to report transactions almost instantly. Customers always get nervous when their money is stuck in limbo. And a customer who is nervous about money always means trouble; that is a sure sign.

A few months after the TSB failure (by which time the bank's CEO had resigned), the UK financial regulators and the Bank of England published a discussion paper on operational resilience. With it they tried to raise the question of how far banks had gone in the pursuit of innovation, and whether they could guarantee stable operation of the systems they already had.

The document also proposed changes to the law. The idea was to make individuals within a company responsible for what goes wrong in its IT systems. British parliamentarians explained it this way: “When you are personally responsible, and you can be bankrupted or sent to prison, it greatly changes your attitude to the job, including how much time you devote to reliability and security.”

Summary

Every update and fix comes down to risk management, especially when hundreds of millions of dollars are at stake. After all, if something goes wrong, it can be costly in both money and reputation. This all seems obvious. And the bank's failure during the migration should have taught it a great deal.

It should have. But it did not. In November 2019, TSB, which had returned to profitability and was slowly repairing its reputation, “delighted” customers with a new IT failure. This second blow left the bank forced to close 82 branches in 2020 in order to cut costs. It could simply have not skimped on IT specialists.

Stinginess with IT ends up being expensive. TSB reported a loss of $134 million in 2018, compared with a profit of $206 million in 2017. Post-migration costs, including compensation to customers, correcting fraudulent transactions (their number rose sharply during the banking chaos), and help from third-party specialists, came to $419 million. The bank's IT provider was also billed $194 million for its role in the crisis.

Still, whatever lessons are drawn from the TSB failure, outages will keep happening. They are inevitable. But with testing and good code, the number of crashes and the amount of downtime can be reduced significantly. Cloud4Y, which often helps large companies migrate to cloud infrastructure, is well aware of how important it is to move quickly from one system to another. That is why we can carry out load testing and use a multi-level backup system, along with other options that let you check everything possible before the migration begins.
