Hello, my name is Vladimir Kononenko, and I am Head of the Implementation Department at OTR Group of Companies. Everyone who works with information systems (IS) knows the following story: at some stage of the IS life cycle, the customer complains that everything is slow, users cannot work, the software cannot withstand the load, and so on. In this article, I would like to describe step by step how I solve performance and stability problems in the information systems created by our company.
I will rely on personal experience from recent OTR projects (we have many strong specialists with a high level of expertise, and there is a PostgreSQL competence center) as well as on my own background: I graduated with honors from Rostov State University with a degree in “Mathematical Methods and Operations Research in Economics” and received a certified Six Sigma black belt from Arizona State University, USA. Using examples from my practice, you will see which tools I choose and how they work at each stage.
Where to begin
As I was taught at the university, there is no practice without theory. This means that to solve a practical problem we need a methodology to act on. For myself, I chose the concept and tools of lean manufacturing and Six Sigma. The essence of the concept is to identify and eliminate costs that do not add value to the product, and to increase the stability of work processes.
So, what do we have at the beginning of our journey:
1) An IS that is the subject of complaints;
2) The client who makes these complaints, in the vast majority of cases at an everyday, anecdotal level, without a clear description or metrics that would characterize the problem;
3) A team of our optimization specialists.
Having collected these three elements, we take the first step.
Step 1. We make sure that the problem really exists
One of the principles of lean manufacturing helps here: evaluate the problem with your own eyes and do not trust reports alone. That is, as the head of the optimization team, I meet with the user who reproduces the problem most clearly and personally watch what actions he performs and what they lead to.
It does not happen often, but sometimes the problem is solved at this very stage! For example, one of our customers, a financial institution, had an unpleasant situation in one of its largest customer service divisions: clients could not connect to the financial document management system at times of peak activity. So I visited one of the division’s major clients and watched an accountant try to connect to the hardware-software encryption complex (APKSh) that provides the secure channel, get refused, and proudly declare: “You see, I can’t get into your program!”
The APKSh has nothing to do with the IS we created, but the client needs help, so we return to the customer’s division where the APKSh is located and begin the analysis. We find out that the division has 2 APKSh modules installed and that their combined capacity matches the number of clients of this division. But the clients were distributed across the modules like this: 3 VIP clients on one, and all the rest on the other. As a result, at times of peak client activity the overloaded module simply did not have enough bandwidth.
If only all problems were solved so simply. Unfortunately, they are not. Therefore, once we have made sure together with the end user that the problem is real and really related to our IS, we move on to the next step.
Step 2. Tidy up
After reviewing the problems faced by IS users, we turn to the 5S tool of lean manufacturing: putting things in order. In our case, this means putting things in order in the customer’s infrastructure, in its data center.
Let’s return to our client, the financial company and its large division. After solving the secure channel problem by redistributing users across the APKSh modules, we proceeded to analyze the reasons for the slow work and the long login in our own document management system. We start this analysis from two directions:
a) Analysis of the resource allocation settings of the authentication server’s Java application, with regard to the slow login to the IS
Here another surprise awaited us: the memory allocation setting was lower than in the smallest division of this organization. The answer to the question “Why so little?” surprised us even more: “But the server has no more…”. That is, in the largest division of the organization the authentication module was running on an old and very weak server. How could that be? The analysis showed that a few years earlier very powerful servers had been purchased for this module, mounted in the division’s data center, and turned on, and then they just heated the air for several years, because nobody had moved the authentication module onto them! So we helped to move our authentication module to its rightful place, and the slow login problem was solved.
b) Collection and analysis of nmon statistics on the hardware, with regard to the slow operation of the application
Having collected statistics on the operation of the DBMS server, we found that the I/O speed of the disk storage left much to be desired: it was several times below the norm. The analysis showed that the RAID array of the disk storage had been assembled from disks of different speeds. The most interesting part was that next to this production disk storage stood a flash-based storage system, not in operation but kept as a reserve. I think you can guess that very soon this reserve storage system became the production one, and that solved the problem of the slow application software (PPO).
This is how putting things in order in the data center helps to solve the performance problems of the application software!
But what do we do when the customer’s IT department earns its keep, the IT infrastructure works like clockwork, and the hardware is not overloaded, yet once the load on the IS reaches a certain level, certain operations slow down significantly? To resolve this, we move on to the next step.
Step 3. Finding the Root Source of the Problem
Very often it may seem that the cause of poor performance lies in one IS module, but a detailed analysis reveals the root of the problem in a completely different place, despite the misleadingly conspicuous symptoms.
Let me explain with an example. One of our customers has a very large and highly loaded IS with a dozen business modules and a separate integration layer that provides interaction between them. The integration was built on synchronous web services. The load on this integration layer was significant, and during peak periods of user activity it began to slow down, up to a complete hang, and complaints followed that “this integration has died again.”
The analysis revealed the following: when a document is sent to one of the business modules for processing, the integration layer calls a synchronous service and waits for the result (success or failure) so that it can pass the response back to the sending module. At peak load, the receiving module began to process documents unacceptably slowly and thereby kept a large number of integration connections open. At some point, all connections of the integration layer were occupied, it stopped responding to other modules, and the whole interaction collapsed.
Thus, the reason for the instability of the integration turned out to be the slow execution of operations in one of the business modules, and by solving this problem, we were able to return the system to balance and stable operation.
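The connection exhaustion described above can be estimated with Little's law: the average number of calls in flight equals the arrival rate multiplied by the response time. The sketch below is only an illustration. The pool size, the request rate, and the response times are hypothetical numbers, not measurements from the project.

```python
def busy_connections(requests_per_second: float, response_time_s: float) -> float:
    """Little's law: average calls in flight = arrival rate * time per call."""
    return requests_per_second * response_time_s


def pool_is_exhausted(requests_per_second: float,
                      response_time_s: float,
                      pool_size: int) -> bool:
    """True when the average demand meets or exceeds the available connections."""
    return busy_connections(requests_per_second, response_time_s) >= pool_size


# Hypothetical values, chosen only to show the effect.
POOL_SIZE = 200   # connections available in the integration layer
RATE = 50.0       # synchronous calls per second at peak

for response_time in (0.5, 2.0, 5.0):  # seconds the receiving module needs
    in_flight = busy_connections(RATE, response_time)
    state = "exhausted" if pool_is_exhausted(RATE, response_time, POOL_SIZE) else "ok"
    print(f"response {response_time:>4.1f} s: {in_flight:6.1f} connections busy, pool {state}")
```

With the request rate unchanged, a tenfold slowdown of a single receiving module is enough to occupy the whole pool, which is why the symptom shows up in the integration layer rather than in the slow module.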
It must be mentioned here that well-built logging of the IS is a great help in such an analysis, in particular logging the duration of business and technical operations, as are tools for working with logs. In our case, it was the ELK stack (Elasticsearch, Logstash, Kibana).
However, finding the true bottleneck is only half the battle; you also need to figure out what to do with it. Finding a solution can sometimes be non-trivial, especially when the code looks fine and the database calls show no slowdowns either, that is, when the operation is genuinely heavy and genuinely slow. What to do in this case?
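As an illustration of logging operation durations, here is a minimal sketch of a timing helper. It is not the logging used in our projects: the logger name, the field names, and the example operation are all hypothetical. It writes one JSON object per operation, a line format that log pipelines can typically parse.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("is.operations")  # hypothetical logger name


@contextmanager
def timed_operation(name: str, **fields):
    """Log one JSON line with the duration and outcome of the wrapped operation."""
    started = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise  # the caller still sees the original exception
    finally:
        duration_ms = round((time.perf_counter() - started) * 1000, 1)
        logger.info(json.dumps({
            "operation": name,
            "status": status,
            "duration_ms": duration_ms,
            **fields,  # extra context, e.g. module or document id
        }))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    # Hypothetical business operation; sleep stands in for the real work.
    with timed_operation("approve_document", module="module_2", document_id=42):
        time.sleep(0.05)
```

Because the duration is written in the `finally` branch, failed operations are timed as well, which matters when slow calls end in timeouts.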
Go to the final step.
Step 4. Eliminate Losses
If the problem has no obvious solution, then to optimize a heavy process you need to redesign the operation itself: eliminate activities that do not add value and waste resources.
As usual, let’s consider an example and return to the first steps and our client, the financial institution and its divisions. One of the divisions receives financial applications from clients, each carrying several signatures of the client’s employees; the applications then go through multi-stage control in the division and are signed by the division’s employees. As a result, applications reach the head of the department for approval in packs of several thousand documents, and there can be several dozen such packs per day.
We faced the following problem: the closer a document gets to approval by the head of the department, the longer its approval takes; the head could spend up to 2 hours on one pack of documents. The analysis showed that the longest operation at each approval stage was verifying all the signatures previously applied to the document, so naturally, as the number of approvals grew, so did the time needed to verify the growing number of electronic signatures. Yet checking all the signatures was unnecessary: to guarantee that the document has not changed, it is enough to check the validity of the first and the last signature at each step.
Eliminating these redundant steps from the business operation allowed us to bring the endorsement and approval operations down to an acceptable time.
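To show how the savings scale, here is a small counting sketch. It performs no cryptography and only counts signature checks. The pack size and the number of signatures present at each stage are hypothetical; the rule of checking only the first and the last signature is the one described above.

```python
def signatures_to_verify(existing_signatures: int, check_all: bool) -> int:
    """Number of signature checks one document needs at one approval stage."""
    if check_all:
        return existing_signatures
    # Optimized rule: only the first and the last signature are checked.
    return min(existing_signatures, 2)


def total_verifications(documents: int,
                        signatures_at_stage: list[int],
                        check_all: bool) -> int:
    """Signature checks for a whole pack across all approval stages."""
    per_document = sum(signatures_to_verify(n, check_all) for n in signatures_at_stage)
    return documents * per_document


# Hypothetical pack: 3 client signatures, then one more added at each stage.
PACK_SIZE = 3000
SIGNATURES_AT_STAGE = [3, 4, 5, 6, 7]

print("verify all:        ", total_verifications(PACK_SIZE, SIGNATURES_AT_STAGE, True))
print("first and last only:", total_verifications(PACK_SIZE, SIGNATURES_AT_STAGE, False))
```

Under these assumptions, the cost of the last stage no longer grows with the length of the approval chain: it stays at two checks per document regardless of how many signatures were added earlier.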
One more example. An IS consists of two business modules:
Business module 1. Responsible for the business processing of documents and the generation of accounting entries;
Business module 2. Responsible for storing accounting entries and for reporting.
A problem occurred when transferring accounting entries from the first module to the second.
The entries were transferred in batches of 10,000, and processing one batch in the second module took up to 30 minutes, which was too long and caused a queue of unprocessed batches to build up. The essence of the batch processing was quite simple: for each accounting entry, the balance with the matching combination of account and analytics codes and with the period matching the entry’s period was looked up in the balance table. The balance was recalculated from the entry’s data and saved back to the table, so processing one batch meant performing this operation 10 thousand times.
By applying the Pareto principle to the data in a batch, we found that the number of unique code combinations used in its accounting entries does not exceed 300, and that 10 combinations account for 90% of the entries. We changed the algorithm: now, when a batch is imported, the entries are first grouped by combination and a temporary balance is calculated for each combination in the batch, and only then are the balances (at most 300) retrieved, adjusted by the temporary balances, and saved. This approach reduced the number of balance lookups and saves in the database by tens of times, which brought the batch processing time down to the target: it now takes less than 1 minute!
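The change of algorithm can be sketched as follows. This is a simplified illustration, not the production code: an in-memory dictionary stands in for the balance table, and the key layout and sample data are hypothetical. Each function returns the number of balance read-and-save operations it performed.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical key: (account code, analytics code, period).
BalanceKey = tuple[str, str, str]
Entry = tuple[BalanceKey, Decimal]


def apply_per_entry(balances: dict[BalanceKey, Decimal], entries: list[Entry]) -> int:
    """Original approach: one balance lookup and save for every entry."""
    updates = 0
    for key, amount in entries:
        balances[key] = balances.get(key, Decimal("0")) + amount
        updates += 1
    return updates


def apply_grouped(balances: dict[BalanceKey, Decimal], entries: list[Entry]) -> int:
    """Optimized approach: sum the batch per combination, then touch each balance once."""
    deltas: dict[BalanceKey, Decimal] = defaultdict(Decimal)  # temporary balances
    for key, amount in entries:
        deltas[key] += amount
    for key, delta in deltas.items():
        balances[key] = balances.get(key, Decimal("0")) + delta
    return len(deltas)


# Hypothetical batch: 10,000 entries spread over 300 code combinations.
keys = [("20.01", f"A{i:03d}", "2024-01") for i in range(300)]
batch = [(keys[i % 300], Decimal("1.00")) for i in range(10_000)]

table_a: dict[BalanceKey, Decimal] = {}
table_b: dict[BalanceKey, Decimal] = {}
print("per entry:", apply_per_entry(table_a, batch), "balance updates")
print("grouped:  ", apply_grouped(table_b, batch), "balance updates")
print("same result:", table_a == table_b)
```

In a real database each of these updates is a query and a write, so cutting their number from the batch size to the number of distinct combinations is where the time is saved.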
This is the Lean Six Sigma algorithm I use to fix performance and stability problems in our information systems. I hope it was helpful. Share which algorithms or methods you use to solve information system performance problems. What works in your industries?