How to avoid disaster

All happy families are alike; each unhappy family is unhappy in its own way.

“Anna Karenina” L. Tolstoy.

After conducting hundreds of IT project audits and investigating dozens of incidents, we can paraphrase the classic: “All projects are happy in their own way, but unhappy in the same way.” The success of a project is a unique combination of parameters and people. Failures, on the other hand, come from the same problem or set of problems: small omissions that eventually add up to a crisis. Today I will tell you about one such case. I am Yulia, an expert at the SimbirSoft Quality Service.

Usually, few people react to small deviations from project management standards and process compliance: “Nothing terrible will happen.” This leads to big problems when specialists do not understand the risks, do not understand why a particular rule or agreement exists, and do not see what can happen in the end if it is not followed. After all, processes and rules are flexible (in the best traditions of agile :)). But seemingly insignificant little things can have a serious impact on the entire project. And lead to disaster…

At SimbirSoft we have a process for recording and analyzing incidents. First, of course, the consequences have to be eliminated as soon as possible. But the main goal is to understand why the incident happened and what needs to be done so that it does not happen again. Just as air crashes are investigated in aviation, incidents are investigated in IT, and in our company this is handled, among others, by the quality service (sometimes we jokingly call ourselves the “investigative committee”). But don’t worry, we don’t conduct interrogations with torture 🙂 We are building a culture of quality in which people can talk openly about mistakes, draw conclusions and learn from them.

But let’s return to the story of one incident on a mobile project that occurred at the end of December 2019, and then we will analyze what mistakes we made and how they could have been avoided.

26.12.2019

Last working week of the year. Nikolai, a backend developer on one of the projects, has long been planning a New Year’s trip to St. Petersburg. The last thing left to do is release the new features.

QA specialist Gulya wrote in the chat that regression testing was complete, the minor bugs found had already been fixed, and it was time to upload the application to the App Store and Google Play.

The team really liked this mobile application for online training: both the workouts and its owner’s approach. Our company developed the application on a turnkey basis: from technical specifications, design and architecture to an icon on the phones of those who dream of getting the figure they want. As usual, it consisted of two parts: a server and a mobile client (Android and iOS). The development team included an analyst, a designer, backend, Android and iOS developers, a QA specialist and a PM. A dream team of people passionate about their work and the product. The app was already in the stores: the first release took place on October 20. In the first month alone it had more than 50,000 downloads and more than 4,100 ratings and reviews.

By the New Year, the team planned to release new features and publish a new training course, for which the advertising campaign was already in full swing. Those who downloaded early got access to discounted workouts and additional videos on motivation and nutrition. A large audience of users was waiting for the release.

We had everything ready. We were only waiting for the App Store and Google Play reviews to finish.

30.12.2019

Project PM Anna informed the client that the application had successfully passed the checks and could be released.

The client decided to release on December 31. New Year, new life! And we were young and reckless 🙂

31.12.2019

The new version became available to users. Downloads and updates started, and everything was running normally. The team had already shared their plans for the New Year holidays, put the task tracker and project documentation in order, and groomed the backlog. And calmly went off to prepare salads.

19:54 (GMT+3) The first alert about high load on the database (DB). Then more and more.

20:35 (GMT+3) DevOps engineer Maxim wrote in the team chat that the server “was not feeling well” and that he was already looking at the logs, trying to figure out what had happened.

After reading the messages, the project’s QA specialist, Gulya, immediately opened the application and saw that some workouts were not opening or were taking a long time to load.

PM Anna was sitting at the New Year’s table when her phone began vibrating with messages: in addition to congratulations, there was bad news from the team that the application was not working, and the client wrote that an error occurred when trying to publish the new training course.

“The evening is no longer dull” 🙂 New Year’s Eve! For the client it was 11:00 on December 31; the time difference with the team is 8 hours. And the release of the new training course was at risk.

Following the existing process, account manager Natalia notified the company’s management about the incident. Despite the late hour and the holiday, the release had to be saved urgently.

Backend developer Nikolai was walking around St. Petersburg at the time: he had taken December 30 and 31 off. But reading work chats at any time of the day or night, even on vacation, is an addiction of most IT people. He contacted Maxim (DevOps), and from his description suspected several problems at once. Nikolai wrote to the PM and his manager that caching on the DBMS side needed to be fine-tuned.

The team decided to urgently bring in another backend developer, Sergey (can you imagine his face? :)), since Nikolai could not make changes. We also decided to increase server capacity.

21:53 (GMT+3) The message “Application Error” appeared on the application screen. The backend was completely down.

22:46 (GMT+3) Caching was already in production, and hardware capacity had been added. The application came back up. But it was clear that this was only a temporary respite, a way to buy time to find and fix the problem. Caching and the performance increase eased the situation but did not resolve it.

00:00 (GMT+3) Happy New Year!

Yes, each team member celebrated next to a computer or phone, ready to connect at any moment if needed!

01.01.2020

Sergey found the main problem of the application and was already making changes, consulting with Nikolai along the way.

The corrected code went to the test environment. The QA specialist quickly joined, checked it and found a couple of issues. More edits, another test. Then a mini-regression (yes, exactly a mini one: broader than a smoke test, but there was no time for a full one).

02.01.2020

All fixes were in production. We could exhale, but the load on the DBMS still had to be monitored.

03.01.2020

Redis cache issues. The application was unavailable for the client at night; the problem was solved by resetting the cache.

04.01.2020

18:40 (GMT+3) The application was down for 3 minutes. The user cache was reset, and a new option for resetting the cache was added; after that the application was working again.

05.01.2020

Monitoring showed no anomalies. All systems normal. We exhaled. A task was created to implement API 2.0.

09.01.2020

On the first working day after the New Year holidays, PM Anna made a new entry in the incident tracker about problems with the application after the release to production.

The quality service assembled a working group to investigate the incident: it was necessary to conduct a technical analysis of the mistakes made, establish the causes of their occurrence, and develop an algorithm of actions to prevent a similar situation in the future.

25.02.2020

Release of the updated application. Both the backend and the mobile applications (Android, iOS) were improved. The single request for the entire training structure was split into several, according to nesting levels. Training progress was also moved to separate routes.

The team performed load testing using examples of new, more complex training data structures.

Subsequent new features (free training, announcements, multilingual support) were released smoothly, without incidents or difficulties.

Mistakes made and how they could have been avoided

1. Incorrect construction of queries

The PM assumed that users would work out with the application in places with poor connectivity (hello to basement gyms from the children of the 90s :)). The backend developer agreed to the PM’s suggestion to make one big request that downloads all the information about all the workouts, without estimating how much the volume of downloaded data would grow as a result. This simplified the work of the mobile developers but violated REST principles: instead of many GET requests for the different nesting levels of the training structure, there was one request that loaded all levels at once.

In addition to the course structure, the response to this request also contained the user’s training progress, which meant additional queries to the database. On top of that, fetching the training structure itself suffered from the N+1 problem because of lazy associations.

This was not visible on the test data. In production, however, the problem showed itself in all its glory.
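To make the N+1 effect concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is not the project's code (the backend used a different stack); the `course` and `workout` tables and both loader functions are hypothetical and only illustrate how lazy loading of children multiplies queries compared with a single joined query.

```python
import sqlite3


def build_db(courses=3, workouts_per_course=4):
    """Create a tiny two-level hierarchy: courses and their workouts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE workout (id INTEGER PRIMARY KEY, course_id INTEGER, title TEXT)"
    )
    for c in range(1, courses + 1):
        conn.execute("INSERT INTO course (id, title) VALUES (?, ?)", (c, f"course {c}"))
        for w in range(workouts_per_course):
            conn.execute(
                "INSERT INTO workout (course_id, title) VALUES (?, ?)",
                (c, f"workout {c}.{w}"),
            )
    conn.commit()
    return conn


def load_lazy(conn):
    """N+1 pattern: one query for the courses, then one more per course."""
    queries = 0
    result = {}
    rows = conn.execute("SELECT id FROM course ORDER BY id").fetchall()
    queries += 1
    for (course_id,) in rows:
        workouts = conn.execute(
            "SELECT title FROM workout WHERE course_id = ? ORDER BY id", (course_id,)
        ).fetchall()
        queries += 1
        result[course_id] = [title for (title,) in workouts]
    return result, queries


def load_joined(conn):
    """Same data fetched with a single JOIN query."""
    result = {}
    rows = conn.execute(
        "SELECT c.id, w.title FROM course c "
        "LEFT JOIN workout w ON w.course_id = c.id ORDER BY c.id, w.id"
    )
    for course_id, title in rows:
        result.setdefault(course_id, [])
        if title is not None:
            result[course_id].append(title)
    return result, 1
```

With 3 courses the lazy variant issues 4 queries and the joined one issues 1; with every additional nesting level and real data volumes the gap grows quickly, which is why small test data hid the problem.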

How it should have been done:

With hierarchical data models, the response to a request for any model should not contain information about child resources. Each data model must have its own path. The backend developer should have defended his opinion and given arguments. Better still, the team should have worked out the pros and cons of both solutions and the possible risks together, and made the decision based on that information.
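The rule "each data model has its own path, and no response embeds child resources" can be sketched as follows. This is an illustration only: the route names, the three-level hierarchy (course, workout, exercise) and the in-memory data are assumptions, not the project's actual API.

```python
# Hypothetical in-memory data standing in for the database.
COURSES = {1: {"id": 1, "title": "New Year course"}}
WORKOUTS = {
    10: {"id": 10, "course_id": 1, "title": "Warm-up"},
    11: {"id": 11, "course_id": 1, "title": "Strength"},
}
EXERCISES = {100: {"id": 100, "workout_id": 10, "title": "Squats"}}
PROGRESS = {("user-1", 10): {"done": True}}


def get_course(course_id):
    """Return the course itself, without its workouts."""
    return dict(COURSES[course_id])


def list_workouts(course_id):
    """Return the workouts of one course, without their exercises."""
    return [dict(w) for w in WORKOUTS.values() if w["course_id"] == course_id]


def list_exercises(workout_id):
    """Return the exercises of one workout."""
    return [dict(e) for e in EXERCISES.values() if e["workout_id"] == workout_id]


def get_progress(user_id, workout_id):
    """User progress lives on its own route, separate from the structure."""
    return dict(PROGRESS.get((user_id, workout_id), {"done": False}))


# One route per model and nesting level (hypothetical paths).
ROUTES = {
    "GET /courses/{course_id}": get_course,
    "GET /courses/{course_id}/workouts": list_workouts,
    "GET /workouts/{workout_id}/exercises": list_exercises,
    "GET /users/{user_id}/workouts/{workout_id}/progress": get_progress,
}
```

The client pays with more round trips, but each response stays small and predictable, and each level can be cached and load-tested on its own.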

2. The test data did not match the real data

The nested data structure of real workouts turned out to be several times more complex than we had imagined. One large query over such a complex multi-level structure was fatal to the application’s performance.

The sample workouts the customer provided for testing turned out to be much simpler than the real workouts that were later entered into the application.

How it should have been done:

  • Always ask for sample test data. The more, the better. And not just once: if new features are planned, it is important to request the data again. In our case, we should have requested an example of the new training course.

  • Define restrictions on the type, volume and structure of the data. Agree on them with the team and the client.

    If possible, test both simple and complex examples. I understand that most often the time for testing is limited, so here we act without fanaticism.

  • Ask leading questions: “What other changes can there be in the application in the future?”, “Development plans?”, “How will this affect the data?”.
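One way to make the agreed restrictions checkable is to profile every sample the client sends and compare it with the limits. The sketch below is an assumption about how that could look: the `children` key, the function names and the limits are hypothetical and would have to match the real data format.

```python
def profile(node, children_key="children"):
    """Return (depth, node_count) for a nested dict structure."""
    children = node.get(children_key, [])
    if not children:
        return 1, 1
    depth, count = 0, 1
    for child in children:
        child_depth, child_count = profile(child, children_key)
        depth = max(depth, child_depth)
        count += child_count
    return depth + 1, count


def within_limits(structure, max_depth, max_nodes):
    """Check a sample against the limits agreed with the team and client."""
    depth, count = profile(structure)
    return depth <= max_depth and count <= max_nodes
```

Running the same check on the old samples and on an example of each new course shows immediately when real data has outgrown what was tested.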

3. Closing the connection to Redis was not implemented

The connection was never closed explicitly; it closed by itself but hung for a long time. The functionality was implemented following the Symfony documentation and, in fact, in the most straightforward way, without going into the specifics of the framework and Doctrine.
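The general idea of closing a connection explicitly, instead of waiting for it to time out, can be sketched like this. `FakeCacheConnection` is a hypothetical stand-in and not a real Redis client API; the point is only the pattern of releasing the connection in every code path, including errors.

```python
from contextlib import closing


class FakeCacheConnection:
    """Hypothetical stand-in for a cache connection (not a real client)."""

    def __init__(self):
        self.closed = False
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        if self.closed:
            raise RuntimeError("connection is closed")
        return self._store.get(key)

    def close(self):
        self.closed = True


def with_connection(conn_factory, action):
    """Open a connection, run the action, and always close the connection."""
    # closing() calls conn.close() on exit, even if action() raises.
    with closing(conn_factory()) as conn:
        return action(conn)
```

A call such as `with_connection(factory, lambda conn: conn.get("course:1"))` then never leaves a connection hanging, whatever happens inside the action.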

In the first iteration of fixes, we increased server performance and expanded caching: the entire training structure was cached. It helped, but not completely. Any change to the training structure in the admin panel reset the entire cache of the training group, so while the cache was being rebuilt some users could receive errors. Therefore, we added cache warming. After that the application became more stable, and we continued to work on the fixes step by step.
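The difference between resetting a cache entry and warming it can be shown with a small in-memory sketch. The class and method names are hypothetical, and the real implementation on top of Redis would differ; the idea is that warming builds the fresh value first and only then replaces the old one, so readers never hit an empty entry.

```python
class WarmCache:
    """Minimal in-memory cache illustrating reset versus warming."""

    def __init__(self, loader):
        self._loader = loader  # expensive function that builds the value
        self._data = {}

    def get(self, key):
        # On a miss, the reader pays for the expensive rebuild.
        if key not in self._data:
            self._data[key] = self._loader(key)
        return self._data[key]

    def reset(self, key):
        # Old approach: drop the entry; the next reader triggers the rebuild.
        self._data.pop(key, None)

    def warm(self, key):
        # Warming: build the fresh value first, then swap it in.
        # The old value stays readable until the swap.
        fresh = self._loader(key)
        self._data[key] = fresh
        return fresh
```

In this sketch the admin panel would call `warm(group_id)` after a change to the training structure instead of `reset(group_id)`.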

How it should have been done:

This problem arose from the first two and from the emergency fixes. Timely load testing would have let us build the cache handling correctly and expose the first two problems.

What did we end up changing in our processes?

  1. Introduced architectural supervision in every development production line. We review the developed architecture at project milestones defined by the PM.

  2. When testing, we focus on relevant test data. We update examples when developing new features.

  3. On every project, we always analyze whether load testing is needed. We developed a checklist for identifying performance requirements and a template for performance and load requirements.

  4. To ensure the required performance parameters of the product, we carry out minimal load tests even if the client believes they are not needed.

  5. Introduced a moratorium for releases on pre-holiday days and on Fridays.
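The release moratorium from the last item is easy to automate as a pre-release check. The sketch below is one possible interpretation, not the company's actual tooling: the function name is hypothetical, and treating "pre-holiday" as the single day before a holiday is an assumption controlled by `days_before`.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday(): Monday is 0, Friday is 4


def is_release_allowed(day, holidays, days_before=1):
    """Return False on Fridays, on holidays and on pre-holiday days."""
    if day.weekday() == FRIDAY:
        return False
    if day in holidays:
        return False
    # Assumption: "pre-holiday" means within days_before days of a holiday.
    for offset in range(1, days_before + 1):
        if day + timedelta(days=offset) in holidays:
            return False
    return True
```

With `holidays = {date(2020, 1, 1)}`, the check rejects December 31, 2019 (a pre-holiday day) and December 27, 2019 (a Friday), and allows December 24, 2019.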

What can I say in conclusion? When I sat down to write about this case, I even thought: “How did we miss this? We have control at the level of architecture and requirements gathering.” And then I realized: yes, we introduced it right after this incident :)))

Error-free development exists only in an ideal world. The most important thing is to draw conclusions, learn from mistakes and continuously improve the processes of creating software products. All the best!

Thank you for your attention!

We also publish more useful materials for developers and IT managers on our social networks: VC and Telegram.
