Hi everyone!
My name is Nikita, and I lead a team of engineers at Cian. One of my responsibilities at the company is to reduce the number of infrastructure-related incidents in production to zero.
What follows caused us a lot of pain, and the purpose of this article is to keep other people from repeating our mistakes, or at least to minimize their impact.
Once upon a time, when Cian consisted of monoliths and there was no hint of microservices yet, we measured the availability of the resource by checking 3-5 pages.
They respond – everything is fine; they fail to respond for a while – alert. How long they had to be down for it to count as an incident was decided by people at meetings. A team of engineers was always involved in investigating an incident. Once the investigation was complete, they wrote a post-mortem – a kind of report sent out by email in the format: what happened, how long it lasted, what we did in the moment, and what we will do in the future.
The site’s main pages, or how we know we have hit rock bottom
To somehow prioritize errors, we singled out the pages most critical to the site’s business functionality. For them we count the number of successful and unsuccessful requests, as well as timeouts. This is how we measure uptime.
Say we have identified several critically important sections of the site responsible for the core service – search and listing placement. If more than 1% of requests to them fail, that is a critical incident. If the error rate exceeds 0.1% for 15 minutes during prime time, that is also considered a critical incident. These criteria cover most incidents; the rest are beyond the scope of this article.
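The criteria above can be sketched as a small classifier. This is an illustration, not our actual alerting code; the function name and signature are hypothetical, and only the thresholds (1%, and 0.1% sustained for 15 minutes in prime time) come from the text:

```python
def classify(total_requests, failed_requests, sustained_minutes=0, prime_time=False):
    """Classify an incident by the uptime criteria described above.

    total_requests / failed_requests: request counts for a critical section;
    sustained_minutes: how long the elevated error rate has lasted;
    prime_time: whether the window falls into prime time.
    """
    if total_requests == 0:
        return "no-data"
    error_rate = failed_requests / total_requests
    if error_rate > 0.01:
        # more than 1% of requests failed -> critical incident
        return "critical"
    if prime_time and error_rate > 0.001 and sustained_minutes >= 15:
        # >0.1% errors held for 15 minutes in prime time -> also critical
        return "critical"
    return "ok"
```

In practice such a check would sit on top of the per-page request counters mentioned above.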
Cian’s top incidents
So, we have learned to reliably determine that an incident has happened.
Now every incident we have is described in detail and tracked as a Jira epic. By the way, we set up a separate project for this and called it FAIL – only epics can be created in it.
If you gather up all of our fails from the past few years, the leaders are:
- MSSQL-related incidents;
- incidents caused by external factors;
- admin errors.
Let us dwell in more detail on admin mistakes, as well as on a few other interesting fails.
Fifth place – “Putting the DNS in order”
It was a rainy Tuesday. We decided to clean up our DNS cluster.
We wanted to move the internal DNS servers from BIND to PowerDNS, dedicating entirely separate servers to it, running nothing but DNS.
We placed one DNS server in each of our data center locations, and the moment came to migrate the zones from BIND to PowerDNS and switch the infrastructure over to the new servers.
At the height of the migration, of all the servers listed in the local caching BIND on every host, only one remained, and it sat in the St. Petersburg data center. That DC had originally been declared non-critical for us, but it suddenly became a single point of failure.
Right during this migration window, the channel between Moscow and St. Petersburg went down. We were effectively left without DNS for five minutes and came back up only when the hosting provider fixed the problem.
Where we used to neglect external factors while preparing for maintenance, they are now on the list of things we prepare for. We now aim for all components to be redundant at n-2, and for the duration of maintenance we may lower that level to n-1.
- When drawing up an action plan, mark the points where the service could fail, and think through in advance a scenario where everything goes as badly as possible.
- Spread the internal DNS servers across different geolocations / data centers / racks / switches / inputs.
- On each server, install a local caching DNS server that forwards requests to the main DNS servers and, if they are unavailable, answers from its cache.
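A simple sanity check in the spirit of the second recommendation: verify that a host’s configured resolvers actually give some redundancy. This is a hedged sketch, not something from our toolchain; the function name and the “same /24 means same rack/DC” heuristic are assumptions for illustration:

```python
import ipaddress

def check_resolvers(resolv_conf_text, min_servers=2):
    """Parse resolv.conf-style text and report likely redundancy problems:
    too few nameservers, or all of them in one /24 (heuristic for
    "probably the same rack / data center")."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "nameserver":
            servers.append(ipaddress.ip_address(parts[1]))

    problems = []
    if len(servers) < min_servers:
        problems.append(f"only {len(servers)} nameserver(s) configured")

    # Collect the /24 networks of the IPv4 resolvers.
    nets = {ipaddress.ip_network(f"{s}/24", strict=False)
            for s in servers if s.version == 4}
    if len(servers) >= 2 and len(nets) == 1:
        problems.append("all nameservers share one /24 - likely one location")
    return problems
```

Run against each host’s /etc/resolv.conf, a check like this would have flagged the configuration where everything pointed at a single DC.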
Fourth place – “Cleaning up Nginx”
One fine day, our team decided that enough was enough, and the refactoring of the nginx configs began. The main goal was to bring the configs into an intuitive structure. Previously everything had “grown historically” and followed no logic of its own. Now every server_name was moved into a file of the same name, and all the configs were distributed into folders. By the way, the config contains 253,949 lines, or 7,836,520 characters, and takes up almost 7 megabytes. Top-level structure:
│ ├── allow.list
│ └── whitelist.conf
│ ├── exclude.conf
│ └── geo_ip_to_region_id.conf
│ ├── GeoIP.dat
│ ├── GeoIP2-Country.mmdb
│ └── GeoLiteCity.dat
│ ├── error.inc
│ └── proxy.inc
│ ├── bot.conf
│ ├── dynamic
│ └── geo.conf
│ ├── cookie.lua
│ ├── log
│ │ └── log.lua
│ ├── logics
│ │ ├── include.lua
│ │ ├── …
│ │ └── utils.lua
│ └── prom
│ ├── stats.lua
│ └── stats_prometheus.lua
│ ├── access.conf
│ ├── ..
│ └── zones.conf
│ ├── cian.ru
│ │ ├── cian.ru.conf
│ │ ├── …
│ │ └── my.cian.ru.conf
│ ├── …
│ └── status.conf
It became much better, but while renaming and moving the configs, some of them ended up with the wrong extension and were not picked up by the include *.conf directive. As a result, some hosts became unavailable and returned a 301 to the main page. Because the response code was not 5xx/4xx, this went unnoticed until the morning. Afterwards, we started writing tests for infrastructure components.
- Structure your configs properly (not only nginx) and think the structure through at an early stage of the project. That makes them more understandable for the team, which in turn reduces TTM.
- Write tests for some infrastructure components. For example: a check that all key server_names return the correct status and response body. It is enough to have a few scripts at hand that verify a component’s basic functions, so that at 3 a.m. you are not frantically trying to remember what else needs checking.
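The server_name smoke test from the last point could look roughly like this. This is a hedged sketch, not our actual script: the names are hypothetical, and the HTTP call is injected so the logic can be tested without a live nginx (in production `fetch` could be something like `lambda u: urllib.request.urlopen(u).status`):

```python
def smoke_check(expectations, fetch):
    """expectations: {url: expected_status}; fetch(url) -> status code.
    Returns a {url: description} dict of everything that failed."""
    failures = {}
    for url, expected in expectations.items():
        try:
            status = fetch(url)
        except Exception as exc:            # connection refused, DNS, ...
            failures[url] = f"error: {exc}"
            continue
        if status != expected:
            failures[url] = f"got {status}, expected {expected}"
    return failures
```

Injecting `fetch` is a deliberate choice: the same checklist can then run against staging, production, or a stub in tests.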
Third place – “Cassandra suddenly ran out of space”
The data grew steadily, and everything was fine until repairs of the large tables in the Cassandra cluster started failing, because compaction could not process them.
One rainy day the cluster nearly turned into a pumpkin, namely:
- only about 20% of the total cluster space remained free;
- it was impossible to properly add nodes, because cleanup failed after a node was added due to lack of space on the partitions;
- performance dropped slightly, because compaction was not running;
- the cluster was in emergency mode.
The way out: 5 more nodes were added without cleanup, after which we began systematically removing the nodes that had run out of space from the cluster and re-adding them as empty nodes. Far more time was spent on this than we would have liked, and there was a risk of partial or complete unavailability of the cluster.
- On every Cassandra server, no more than 60% of the space on each partition should be used.
- CPU load should stay below 50%.
- Do not neglect capacity planning – think it through for each component, taking its specifics into account.
- The more nodes in the cluster, the better. Servers holding a small amount of data migrate faster, and such a cluster is easier to revive.
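The 60% rule and capacity planning can be turned into arithmetic. A rough sketch, with hypothetical function names, assuming perfectly even data distribution (a simplification real Cassandra clusters only approximate):

```python
import math

def disk_alerts(node_usage, limit=0.60):
    """node_usage: {node: fraction of its partition used}.
    Returns the nodes above the 60% threshold recommended above."""
    return {node: used for node, used in node_usage.items() if used > limit}

def nodes_needed(total_data_tb, node_capacity_tb, limit=0.60, rf=3):
    """Back-of-the-envelope capacity plan: the number of nodes needed to
    keep every node under `limit` disk usage, given `total_data_tb` of
    unique data stored at replication factor `rf`."""
    return math.ceil(total_data_tb * rf / (node_capacity_tb * limit))
```

For example, 10 TB of unique data at RF=3 on 2 TB nodes needs 25 nodes to stay under 60% everywhere; numbers like these make it obvious long in advance when it is time to order hardware.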
Second place – “The data disappeared from the Consul key-value store”
For service discovery we, like many others, use Consul. We also use its key-value store for blue-green deployments of the monolith. It stores information about active and inactive upstreams, which swap places during a deployment. A deployment service was written to interact with the KV store. At some point the data disappeared from the KV. We restored it from memory, but with a number of mistakes. As a result, during deployments the load was distributed unevenly across the upstreams, and we got a lot of 502 errors because the backends were overloaded on CPU. In the end we moved from Consul KV to Postgres, from where data is not so easy to lose.
- Services without any authorization should not hold data critical to the operation of the site. For example, if your Elasticsearch has no authorization, it is better to forbid access at the network level from everywhere it is not needed, allow only what is necessary, and also set action.destructive_requires_name: true.
- Work out the backup and recovery mechanism in advance. For example, write a script beforehand (say, in Python) that can both back up and restore.
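For Consul KV specifically, such a backup/restore script is small. A sketch against Consul’s HTTP API (`GET /v1/kv/<prefix>?recurse=true` returns keys with base64-encoded values; `PUT /v1/kv/<key>` writes one back); the agent address and function names are my assumptions:

```python
import base64
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"   # assumption: local Consul agent

def decode_kv_export(entries):
    """Turn the JSON list from GET /v1/kv/?recurse into {key: bytes}.
    Values arrive base64-encoded; a directory-like key may have null."""
    return {e["Key"]: (base64.b64decode(e["Value"]) if e["Value"] else b"")
            for e in entries}

def backup(prefix=""):
    """Dump every key under `prefix` into a plain dict."""
    url = f"{CONSUL}/v1/kv/{prefix}?recurse=true"
    with urllib.request.urlopen(url) as resp:
        return decode_kv_export(json.load(resp))

def restore(data):
    """Write a {key: bytes} dict back into the KV store."""
    for key, value in data.items():
        req = urllib.request.Request(f"{CONSUL}/v1/kv/{key}",
                                     data=value, method="PUT")
        urllib.request.urlopen(req)
```

Dumping the result of `backup()` to a JSON file on a schedule is enough to turn “the KV disappeared” from an incident into a minor annoyance.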
First place – “Captain Non-obviousness”
At some point we noticed an uneven load distribution across nginx upstreams whenever a backend had 10+ servers. Because round-robin sent requests to the upstreams in order, from the first to the last, and every nginx reload started over from the beginning, the first upstream always received more requests than the rest. As a result it ran more slowly, and the whole site suffered. This became more noticeable as traffic grew. Simply updating nginx to enable random did not work: we would have had to rework a pile of Lua code that did not run on version 1.15 (at the time). We had to patch our nginx 1.14.2 and add random support to it ourselves. That solved the problem. This bug wins the “Captain Non-obviousness” nomination.
Investigating this bug was very interesting and exciting.
- Set up monitoring so that it helps you find such fluctuations quickly. For example, you can use ELK to watch the rps on each backend of every upstream and to monitor their response time from nginx’s point of view. In our case, this is exactly what helped us identify the problem.
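The bias itself is easy to demonstrate with a toy simulation. This is not nginx code, just an illustrative model of the mechanism described above: every reload resets the round-robin pointer to the first upstream, so whenever the request count between reloads is not a multiple of the upstream count, the first servers collect the remainder every single time:

```python
import random

def simulate_round_robin(n_upstreams, reloads, requests_per_reload):
    """Round-robin whose pointer resets to upstream 0 on every reload,
    mimicking nginx restarting its rotation after each config reload."""
    counts = [0] * n_upstreams
    for _ in range(reloads):
        for i in range(requests_per_reload):
            counts[i % n_upstreams] += 1
    return counts

def simulate_random(n_upstreams, reloads, requests_per_reload, seed=42):
    """Random balancing for comparison: reloads do not matter at all."""
    rng = random.Random(seed)
    counts = [0] * n_upstreams
    for _ in range(reloads * requests_per_reload):
        counts[rng.randrange(n_upstreams)] += 1
    return counts
```

With 12 upstreams, 500 reloads and 100 requests between reloads, round-robin gives the first four upstreams 4500 requests each and the rest only 4000 – a permanent ~12% skew that random balancing does not have, and exactly the kind of fluctuation per-backend rps monitoring makes visible.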
In the end, most of these fails could have been avoided with a more scrupulous approach to whatever you are doing. And we must always remember Murphy’s law:
Anything that can go wrong will go wrong,
and build components with it in mind.