Verify that your application is ready to operate in the real, unreliable world. Part 1
Vitaly Likhachev, SRE at booking.com and speaker of the Slurm course "Golang developer", shares his experience. He talks about what you should think about before rolling out a service to hardcore production, where it may buckle under the load or degrade due to sudden surges during an influx of users or in the evening peak.
Treat this as a checklist of sorts, but don't apply every point as is, because every system is unique and sometimes it is perfectly acceptable to build a less reliable system in order to significantly reduce development, support and operation costs (for example, by skipping redundancy). Backups, however, are non-negotiable 🙂
Some terms are left untranslated, not out of the author's laziness, but because they are firmly established in the literature.
An attentive reader may assign individual points to several top-level sections at once, so the division into subgroups is somewhat arbitrary.
If you wonder why some topic X is not covered in the checklist, the answer is simple: the article is already huge, and you will most likely still find something useful for yourself.
The article consists of 5 parts, which will be published in turn:
1. Reliability.
2. Scalability/fault tolerance.
3. Resiliency.
4. Security. Development process. Roll-out process.
5. Observability. Architecture. Antipatterns.
Reliability
Automated capacity planning
This is the process of planning the resources that will be required for stable operation of the application in the future. Automating this process could involve, for example, application autoscaling in k8s.
How does this work in the wild?
For example, your application has 10 replicas, across which traffic is distributed (random/round-robin/etc.), with 5 replicas located in one availability zone and 5 in another. What happens if one AZ becomes unavailable? Capacity planning answers exactly this question. Without turning off any service replicas, an automated traffic management tool (almost certainly some kind of service mesh with an envoy proxy) starts changing the weight of one of the replicas on the fly, so that gradually, instead of 10% of the total traffic, that replica starts processing 12-15-20-…-N% of the traffic.
At the same time, the automation tool monitors the error rate and stops increasing the replica's weight once the number of errors exceeds a certain threshold, usually quite low (1%, 0.1%). As soon as the threshold is exceeded, the replica's weight is reduced back to 10% and a report is generated (or metrics are checked in Grafana, for example); this is how we learn how much traffic a single service replica can handle with the given resources. We are not talking here about the databases/caches, etc. that sit behind the service, because their survival in the event of an AZ failure is a separate, extensive topic.
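To make this more concrete, here is a minimal sketch of such an automated experiment. Everything in it is assumed: setReplicaWeight and errorRate are hypothetical hooks standing in for a service-mesh control plane (e.g. Envoy/Istio) and a metrics backend (e.g. Prometheus); real setups drive this logic through the mesh itself rather than through custom code.

package main

import (
	"fmt"
	"time"
)

// setReplicaWeight and errorRate are hypothetical hooks: in a real setup they
// would talk to a service-mesh control plane and a metrics backend.
func setReplicaWeight(replica string, percent int) {
	fmt.Printf("weight of %s -> %d%%\n", replica, percent)
}

func errorRate(replica string) float64 {
	return 0.0005 // stub: pretend the replica currently returns 0.05% errors
}

func main() {
	const (
		replica        = "svc-replica-1"
		baseWeight     = 10    // normal share of traffic, %
		step           = 2     // increase per iteration, %
		maxWeight      = 50    // safety ceiling for the experiment, %
		errorThreshold = 0.001 // stop if errors exceed 0.1%
	)

	weight := baseWeight
	for weight < maxWeight {
		weight += step
		setReplicaWeight(replica, weight)
		time.Sleep(30 * time.Second) // let metrics settle before checking errors

		if errorRate(replica) > errorThreshold {
			fmt.Printf("error threshold exceeded at %d%%, rolling back\n", weight)
			setReplicaWeight(replica, baseWeight)
			return
		}
	}
	fmt.Printf("replica sustained %d%% of traffic within the error threshold\n", weight)
}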
Known bottlenecks
Identifying and eliminating bottlenecks is an important step to ensure reliability. Bottlenecks can occur at the processor, individual processor core, memory, network, or I/O levels. To identify them, performance profiling and monitoring using tools such as Prometheus or Grafana are usually used. This is a very multifaceted topic that can greatly depend on the tools used.
Specifically for Golang, the following options can be offered:
Profiling using the built-in pprof tool. It allows you to collect data on application performance: processor load, memory consumption, goroutine execution time and other indicators. It can help identify bottlenecks at the processor (CPU) level or inefficient memory usage.
How does this work: profiling handlers are added to the code and can be called to generate reports. This data is analyzed to find goroutines or operations that are consuming excessive amounts of resources.
For example
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	// Start a goroutine that keeps allocating memory.
	go leakyFunction()

	// Give it some time to allocate before taking the heap snapshot.
	time.Sleep(time.Millisecond * 100)

	f, err := os.Create("/tmp/profile.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Run GC so the heap profile reflects live objects.
	runtime.GC()
	fmt.Println("write heap profile")
	if err := pprof.WriteHeapProfile(f); err != nil {
		panic(err)
	}
}

// leakyFunction endlessly grows a slice, simulating a memory leak.
func leakyFunction() {
	s := make([]string, 3)
	for i := 0; i < 1000000000; i++ {
		s = append(s, "magical pprof time")
	}
}
Now let's check the profile:
go tool pprof /tmp/profile.pb.gz
File: main
Type: inuse_space
Time: Oct 23, 2024 at 3:45pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 74.47MB, 100% of 74.47MB total
flat flat% sum% cum cum%
74.47MB 100% 100% 74.47MB 100% main.leakyFunction
We can see the culprit of the memory usage: leakyFunction. The example is simplified, but in real, large programs you can likewise figure out where memory is leaking, where CPU consumption is high, and so on. This helps narrow down the areas that need further profiling.
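For a long-running service it is often more convenient not to write profiles by hand but to expose the standard net/http/pprof handlers and pull profiles on demand. A minimal sketch (the localhost:6060 address is just an illustration):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiling endpoints only on an internal interface/port.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}

Profiles can then be fetched with, for example, go tool pprof http://localhost:6060/debug/pprof/heap.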
Known hard dependencies
What if your application is not viable without a database connection? Is it possible to make it more reliable?
Just one example: you know in advance that a small subset of data is requested from the database most often, and this data can be cached in-memory in each service replica. This, of course, needs to be done carefully, with knowledge of the specifics of the data and solid evidence that this particular data really is accessed with high frequency. You also need to think about cache invalidation and about preventing OOM, i.e. limit the cache size. But in general, this is realistic for certain types of data.
Imagine you have an application that provides current prices of goods for a listing in the catalog of a large online store. There will, of course, be current promotions, top products and other similarly high-frequency items. To turn a hard dependency into a partially soft one, it is important to understand how to identify the high-frequency data and how to invalidate the cache. This reduces the dependence on the database and, as a bonus, reduces the load on it.
At the purchase/payment stage you will still have to bypass the cache so as not to accidentally deceive the user. And on this slippery slope you need to think carefully about how to cache different kinds of data and how to invalidate the cache.
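As an illustration, here is a minimal sketch of such an in-memory cache with a TTL and a hard size limit. All names (PriceCache, loadFromDB, etc.) are made up for the example; a production cache would more likely use an LRU/LFU eviction policy or an existing library.

package cache

import (
	"sync"
	"time"
)

// priceEntry is a cached value with its expiration time.
type priceEntry struct {
	price     int64
	expiresAt time.Time
}

// PriceCache is a tiny TTL cache with a hard size limit to avoid OOM.
// Eviction here is simplistic (drop an arbitrary entry) to keep the sketch short.
type PriceCache struct {
	mu      sync.RWMutex
	entries map[string]priceEntry
	ttl     time.Duration
	maxSize int
}

func NewPriceCache(ttl time.Duration, maxSize int) *PriceCache {
	return &PriceCache{entries: make(map[string]priceEntry), ttl: ttl, maxSize: maxSize}
}

func (c *PriceCache) Get(sku string) (int64, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[sku]
	if !ok || time.Now().After(e.expiresAt) {
		return 0, false
	}
	return e.price, true
}

func (c *PriceCache) Set(sku string, price int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.entries) >= c.maxSize {
		for k := range c.entries { // evict an arbitrary entry to respect the size limit
			delete(c.entries, k)
			break
		}
	}
	c.entries[sku] = priceEntry{price: price, expiresAt: time.Now().Add(c.ttl)}
}

// GetPrice tries the cache first and falls back to the database;
// loadFromDB is a hypothetical loader for the underlying storage.
func GetPrice(c *PriceCache, sku string, loadFromDB func(string) (int64, error)) (int64, error) {
	if price, ok := c.Get(sku); ok {
		return price, nil
	}
	price, err := loadFromDB(sku)
	if err != nil {
		return 0, err
	}
	c.Set(sku, price)
	return price, nil
}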
Known soft dependencies
Compared to hard dependencies, everything here is simpler.
A dependency is soft if a certain business process can be performed without it and without a significant impact on the user experience.
For example, what if we have a product card but were unable to load the product rating from the ratings service? We can still render the card, simply hiding the rating or showing some kind of fallback in its place.
Here an attentive reader may ask why the rating is stored separately from the product. A valid question: in large systems, databases are often split depending on load patterns, and ratings can be a large separate subsystem of their own, with analytics, recommendations and so on (a whole separate world), so even such a seemingly small part of the system ends up in separate services with separate databases.
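A minimal sketch of treating the ratings service as a soft dependency: a short timeout plus a fallback rendering. fetchRating and the timings are hypothetical stand-ins for a real service call.

package main

import (
	"context"
	"fmt"
	"time"
)

// Rating and fetchRating are hypothetical: fetchRating stands in for a call
// to a separate ratings service, here simulated as slow.
type Rating struct {
	Stars float64
	Votes int
}

func fetchRating(ctx context.Context, productID string) (Rating, error) {
	select {
	case <-time.After(500 * time.Millisecond): // pretend the service is slow
		return Rating{Stars: 4.7, Votes: 1234}, nil
	case <-ctx.Done():
		return Rating{}, ctx.Err()
	}
}

// renderProductCard treats the ratings service as a soft dependency:
// a short timeout plus a fallback keeps the card usable without the rating.
func renderProductCard(ctx context.Context, productID string) string {
	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()

	rating, err := fetchRating(ctx, productID)
	if err != nil {
		// Degrade gracefully: hide the rating instead of failing the whole page.
		return fmt.Sprintf("product %s (rating temporarily unavailable)", productID)
	}
	return fmt.Sprintf("product %s, %.1f stars (%d votes)", productID, rating.Stars, rating.Votes)
}

func main() {
	fmt.Println(renderProductCard(context.Background(), "sku-42"))
}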
Known traffic patterns
Understanding traffic patterns allows you to better prepare your application for peak loads. For example, if the business logic implies sudden load increases of 10x the average (a typical story for sales), then we use load tests (see below) to obtain application performance metrics and to identify bottlenecks.
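For illustration, here is a minimal hand-rolled load-generation sketch; the target URL, concurrency and request counts are arbitrary, and in practice dedicated tools (k6, vegeta, Gatling and the like) are usually used instead.

package main

import (
	"fmt"
	"net/http"
	"sort"
	"sync"
	"time"
)

func main() {
	const (
		target      = "http://localhost:8080/healthz" // hypothetical endpoint
		concurrency = 20
		requests    = 200 // per worker
	)

	var (
		mu        sync.Mutex
		latencies []time.Duration
		failures  int
	)

	var wg sync.WaitGroup
	for w := 0; w < concurrency; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < requests; i++ {
				start := time.Now()
				resp, err := http.Get(target)
				elapsed := time.Since(start)

				mu.Lock()
				if err != nil || resp.StatusCode >= 500 {
					failures++
				} else {
					latencies = append(latencies, elapsed)
				}
				mu.Unlock()

				if err == nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	if len(latencies) > 0 {
		p99 := latencies[len(latencies)*99/100]
		fmt.Printf("ok=%d failed=%d p99=%v\n", len(latencies), failures, p99)
	} else {
		fmt.Printf("all %d requests failed\n", failures)
	}
}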
SLI/SLO/SLA
In simple words:
SLI – specific metrics showing the ratio of successful/all requests (for example, http_200_count/(http_all_count)).
SLO – internal agreement on error limits. Typically more stringent than SLA. For example, 99.9% of all requests should be executed correctly.
SLA – an external agreement, often with a legal component, that establishes sanctions against the company if the SLA is violated. Usually this is some kind of refund of the money paid for the downtime, or discounts on resources proportional to the downtime. The topic is complex and non-trivial, down to intricate rules for calculating downtime that may not match how the downtime felt to users.
Defining these is a mandatory requirement for services of medium/high criticality in most (and maybe even all) bigtech companies. If, when the service degrades (the error rate grows, latency increases), no important user scenarios suffer enough to affect the overall perception of the application, then the service is most likely low-criticality, and for it defining SLI/SLO is not that important, or a fairly relaxed SLO like 95% will do.
However, even for important services a certain background of errors/unavailability is usually acceptable, within a fairly small margin. Classic values for highly critical services start at a minimum of 99.9% or 99.95%.
And here it is important to distinguish between the error background and unavailability.
If the service is unavailable for 1 minute and this fits within the SLO for the given interval (week, month), nothing is done about it. However, if the service is almost always available but the background of errors creates problems for users, the concept of an error budget is introduced: what is counted is not unavailability as such, but how many errors we returned relative to the total number of requests over the agreed time interval. Thus, even for an "always" available application, if the error budget burns down too fast, measures are taken to reduce the errors.
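A quick back-of-the-envelope illustration with made-up numbers: with an SLO of 99.9% and 100 million requests in a 30-day window, the error budget is 0.1% of requests, i.e. 100,000 failed requests. A tiny sketch of tracking its consumption:

package main

import "fmt"

func main() {
	const (
		slo           = 0.999       // 99.9% of requests must succeed
		totalRequests = 100_000_000 // requests in the SLO window (e.g. 30 days)
		failed        = 42_000      // failed requests observed so far
	)

	errorBudget := float64(totalRequests) * (1 - slo) // allowed failures in the window
	consumed := float64(failed) / errorBudget * 100

	fmt.Printf("error budget: %.0f requests\n", errorBudget) // 100000
	fmt.Printf("budget consumed: %.1f%%\n", consumed)        // 42.0%
}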
In the next part we will look at scalability and fault tolerance.
You can improve your Go development skills and put together a full-fledged service for your portfolio in the course “Golang developer”