How to update without downtime or stress
Preparation
When planning Rolling Updates, keep in mind that a Rolling Update is a gradual update of a system or application that never takes the whole service offline.
A successful Rolling Update requires robust infrastructure and a monitoring system. First of all, you need an orchestrator that can run several instances in parallel and replace them step by step, such as Kubernetes.
Monitoring is not optional; it is the foundation. You need to track key metrics throughout the update.
Example configuration for Kubernetes:
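apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  selector:            # required in apps/v1; must match the template labels
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app-container
          image: my-app:v1   # an explicit tag: changing it is what triggers a rollout (re-applying :latest does not)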
Here the strategy is set to RollingUpdate with maxUnavailable: 1 (at most one replica may be unavailable during the update) and maxSurge: 2 (at most two extra pods may run above the desired replica count).
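With replicas: 4, these two values bound every intermediate state of the rollout. The Python sketch below computes those bounds; the function name rollout_bounds is our own illustration, not part of any Kubernetes API, and it assumes both parameters are given as absolute pod counts.

```python
def rollout_bounds(replicas: int, max_unavailable: int, max_surge: int) -> tuple:
    """Return (minimum available pods, maximum total pods) during a rollout.

    Assumes max_unavailable and max_surge are absolute pod counts,
    as in the manifest above, not percentages.
    """
    min_available = replicas - max_unavailable  # never fewer available pods than this
    max_total = replicas + max_surge            # never more pods in total than this
    return min_available, max_total


# Values from the manifest: replicas: 4, maxUnavailable: 1, maxSurge: 2
print(rollout_bounds(4, 1, 2))
```

For the manifest above this gives at least 3 available pods and at most 6 pods running at any moment of the update.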
Experienced SRE teams plan batches carefully, based on system requirements and current load. This is where the Service Level Objective (SLO) comes in: it defines acceptable levels of availability and performance, and therefore how much disruption an update may cause.
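An availability SLO can be turned into a concrete amount of tolerable unavailability per period, which helps judge how aggressive a batch can be. A minimal sketch, assuming a 30-day window (adjust it to the window your SLO is actually defined over); the function name is illustrative:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed unavailability, in minutes, for an availability SLO over a window.

    slo is a fraction (0.999 means 99.9%). The 30-day default window is an
    assumption, not a standard.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.1f} minutes per 30 days")
```

For example, 99.9% over 30 days leaves about 43.2 minutes; an update that risks more interruption than that needs smaller batches.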
A common rule of thumb is to update no more than 25% of the system at a time; Kubernetes itself defaults both maxUnavailable and maxSurge to 25%. However, the right value depends on the infrastructure. In a microservice architecture, for example, it can be higher if each service is isolated and an update has little effect on the rest of the system.
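When batches are planned by hand (for example, for servers outside Kubernetes), the 25% rule can be applied mechanically. A minimal Python sketch, assuming the fleet is a plain list of node names (the names here are hypothetical) and that batch size is rounded down so a batch never exceeds the limit:

```python
def plan_batches(nodes: list, max_percent: int = 25) -> list:
    """Split nodes into consecutive batches of at most max_percent of the fleet.

    Batch size is rounded down, but never below one node, so small
    fleets still make progress.
    """
    if not 1 <= max_percent <= 100:
        raise ValueError("max_percent must be between 1 and 100")
    batch_size = max(1, len(nodes) * max_percent // 100)  # integer math, no float rounding
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


nodes = [f"node-{i}" for i in range(1, 11)]  # hypothetical node names
for number, batch in enumerate(plan_batches(nodes), start=1):
    print(number, batch)
```

With 10 nodes, 25% rounds down to 2, so the fleet is updated in 5 batches of 2 nodes.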
Before starting the update, test the batches on isolated servers or test environments. This way, you can predict the behavior of the real system and adjust the update parameters.
Properly configured monitoring is the key to a successful Rolling Update. Metrics worth tracking:
Downtime: how long the service is interrupted. Ideally it stays close to zero; small increases may be acceptable if the update requires replacing entire nodes.
Percentage of updated nodes: the current progress of the deployment.
System performance: CPU, RAM, and disk load. It is especially important to watch for abnormal spikes during the update.
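Deciding what counts as an "abnormal spike" needs a baseline. One simple approach, shown here as a sketch rather than a recommended alerting setup, flags a sample that exceeds the pre-rollout mean by more than three standard deviations; the threshold k = 3 and the CPU samples are illustrative assumptions.

```python
from statistics import mean, stdev


def is_spike(baseline: list, current: float, k: float = 3.0) -> bool:
    """Flag current as abnormal if it exceeds the baseline mean + k standard deviations.

    baseline: recent samples of one metric (e.g. CPU %) taken before the rollout.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    return current > mean(baseline) + k * stdev(baseline)


cpu_before = [41.0, 43.5, 40.2, 42.8, 44.1]  # invented sample values, percent
print(is_spike(cpu_before, 45.0), is_spike(cpu_before, 78.0))
```

A fixed upper limit (for example, "alert above 90% CPU") is simpler but misses spikes on services that normally run cold; comparing against the service's own baseline catches both.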
Implementing and Managing Rolling Updates in Kubernetes
An example of a simple Rolling Updates configuration in Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app-container
          image: my-app:v1
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
Rolling Updates are controlled by two parameters, maxUnavailable and maxSurge. The first sets how many pods may be unavailable during the update; the second sets how many pods may be created above the desired replica count, so that new pods can start before old ones are deleted.
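To see how the two limits interact, here is a deliberately simplified Python model of a rollout. It is not the Deployment controller's code: it assumes every new pod becomes ready immediately and that each step first scales up the new version and then scales down the old one, so it shows the bounds rather than real timing.

```python
def simulate_rollout(replicas: int, max_surge: int, max_unavailable: int) -> list:
    """Simplified rolling update model; returns (old_pods, new_pods) after each step.

    Assumption: new pods are ready instantly, so every pod counts as available.
    """
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    steps = []
    while old > 0 or new < replicas:
        # Scale up: add new pods without exceeding replicas + maxSurge in total.
        new += min(replicas - new, replicas + max_surge - (old + new))
        # Scale down: remove old pods while keeping replicas - maxUnavailable available.
        old -= min(old, (old + new) - (replicas - max_unavailable))
        steps.append((old, new))
    return steps


for old, new in simulate_rollout(replicas=4, max_surge=1, max_unavailable=1):
    print(f"old={old} new={new}")
```

For the manifest above (4 replicas, maxSurge: 1, maxUnavailable: 1) the model goes from 4 old pods to 2 old + 1 new, then 0 old + 3 new, then 4 new pods, never dropping below 3 available pods. In a real cluster each step also waits for new pods to pass their readiness checks.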
To start the update, simply change the container version using the following command:
kubectl set image deployment/my-app my-app-container=my-app:v2
Kubernetes then starts the rollout right away, gradually replacing pods running the old version with new ones.
You can monitor the process using the command:
kubectl rollout status deployment/my-app
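kubectl rollout status decides when a rollout has finished by comparing fields in the Deployment's status. The Python sketch below roughly follows those checks as we understand them; the function name and messages are our own, not kubectl's exact output, and it expects the parsed JSON of a Deployment object.

```python
def rollout_status(deployment: dict) -> tuple:
    """Return (message, done) for a Deployment object parsed from JSON.

    Fields with zero values may be omitted from the API response,
    so missing counters default to 0.
    """
    meta = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    # The controller must have seen the latest spec before the counters mean anything.
    if meta.get("generation", 0) > status.get("observedGeneration", 0):
        return "waiting for the new spec to be observed", False
    for cond in status.get("conditions", []):
        if cond.get("type") == "Progressing" and cond.get("reason") == "ProgressDeadlineExceeded":
            raise RuntimeError("rollout exceeded its progress deadline")
    desired = spec.get("replicas", 1)
    updated = status.get("updatedReplicas", 0)
    if updated < desired:
        return f"{updated} of {desired} new replicas updated", False
    total = status.get("replicas", 0)
    if total > updated:
        return f"{total - updated} old replica(s) pending termination", False
    available = status.get("availableReplicas", 0)
    if available < updated:
        return f"{available} of {updated} updated replicas available", False
    return "successfully rolled out", True
```

For example, it can be fed json.loads() of the output of kubectl get deployment my-app -o json.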
If something goes wrong, you can quickly roll back the update:
kubectl rollout undo deployment/my-app
Kubernetes checks whether a container is ready to receive traffic with readiness probes and whether it is still alive with liveness probes. These checks ensure that updated containers get traffic only once they are ready and that the application stays stable. An example configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-health-app
spec:
  replicas: 4
  selector:            # required in apps/v1; must match the template labels
    matchLabels:
      app: my-health-app
  template:
    metadata:
      labels:
        app: my-health-app
    spec:
      containers:
        - name: my-health-container
          image: my-health-app:v1
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
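Here readinessProbe decides whether a container receives traffic, and livenessProbe checks that it keeps running correctly. A pod that fails its readiness probe is removed from Service endpoints, and a container that fails its liveness probe is restarted. During a Rolling Update, a new pod counts as available only after it passes the readiness probe, so broken pods do not replace healthy ones.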
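On the application side, the probes above expect an HTTP endpoint at /healthz on port 8080. The Python sketch below shows one way a hypothetical app could serve it with only the standard library: it answers 503 until the app marks itself ready, then 200. The HealthHandler and serve names are illustrative.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with 200 once the app is ready, 503 before that."""

    ready = threading.Event()  # set by the app after startup work is finished

    def do_GET(self):
        if self.path == "/healthz":
            ok = self.ready.is_set()
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"ok" if ok else b"starting")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of the logs


def serve(port: int = 8080) -> HTTPServer:
    """Start the health server in a background daemon thread."""
    server = HTTPServer(("", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In such an app, serve() would run at startup and HealthHandler.ready.set() once configuration and connections are in place. Because the manifest points both probes at /healthz, a startup longer than the liveness delay would also trigger restarts; separate endpoints or a startupProbe avoid that.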
Larger systems need more flexibility, so Kubernetes also accepts percentage values for maxUnavailable and maxSurge. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-advanced-app
spec:
  replicas: 10
  selector:            # required in apps/v1; must match the template labels
    matchLabels:
      app: my-advanced-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: "20%"  # Scales with the replica count, useful for many pods
      maxSurge: "50%"        # Speeds up the rollout by allowing more new pods at once
  template:
    metadata:
      labels:
        app: my-advanced-app
    spec:
      containers:
        - name: my-advanced-app-container
          image: my-advanced-app:v2
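Percentages are resolved against the replica count: with 10 replicas, maxUnavailable: "20%" allows 2 unavailable pods and maxSurge: "50%" allows 5 extra pods. Because the values scale with the number of replicas, the same settings keep working as the application grows.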
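Kubernetes converts percentages into pod counts by rounding maxSurge up and maxUnavailable down; if both end up as zero, the Deployment controller falls back to one unavailable pod so the rollout can still progress. The Python sketch below mirrors that behaviour as we understand it; the function names are illustrative, not a Kubernetes API.

```python
from typing import Tuple, Union


def resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Convert an absolute count or an "N%" string into a number of pods."""
    if isinstance(value, int):
        return value
    percent = int(value.strip().rstrip("%"))
    scaled = replicas * percent  # integer math avoids float rounding errors
    return -(-scaled // 100) if round_up else scaled // 100


def resolve_fenceposts(max_surge: Union[int, str], max_unavailable: Union[int, str],
                       replicas: int) -> Tuple[int, int]:
    """Return (surge, unavailable) pod counts for the given replica count."""
    surge = resolve(max_surge, replicas, round_up=True)               # maxSurge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # maxUnavailable rounds down
    if surge == 0 and unavailable == 0:
        # Both resolved to zero: allow one unavailable pod so the rollout can progress.
        unavailable = 1
    return surge, unavailable


print(resolve_fenceposts("50%", "20%", 10))  # manifest above: (5, 2)
print(resolve_fenceposts("25%", "25%", 10))  # Kubernetes defaults: (3, 2)
```

The asymmetric rounding errs on the side of availability: extra capacity is rounded up, while the number of pods allowed to go missing is rounded down.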
During a Rolling Update, unexpected situations may arise, such as containers failing readiness checks or performance problems on individual pods. In such cases, roll back the update:
kubectl rollout undo deployment/my-app
This returns the Deployment to its previous revision. To go back further, list the revisions with kubectl rollout history deployment/my-app and pass the one you need with --to-revision.
Rolling Updates can be integrated with CI/CD tools to automate and speed up the deployment process.
Example of pipeline configuration with Jenkins:
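pipeline {
    agent any
    environment {
        // Placeholder registry address: replace with your own
        IMAGE = 'registry.example.com/my-app:v2'
    }
    stages {
        stage('Build') {
            steps {
                sh 'docker build -t $IMAGE .'
            }
        }
        stage('Push') {
            steps {
                // The cluster pulls images from a registry, so push before deploying
                sh 'docker push $IMAGE'
            }
        }
        stage('Deploy') {
            steps {
                sh 'kubectl set image deployment/my-app my-app-container=$IMAGE'
            }
        }
        stage('Monitor') {
            steps {
                script {
                    // --timeout makes kubectl exit non-zero if the rollout is not done in 10 minutes,
                    // so the rollback below still runs (a Jenkins timeout would abort the step instead)
                    def status = sh(script: 'kubectl rollout status deployment/my-app --timeout=10m', returnStatus: true)
                    if (status != 0) {
                        // Return to the previous revision before failing the build
                        sh 'kubectl rollout undo deployment/my-app'
                        error "Rolling Update failed!"
                    }
                }
            }
        }
    }
}
A simple Jenkins pipeline that builds and pushes a Docker image, deploys it to Kubernetes, and watches the rollout, rolling back automatically if it fails or does not finish within 10 minutes. The registry address in IMAGE is a placeholder.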