Delaying pod shutdown during deletion

Kubernetes pod shutdown delay

This is the third part of our series (translator's note: link to the first article) on achieving zero downtime when updating a Kubernetes cluster. In the second part, we reduced the downtime caused by forcibly terminating applications running in pods by shutting them down gracefully with lifecycle hooks. However, we also learned that a pod can continue to receive traffic after the application inside it has started shutting down. That is, a client may receive an error because its request is routed to a pod that can no longer serve traffic. Ideally, we would like pods to stop accepting traffic as soon as eviction begins. To reduce the risk of downtime, we first need to understand why this happens.

Most of the information in this post is taken from Marko Lukša's book "Kubernetes in Action"; you can find an excerpt from the relevant section here. Beyond the material covered in this post, the book provides an excellent overview of best practices for running applications in Kubernetes, and we highly recommend reading it.

The pod shutdown process

In the previous post, we described the lifecycle of pod eviction. As you may remember, the first step in the eviction process is deleting the pod, which triggers a chain of events that leads to the pod's removal from the system. However, we did not discuss how the pod stops being an endpoint for the Service and stops accepting traffic.

So what causes the pod to be removed from the Service? To understand this, let's take a closer look at what happens when a pod is removed from the cluster.

When a pod is deleted from the cluster through the API, all that happens at first is that it is marked in its metadata on the API server as scheduled for deletion. This sends a pod deletion notification to all relevant subsystems, which then process it:

  • The kubelet starts the shutdown process described in the previous post.
  • The kube-proxy daemon removes the pod's IP address from iptables on all nodes.
  • The endpoints controller removes the pod from the list of valid endpoints, which in turn removes the pod from the Service.

You do not need to know each of these systems in detail. The key point is that several systems on different nodes are involved, and they process the deletion in parallel. Because of this, it is quite likely that the pod will run its preStop hook and receive the TERM signal well before it has been removed from all the active lists. As a result, the pod continues to receive traffic even after the shutdown process has started.
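You can observe this race directly on a test cluster. A minimal sketch, assuming kubectl access; the Service and pod names here are hypothetical, not from the original config:

```shell
# In one terminal, watch the Service's endpoints list.
kubectl get endpoints nginx-service --watch

# In another terminal, delete a pod without waiting. The delete call
# only marks the object: its deletionTimestamp is set while the pod
# object still exists and the subsystems process the event.
kubectl delete pod nginx-deployment-abc123 --wait=false
kubectl get pod nginx-deployment-abc123 \
  -o jsonpath='{.metadata.deletionTimestamp}'

# Depending on timing, the pod's IP may disappear from the endpoints
# watch output before or after the container receives TERM.
```

Because this runs against a live cluster, the exact interleaving differs from run to run, which is precisely the problem described above.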

Mitigating the problem

At first glance, it may seem that we should make the pod wait to shut down until it has been removed from all lists in all relevant subsystems. However, in practice this is difficult because of the distributed nature of Kubernetes. What happens if connectivity to one of the nodes is lost? Do you wait forever for the changes to propagate? What if the node comes back online? What if you have thousands of nodes? Tens of thousands?

Unfortunately, there is no perfect solution that reduces downtime all the way to zero. What we can do is introduce a delay into the shutdown process that covers 99% of the problem cases. To do this, we add a sleep to the preStop hook, which delays the shutdown sequence. Let's see how this works with an example.

We need to update our config to add the delay as part of the preStop hook. In "Kubernetes in Action", Lukša recommends 5–10 seconds, so here we will use 5 seconds:

     lifecycle:
       preStop:
         exec:
           command: [
             "sh", "-c",
             # Introduce a delay to the shutdown sequence to wait for the
             # pod eviction event to propagate. Then, gracefully shutdown
             # nginx.
             "sleep 5 && /usr/sbin/nginx -s quit",
           ]
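One caveat worth knowing: the sleep counts against the pod's termination grace period, which defaults to 30 seconds. If you choose a longer delay, raise the grace period accordingly so the application still has time to finish shutting down before Kubernetes force-kills the container. A sketch of the relevant pod-spec field (the value shown is the default, for illustration):

```yaml
spec:
  # Total time allowed for the preStop hook plus graceful shutdown;
  # when it expires, the container is killed with SIGKILL.
  # Default is 30 seconds.
  terminationGracePeriodSeconds: 30
```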

Now let's look at how the shutdown process plays out in our example. As in the previous post, we run kubectl drain, which evicts the pods from the node. This sends a deletion event that notifies both the kubelet and the Endpoints controller (which manages the Service's endpoints) at the same time. Here we assume that the preStop hook starts before the controller removes the pod.

Draining the node deletes the pod, which in turn emits a deletion event

From the moment the preStop hook starts, the shutdown process is delayed by five seconds. During this time, the Endpoints controller removes the pod:

The pod is removed by the Endpoints controller while the shutdown delay is in progress

Note that during the delay the pod keeps running, so if it accepts a connection, it can still handle it. Moreover, once the controller has removed the pod from the endpoints list, new client connections are no longer routed to the shutting-down pod. So in this scenario, provided the controller processes the event within the delay window, there is no downtime at all.

Finally, to complete the picture, the preStop hook finishes its sleep and shuts down Nginx gracefully, after which the pod is removed from the node:

From this point on, you can safely make any changes to Node 1, including rebooting it to load a new kernel version. You can also shut the node down entirely if you have already launched a replacement node that can host the running applications.
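The node maintenance itself can be sketched with standard kubectl commands (the node name is illustrative):

```shell
# Evict all pods from the node; evictions honor the pods' preStop
# hooks and termination grace periods. DaemonSet pods cannot be
# drained and must be explicitly ignored.
kubectl drain node-1 --ignore-daemonsets

# ... reboot the node, upgrade the kernel, etc. ...

# Allow the scheduler to place pods on the node again.
kubectl uncordon node-1
```

These commands require a live cluster, so treat them as a sketch of the workflow rather than a copy-paste script.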

Recreating pods

If you have read this far, you may be wondering how we recreate the pods that were originally scheduled on the node. We now know how to shut pods down gracefully, but what if it is critical to get back to the original number of running pods? This is where the Deployment resource comes into play.

The Deployment resource is managed by a controller, which does the work of maintaining a specified desired state in the cluster. If you recall our resource config, we did not create the pods directly. Instead, we used a Deployment to manage the pods automatically by providing it with a pod template. Here is the template section of our configuration:

     template:
       metadata:
         labels:
           app: nginx
       spec:
         containers:
         - name: nginx
           image: nginx:1.15
           ports:
           - containerPort: 80

This specifies that the pods in our Deployment are created with the label app: nginx, and that each pod runs one container with the nginx:1.15 image, exposing port 80.

In addition to the pod template, we also specify in the Deployment resource the number of replicas it should maintain:

 replicas: 2

This tells the controller that it should keep 2 pods running in the cluster. Whenever the number of running pods drops, the controller automatically creates a new pod to replace the deleted one. So in our case, when we evict the pod from a node using drain, the Deployment controller automatically recreates it on one of the other available nodes.
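Taken together, the fragments above fit into a single Deployment manifest along these lines (a sketch; the metadata name is illustrative, not the exact config from the series):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2              # desired number of pods the controller maintains
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              # Delay shutdown so the eviction event can propagate,
              # then stop nginx gracefully.
              command: ["sh", "-c", "sleep 5 && /usr/sbin/nginx -s quit"]
```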


In general, with a reasonable delay in the preStop hook and with graceful termination, we can cleanly shut down our pods on a single node. And with the Deployment resource, we can automatically recreate the pods that were shut down. But what if we want to replace all the nodes in the cluster at once?

If we simply drain all the nodes at once, the load balancer may end up with no available pods, resulting in downtime. Worse, for a stateful system, we could destroy the quorum, which would prolong the outage while the new pods start up, hold leader elections, and wait for the quorum of nodes to be restored.

If instead we drain the nodes one at a time, new pods may be scheduled onto the remaining old nodes. We then risk ending up with all the replicas of a pod running on a single old node, so that when we drain that node, we lose all the replicas at once.

To handle this situation, Kubernetes provides the PodDisruptionBudget feature, which specifies how many pods are allowed to be down at any given moment. In the next and final part of the series, we will describe how to use this feature to control the number of simultaneous disruptions caused by drain, despite our straightforward approach of calling drain on all nodes in parallel.
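As a preview, a PodDisruptionBudget for our Deployment might look like the following sketch (the metadata name is illustrative; this uses the policy/v1 API available in current Kubernetes versions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  # Evictions (e.g. from kubectl drain) are refused if they would
  # leave fewer than one matching pod running.
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx
```

With this in place, draining all nodes in parallel cannot take down both replicas at once: the eviction API simply blocks until the budget allows another disruption.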

For a fully implemented and tested version of zero-downtime Kubernetes cluster updates on AWS and other resources, see the original series.

