What CrashLoopBackOff says about a pod's state, and what causes it

Are CrashLoopBackOff errors slowing down your deployments? In this article, we'll look at why pods get stuck in the CrashLoopBackOff state and how to tell, with examples, whether a pod is in this state and why. This article is a translation of a piece by QA engineer Omkar Kulkarni of Encora Inc.

Kubernetes Pod Lifecycle Phases

When a pod (resource) configuration YAML file arrives in Kubernetes, the kube-apiserver validates the configuration and makes it available. Meanwhile, the scheduler watches for new pods and assigns them to nodes based on their resource requirements. The main lifecycle phases of a pod are:

  • Pending – the first phase after a pod is created: the Kubernetes scheduler is placing the pod on a node.

  • Running – the pod enters this phase after at least one of its containers has been successfully created and started.

  • Succeeded – the pod enters this phase after its main container has successfully completed its work.

  • Failed – the pod enters this phase if any of its containers crashes or exits with a nonzero code.

You can find out what phase a pod is in using the following command: $ kubectl get pod.
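
If only the phase itself is needed, it can also be extracted with a JSONPath expression (the pod name here is a placeholder):

$ kubectl get pod <pod-name> -o jsonpath='{.status.phase}'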

Container states in pods

As mentioned above, pods have different phases. Similarly, Kubernetes tracks the state of containers inside pods. There are three such states: Waiting, Running, and Terminated. Once a pod is scheduled to a node, the kubelet starts launching its containers in the corresponding runtime.

You can check the state of a container using the following command: $ kubectl describe pod <pod-name>.

  • Waiting – the container is still in the process of being created and is not yet ready for use.

  • Running – everything is fine: resources (CPU, memory) were allocated successfully and the container process has started.

  • Terminated – the container process has stopped working.
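
If you need the container state in machine-readable form, for example in a script, it can also be pulled out with a JSONPath query (a sketch; the index 0 refers to the first container in the pod, and the pod name is a placeholder):

$ kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state}'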

What is CrashLoopBackOff

In Kubernetes, the CrashLoopBackOff status indicates that a pod is stuck in a restart loop. This means that one or more containers in the pod keep failing to start.

In general, a pod is said to be stuck in a CrashLoop state when one or more of its containers repeatedly starts, crashes, and starts again.

What is BackOff Time and Why is it Important?

A backoff algorithm is a simple technique used in networking and computer science to retry tasks after failures. Imagine that we are trying to send a message to a friend, but for some reason it does not arrive. In that case, instead of immediately trying to send it again, the algorithm tells us to wait a little while.

That is, after the first unsuccessful attempt, we wait for a while before making a second attempt. If that fails, we wait a little longer, and then try again. The term BackOff means that with each attempt, the waiting period gradually increases. The idea is to give the system or network time to recover from a failure. In addition, this approach prevents the avalanche effect.

In Kubernetes terms, the backoff is the time that must pass after a container has crashed before the kubelet tries to restart it. The delay gives the system time to recover and fix the error, and the growing set of intervals “slows down” the restarts.

Suppose the pod does not start right away. The default initial restart delay (in the kubelet) is 10 seconds, and it doubles after each failed attempt: after the first attempt the pod waits 10 seconds, after the second – 20 seconds, then 40, 80, and so on, up to a cap of five minutes.

BackOff cycle of the pod
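
The same idea can be illustrated with a small shell sketch (purely illustrative, not how the kubelet is actually implemented; start_container stands in for the real container start):

delay=10                              # initial backoff, in seconds
until start_container; do            # start_container is a stand-in for the real start attempt
  echo "start failed, retrying in ${delay}s"
  sleep "$delay"
  delay=$(( delay * 2 ))              # the delay doubles after every failure
  [ "$delay" -gt 300 ] && delay=300   # and is capped at 5 minutes
done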

Kubernetes Restart Policy

As mentioned above, Kubernetes attempts to restart a pod when it fails. Pods in Kubernetes are designed to be self-healing entities. This means that containers that experience errors or failures are automatically restarted.

The specific behavior is controlled by a configuration called restartPolicy in the pod specification. By setting the restart policy, we determine how Kubernetes will handle container failures. Possible values are Always, OnFailure, and Never. The default value is Always.

An example of a pod specification with a K8s restart policy defined:

apiVersion: v1
kind: Pod
metadata:
  name: my-nginx
spec:
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80
  restartPolicy: Always    # restart policy
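
To double-check which policy a running pod actually ended up with, you can query the field directly (my-nginx is the pod from the example above):

$ kubectl get pod my-nginx -o jsonpath='{.spec.restartPolicy}'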

How to tell if a pod is in a CrashLoop state

The status of pods can be checked using a simple kubectl command. An example of its output:
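
(The listing below is illustrative; pod names, restart counts, and ages will differ in your cluster.)

$ kubectl get pods
NAME                        READY   STATUS             RESTARTS        AGE
my-nginx-5c9649898b-ccknd   0/1     CrashLoopBackOff   17 (4m3s ago)   71m
my-nginx-7548fdb77b-v47wc   1/1     Running            0               71m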

It is clear that my-nginx is not ready: the READY column shows 0/1 and its status is listed as CrashLoopBackOff. The number of restarts is also visible.

What happens here is exactly what was said above: the pod crashes and tries to restart. While the pod is “resting”, you can try to determine the reason for its restarts.

For example, on the PerfectScale platform you can open the Alerts tab to see critical alerts about Kubernetes resources and unusual activity in the system, as well as a detailed alert summary for a specific tenant.

Common Causes of CrashLoopBackOff

1. Lack of resources

Memory allocation plays a critical role in ensuring smooth operation of Kubernetes deployments. If a pod's memory requirements are not met, a CrashLoopBackOff condition may occur.

For example, if an application “eats” more memory than it is allocated, the system's OOM (Out Of Memory) killer terminates it. The result is the CrashLoopBackOff state.
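
For instance, a container spec like the sketch below (the container name, image, and values are illustrative) sets a hard memory ceiling; if the application's real usage exceeds the limit, the OOM killer terminates the process and the pod ends up in the restart loop:

  containers:
  - name: app
    image: my-app:1.0        # placeholder image
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"      # exceeding this value triggers the OOM killer
        cpu: "500m"

Raising the limit, or reducing the application's memory footprint, is usually the way out.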

2. Problems related to images

  • Insufficient Permissions – If a container image does not have the necessary permissions to access resources, it may fail.

  • Incorrect container image – If a pod uses an incorrect image to run a container, it can lead to constant restarts and a CrashLoopBackOff state.

3. Configuration errors

  • Syntax errors or typos – When configuring pods, there may be errors such as typos in container names, image names, and environment variables. These can prevent containers from starting correctly.

  • Incorrectly set requests and resource limits – Errors in setting up requested resources (minimum required amount) and limits (maximum allowed amount) can lead to container failures.

  • Missing dependencies – dependencies that the user forgot to declare in the pod spec, such as a required ConfigMap, Secret, or volume.

4. Problems with external services

  • Network issues – If a container depends on some external service, such as a database, and that external service is unavailable or not running at startup, this can lead to CrashLoopBackOff.

  • If one of the external services a container relies on is down, the container may fail because it cannot connect.
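
A common mitigation is to wait for the external service before the main container starts, for example with an init container (a sketch; the busybox image, the db host, and the port are placeholders for your environment):

spec:
  initContainers:
  - name: wait-for-db        # runs to completion before the main containers start
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z db 5432; do echo waiting for db; sleep 2; done']
  containers:
  - name: app
    image: my-app:1.0        # the container that actually needs the database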

5. Exceptions in applications

An error or exception in a containerized application can cause it to crash. The causes vary: invalid input, insufficient resources, network issues, file permission problems, misconfigured secrets and environment variables, or bugs in the code. If the application code does not catch and handle exceptions gracefully, this can lead to a CrashLoopBackOff condition in Kubernetes.
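
The exit code of the last crashed container often hints at which code path failed; it can be read directly, without scrolling through the full describe output (the pod name is a placeholder):

$ kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'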

6. Liveness probes are not configured correctly

Liveness probes help ensure that the process in the container is actually running and has not hung; if a probe fails, the container is killed and restarted (as specified by the pod's restartPolicy). A common mistake is to misconfigure the liveness probe so that the container is restarted during temporary delays in responses, for example under high load. Such a probe exacerbates the problem rather than solving it.
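
A sketch of a more forgiving probe configuration (the endpoint, port, delays, and thresholds are illustrative and should be tuned to the application's real startup and response times):

    livenessProbe:
      httpGet:
        path: /healthz           # health endpoint exposed by the application
        port: 8080
      initialDelaySeconds: 15    # give the application time to start before the first check
      periodSeconds: 10          # how often to probe
      timeoutSeconds: 2          # how long to wait for a response
      failureThreshold: 3        # restart only after several consecutive failures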

Finding out the reasons behind CrashLoopBackOff

The previous section showed that a pod can end up in the CrashLoopBackOff state for a variety of reasons. Now let's talk about how to identify the specific cause. Troubleshooting typically begins with listing the possible scenarios and then narrowing down to the root cause by gradually eliminating the irrelevant ones.

Let's assume that the command kubectl get pods showed the CrashLoopBackOff status for one or more pods:

$ kubectl get pods 
NAME                        READY   STATUS             RESTARTS         AGE
app                         1/1     Running            1 (3d12h ago)    8d
busybox                     0/1     CrashLoopBackOff   18 (2m12s ago)   70m 
hello-8n746                 0/1     Completed          0                8d
my-nginx-5c9649898b-ccknd   0/1     CrashLoopBackOff   17 (4m3s ago)    71m
my-nginx-7548fdb77b-v47wc   1/1     Running            0                71m

Let's try to establish the cause step by step:

1. Let's check the description of the pod

The command kubectl describe pod pod-name displays detailed information about a specific pod and its containers:

$ kubectl describe pod pod-name 

Name:           pod-name
Namespace:      default
Priority:       0
……………………
State:         Waiting
Reason:        CrashLoopBackOff
Last State:    Terminated
Reason:        StartError
……………………
Warning  Failed   41m (x13 over 81m)   kubelet  Error: container init was OOM-killed (memory limit too low?): unknown

Its output contains various information, such as:

  • State: Waiting

  • Reason: CrashLoopBackOff

  • Last state: Terminated

  • Reason: StartError

This is enough to make a couple of assumptions about the reasons. From the last lines of the output “kubelet Error: container init was OOM-killed (memory limit too low?)” it follows that the container does not start due to insufficient memory.
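
If lack of memory is indeed the culprit, one quick way to test the hypothesis is to raise the limit on the owning workload and see whether the restarts stop (the deployment name and value here are examples):

$ kubectl set resources deployment my-nginx --limits=memory=256Mi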

2. Let's check the pod logs

Logs contain detailed information about Kubernetes resources: how the container started, any problems that occurred along the way, and how the resource terminated or completed successfully.

You can check the pod logs using the following commands:

$ kubectl logs pod-name   # fetch the logs of a pod with a single container

View logs of a pod with multiple containers:

$ kubectl logs pod-name --all-containers=true

View logs for a specific period of time. For example, to display logs for the last hour, run:

$ kubectl logs pod-name --since=1h
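
For a pod that keeps crashing, the current container's log is often empty. The --previous flag shows the output of the last terminated instance, which is usually where the actual error message is:

$ kubectl logs pod-name --previous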

3. Let's look at the events

Events contain the latest information about Kubernetes resources. You can query events for a specific namespace or filter them by a specific workload:

$ kubectl events 
LAST SEEN               TYPE      REASON    OBJECT                          MESSAGE
4h43m (x9 over 10h)     Normal    BackOff   Pod/my-nginx-5c9649898b-ccknd   Back-off pulling image "nginx:latest"
3h15m (x11 over 11h)    Normal    BackOff   Pod/busybox                     Back-off pulling image "busybox"
40m (x26 over 13h)      Warning   Failed    Pod/my-nginx-5c9649898b-ccknd   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown

An example of the command in action is shown above.

List recent events for all namespaces:

$ kubectl get events --all-namespaces

List events for a specific pod:

$ kubectl events --for pod/pod-name

4. Let's check the deployment logs:

$ kubectl logs deployment/deployment-name

Found 2 pods, using pod/my-nginx-7548fdb77b-v47wc
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Sourcing 

Deployment logs help you figure out why containers are failing and why a pod ends up in the CrashLoopBackOff state.

Conclusion

In this article, we talked about CrashLoopBackOff in detail. It is important to note that it is not an error in itself, but a state that can have various causes. Analyzing the logs and events of the affected pod will help you establish them.

