Top 10 PromQL queries for monitoring Kubernetes
This article provides examples of popular Prometheus queries for monitoring Kubernetes.
If you are just getting started with Prometheus and find it difficult to write PromQL queries, we recommend that you consult the PromQL Getting Started Guide. We'll skip the theory here and get straight to practice.
The ranking is based on the experience of Sysdig, a company that helps hundreds of customers set up monitoring for their clusters every day:
1. The number of pods in each namespace
Knowing the number of pods in each namespace helps you detect anomalies in the cluster, such as too many pods in a single namespace:
sum by (namespace) (kube_pod_info)
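On a cluster with many namespaces, the full list can be hard to scan. A minimal sketch that keeps only the busiest namespaces wraps the same query in topk; the cutoff of 5 is an arbitrary value chosen for illustration:

```promql
# Five namespaces with the most pods (the 5 is an illustrative cutoff)
topk(5, sum by (namespace) (kube_pod_info))
```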
2. The number of containers without CPU limits in each namespace
Setting limits correctly is important for optimizing application and cluster performance. This query counts the containers without CPU limits in each namespace:
count by (namespace)(sum by (namespace,pod,container)(kube_pod_container_info{container!=""}) unless sum by (namespace,pod,container)(kube_pod_container_resource_limits{resource="cpu"}))
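The same pattern should work for memory limits. The sketch below only swaps the resource label value to "memory", assuming your kube-state-metrics version exposes memory limits under the same metric name:

```promql
# Containers without memory limits in each namespace
# (same structure as the CPU query, with resource="memory")
count by (namespace)(
  sum by (namespace,pod,container)(kube_pod_container_info{container!=""})
  unless
  sum by (namespace,pod,container)(kube_pod_container_resource_limits{resource="memory"})
)
```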
3. The number of pod restarts in each namespace
This query counts how many times pods in each namespace have changed their Ready status over the last five minutes, which points to pods that have been restarting. This is an important metric, since a large number of pod restarts usually means CrashLoopBackOff:
sum by (namespace)(changes(kube_pod_status_ready{condition="true"}[5m]))
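The query above treats changes in readiness as a proxy for restarts. If your kube-state-metrics deployment exposes the container restart counter kube_pod_container_status_restarts_total, a more direct variant can be sketched as follows. This is an alternative to the query above, not a replacement for it, and the 5m window is kept only for consistency:

```promql
# Container restarts per namespace over the last 5 minutes,
# assuming kube_pod_container_status_restarts_total is available
sum by (namespace) (increase(kube_pod_container_status_restarts_total[5m]))
```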
4. Pods in Not Ready status in each namespace
This query shows the number of pods in each namespace that are not ready. It can be the first step toward locating and fixing a problem:
sum by (namespace)(kube_pod_status_ready{condition="false"})
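The query above returns one total per namespace. To see which pods are affected, one possible sketch keeps the pod label and filters for series whose value is 1:

```promql
# Individual pods that currently report Ready = false
sum by (namespace, pod) (kube_pod_status_ready{condition="false"}) == 1
```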
5. Exceeding Cluster Resources – CPU
Avoid situations where the sum of CPU limits exceeds the cluster's CPU capacity. Otherwise, you may run into CPU throttling. You can detect overcommitted CPU limits with the following query, where a positive result means that the limits exceed the capacity:
sum(kube_pod_container_resource_limits{resource="cpu"}) - sum(kube_node_status_capacity_cpu_cores)
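The node capacity metric used above has a different name in newer kube-state-metrics releases. To the best of our knowledge, version 2.0 dropped kube_node_status_capacity_cpu_cores in favor of kube_node_status_capacity with a resource label. If the query above returns no data, a variant along these lines may work; check which metrics your deployment actually exposes:

```promql
# CPU limits minus node CPU capacity,
# assuming the kube-state-metrics v2 naming for node capacity
sum(kube_pod_container_resource_limits{resource="cpu"})
  - sum(kube_node_status_capacity{resource="cpu"})
```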
6. Exceeding Cluster Resources – Memory
If the sum of all memory limits exceeds the capacity of the cluster, pods can be evicted when a node runs out of memory. To check, use this PromQL query:
sum(kube_pod_container_resource_limits{resource="memory"}) - sum(kube_node_status_capacity_memory_bytes)
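An absolute difference in bytes can be hard to interpret. As a sketch, the same two sums can be divided instead of subtracted to express memory limits as a percentage of cluster capacity, where a value above 100 means the limits are overcommitted:

```promql
# Memory limits as a percentage of total node memory capacity
sum(kube_pod_container_resource_limits{resource="memory"})
  / sum(kube_node_status_capacity_memory_bytes) * 100
```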
7. Number of healthy cluster nodes
The query will display the number of healthy cluster nodes:
sum(kube_node_status_condition{condition="Ready", status="true"}==1)
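The complementary view, the number of nodes that are not reporting Ready, can be sketched by comparing the same series against 0. Note that this expression returns no data, rather than 0, when every node is healthy:

```promql
# Nodes whose Ready condition is currently not "true"
sum(kube_node_status_condition{condition="Ready", status="true"} == bool 0)
```

The bool modifier used here turns the comparison into 0/1 values, so the sum yields 0 when all nodes are ready, provided at least one node series exists.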
8. The number of cluster nodes that may not work correctly
This query finds nodes whose Ready condition has changed more than twice in the last 15 minutes, that is, nodes that keep flapping between Ready and Not Ready:
sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2
9. Detect idle CPU cores
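Planning resources for a Kubernetes cluster is not an easy task. This query helps you determine how many requested CPU cores are sitting idle, by comparing each container's CPU requests with its actual usage over the last 30 minutes: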
sum((rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[30m]) - on (namespace,pod,container) group_left avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="cpu"})) * -1 >0)
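The query above returns a single cluster-wide number. To see where the idle cores are, one option is to group the outer sum by namespace. This is a sketch that reuses the expression unchanged apart from the grouping clause:

```promql
# Idle (requested but unused) CPU cores per namespace
sum by (namespace) (
  (
    rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[30m])
    - on (namespace,pod,container) group_left
    avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="cpu"})
  ) * -1 > 0
)
```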
10. Detecting unused memory
This query can help you reduce costs by showing how much requested memory, in GiB, is not being used:
sum((container_memory_usage_bytes{container!="POD",container!=""} - on (namespace,pod,container) avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="memory"})) * -1 >0 ) / (1024*1024*1024)
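To find the containers that contribute most to the unused memory, the outer sum can be replaced with topk. The sketch below keeps the inner expression as is; the cutoff of 10 is an arbitrary choice for illustration:

```promql
# Ten containers with the most requested-but-unused memory, in GiB
topk(10,
  (
    container_memory_usage_bytes{container!="POD",container!=""}
    - on (namespace,pod,container)
    avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="memory"})
  ) * -1 > 0
) / (1024*1024*1024)
```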
Want to know more?
We recommend that you explore our PromQL cheat sheet to learn how to write more complex PromQL queries.
Also take advantage of the great Awesome Prometheus alerts collection. It includes several hundred Prometheus alert rules that you can study to learn more about PromQL and Prometheus.