
Last updated: March 18, 2025
Kubernetes is widely used for orchestrating containerized applications, and ensuring high availability is a key requirement. One critical aspect of maintaining a stable Kubernetes cluster is monitoring for pod failures and triggering alerts when issues arise. Detecting failing pods promptly allows teams to respond effectively, minimizing downtime and ensuring smooth operations.
In this tutorial, we’ll explore how to set up alerts in Kubernetes for pod failures. In addition, we’ll discuss key monitoring tools such as Prometheus and Alertmanager, various failure scenarios, best practices to manage alerts efficiently, and hands-on examples with code snippets to demonstrate alerting configurations. By the end of this guide, we’ll have a comprehensive understanding of monitoring Kubernetes pods and ensuring a resilient deployment environment.
In a complex distributed system like Kubernetes, pods can fail for various reasons, including resource constraints, misconfigurations, or infrastructure failures. Without an effective alerting system, identifying these failures can be challenging, potentially leading to service disruptions and degraded user experiences. By setting up a robust alerting mechanism, we can proactively monitor pod health, troubleshoot issues efficiently, and automate responses where possible.
Before diving into alert configurations, we need to understand how Kubernetes monitoring works. In essence, Kubernetes doesn’t provide a built-in alerting mechanism, but it does expose a wealth of metrics through the Metrics API and to third-party monitoring solutions. To monitor Kubernetes effectively, we typically use Prometheus for metrics collection, Alertmanager for alert routing and notification, exporters such as kube-state-metrics and node-exporter for pod and node state, and Grafana for visualization.
By leveraging these tools, we create a robust monitoring pipeline that detects and responds to pod failures in real-time.
To monitor pod failures in Kubernetes, we utilize Prometheus and Alertmanager. Prometheus collects metrics, while Alertmanager handles the routing and notification of alerts.
We deploy Prometheus and Alertmanager using the Prometheus Operator, which simplifies configuration and management.
To get started, we install the Prometheus Operator:
$ kubectl create namespace monitoring
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring
This installs Prometheus, Alertmanager, and related components in the monitoring namespace. Once deployed, Prometheus starts collecting metrics from the Kubernetes API, nodes, and pods, while Alertmanager handles alert routing and notification.
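Before moving on, we can confirm that all components started correctly (the exact pod names depend on the Helm release name):
$ kubectl get pods -n monitoring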
Moreover, we can verify Prometheus is running by port-forwarding its service:
$ kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090 -n monitoring
Finally, opening http://localhost:9090 in a browser displays the Prometheus UI.
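Alertmanager decides where those alerts end up. As a minimal sketch, assuming we want Slack notifications (the webhook URL and channel below are placeholders), a basic route and receiver in alertmanager.yml could look like this:
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#k8s-alerts'
        send_resolved: true
With the kube-prometheus-stack chart, we’d normally supply this configuration through the chart’s Alertmanager values rather than editing the file by hand.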
Prometheus requires scrape configurations to collect relevant metrics. Next, let’s look at an example configuration for scraping Kubernetes pod metrics:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
This configuration ensures Prometheus scrapes only pods that have the proper annotations.
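For reference, a pod opts in to scraping through annotations like these (the application name, image, and port are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: sample-app
      image: sample-app:1.0
      ports:
        - containerPort: 8080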
Pod failures can occur for various reasons. In this section, we’ll cover some of the most common causes of pod failure in Kubernetes.
When a pod exceeds its assigned CPU or memory limits, it may be evicted or restarted. In particular, this can be caused by inefficient resource allocation, memory leaks, or unexpected traffic spikes.
Let’s have a look at an alerting rule whose PromQL expression detects pods killed due to memory limits:
- alert: HighMemoryUsage
  expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.pod }} has been OOMKilled"
    description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} was terminated due to out-of-memory."
This is a crucial alert for monitoring the stability and resource utilization of Kubernetes applications. In addition, OOMKilled events indicate that pods aren’t allocated sufficient memory, which can lead to application crashes and service disruptions.
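To get rules like this one in front of Prometheus, we wrap them in a PrometheusRule resource, which the Prometheus Operator watches. The sketch below assumes the default kube-prometheus-stack rule selector, which matches resources labeled with the Helm release name (prometheus in our installation):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-failure-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: pod-failure.rules
      rules:
        - alert: HighMemoryUsage
          expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} has been OOMKilled"
Applying this manifest with kubectl apply -f is enough for the Operator to load the rule into Prometheus.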
Kubernetes relies on readiness and liveness probes to determine whether a pod is healthy. If a probe repeatedly fails, the pod might restart continuously.
Here’s an example alerting rule for detecting containers that fail readiness checks:
- alert: ReadinessProbeFailure
  expr: kube_pod_container_status_ready == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} is failing readiness probes"
    description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not passing readiness checks."
Readiness probes are used by Kubernetes to determine when a container is ready to receive traffic. If a readiness probe fails, Kubernetes will remove the pod from the service’s endpoint list, preventing traffic from being routed to it. Hence, this helps ensure that only healthy pods are serving requests.
A consistent readiness probe failure can indicate various problems, like application startup issues, dependency problems, configuration errors, and resource limitations. Therefore, this alert is valuable for detecting and addressing potential issues that could impact application availability.
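For context, here’s what a typical readiness probe looks like in a pod spec (the endpoint, port, and timings are illustrative):
containers:
  - name: sample-app
    image: sample-app:1.0
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3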
Network connectivity issues between pods or nodes can cause failures. Here’s an example alert:
- alert: NetworkUnavailable
  expr: rate(kube_pod_container_status_restarts_total[5m]) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High pod restart count detected"
    description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently, which may indicate network instability."
This Prometheus alerting rule detects pods that restart frequently, on the assumption that frequent restarts can point to network instability between pods or nodes, which is a critical issue.
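When this alert fires, a couple of standard kubectl commands help us confirm whether the restarts are network-related or caused by something else (substituting the real pod and namespace):
$ kubectl describe pod <pod-name> -n <namespace>
$ kubectl logs <pod-name> -n <namespace> --previous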
In this section, we’ll delve deeper into more advanced alerting strategies that can detect anomalies. In particular, we’ll discuss dynamic alerting and multi-cluster alerting.
Instead of static thresholds, we can use anomaly-detection techniques to identify unusual behavior. Prometheus supports this through functions such as holt_winters() and predict_linear(), which smooth or extrapolate a series rather than compare it against a fixed value. As a baseline, let’s first look at a conventional threshold-based CPU alert:
- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage on node {{ $labels.instance }} exceeds 80% for 5 minutes."
This Prometheus alerting rule fires when a node’s average CPU utilization stays above 80% for five minutes. Because it relies on a fixed threshold, it only tells us about a problem that has already happened.
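For a genuinely trend-based rule, we can record per-node CPU utilization and extrapolate it with predict_linear(). The following is a minimal sketch: the recording-rule name, the 30-minute windows, and the 80% threshold are illustrative choices rather than values from a standard mixin:
groups:
  - name: cpu-trend.rules
    rules:
      # Per-node CPU utilization as a 0-1 gauge
      - record: instance:cpu_utilization:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
  - name: cpu-trend.alerts
    rules:
      # Extrapolate the last 30 minutes of utilization 30 minutes ahead
      - alert: CPUUsageTrendingHigh
        expr: predict_linear(instance:cpu_utilization:ratio[30m], 30 * 60) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage on {{ $labels.instance }} is trending above 80%"
The same pattern works with holt_winters() when we prefer smoothing over linear extrapolation.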
For organizations using multiple Kubernetes clusters, we can configure Prometheus Federation to aggregate metrics across clusters:
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"kube.*"}'
    static_configs:
      - targets:
          - 'prometheus-cluster1:9090'
          - 'prometheus-cluster2:9090'
This Prometheus scrape configuration sets up a “federate” job to collect metrics from two separate Prometheus servers, prometheus-cluster1:9090 and prometheus-cluster2:9090. Additionally, it specifically targets the /federate endpoint on these servers, requesting only metrics with names starting with “kube,” effectively pulling in Kubernetes-related metrics. By setting honor_labels: true, it ensures that the original labels from the source Prometheus servers are retained.
Finally, this allows for centralized monitoring of Kubernetes metrics from multiple clusters within a single Prometheus instance.
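One practical detail: for the federated series to remain distinguishable, each source Prometheus should identify its own cluster through an external label. The label name and values below are a common convention, not a requirement:
global:
  external_labels:
    cluster: cluster1
Since the federation job sets honor_labels: true, this cluster label survives the scrape and can be used in alert expressions and routing.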
In this article, we saw how to set up alerts for failing Kubernetes pods and why doing so is crucial for maintaining cluster health.
By leveraging Prometheus and Alertmanager, we can detect and respond to failures efficiently. Moreover, defining precise alerting rules, routing notifications effectively, and applying best practices ensures smooth operations while minimizing downtime.
With proactive monitoring in place, we not only improve system reliability but also reduce operational overhead, making Kubernetes deployments more resilient and scalable.