1. Overview

Mismanaging memory in a Kubernetes cluster can lead to unexpected pod evictions and, in the worst case, cluster-wide instability. Fortunately, Kubernetes provides several metrics for tracking the memory usage of the pods in the cluster.

In this tutorial, we’ll learn about the different memory metrics that a Kubernetes cluster collects. Importantly, we’ll learn to interpret the readings of each metric.

2. Kubernetes Metrics

Kubernetes uses a hierarchical approach to collect and expose the containers’ metrics. At the lowest level, cAdvisor (Container Advisor) runs as part of the kubelet on each node and collects resource statistics for the containers running on that node.

The process collects the statistics by hooking into the container runtime on the node. These include, but aren’t limited to, CPU, memory, filesystem, and network usage. The cAdvisor process exposes these metrics through a REST API on the node itself.

Then, the metrics-server of the cluster collects and aggregates the statistics from the cAdvisor process on each node. The Horizontal Pod Autoscaler uses these aggregated metrics to scale workloads based on resource usage. In addition, the kubectl top command presents the statistics aggregated by the metrics-server.
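
For example, assuming the metrics-server is installed in the cluster, kubectl top shows these aggregated readings per pod; the pod names and values below are purely illustrative:

$ kubectl top pods
NAME                            CPU(cores)   MEMORY(bytes)
memory-demo-7dc48cf8b4-2xrfs    12m          180Mi
web-frontend-5f6d8c9b7d-kq2nv   5m           64Mi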

3. Linux Kernel Memory Terminology

The kubelet process on each node of a Kubernetes cluster relies on a container runtime to run the workloads. In turn, almost all container runtimes rely on Linux’s cgroup mechanism to run what we know as a container. As such, the memory readings we collect are what the cgroup mechanism provides.

Therefore, having a solid grasp of some of the Linux memory terminology makes it easier to understand the meaning of each metric we’re collecting.

3.1. Anonymous Memory

Anonymous memory refers to memory that isn’t backed by a file on a filesystem. For example, the stack and heap memory a program allocates are anonymous memory from the Linux kernel’s point of view. Additionally, when a process creates an anonymous mapping with mmap (one without a backing file), that memory also counts as anonymous memory.

3.2. Page Cache

When a program reads data from disk, the kernel caches the content in a portion of memory known as the page cache. Thanks to this cache, subsequent reads of the same file don’t have to go through the disk, which is far more expensive than a memory lookup.

Importantly, page cache memory is evictable when there’s memory pressure. In other words, it’s perfectly fine for a process to use a lot of memory if most of it is spent on caching files.
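
As a quick sketch, assuming the node runs cgroup v2, we can see the split between anonymous memory and the page cache by reading a container’s cgroup statistics; the pod name and values below are illustrative, and the output is trimmed:

$ kubectl exec memory-demo-7dc48cf8b4-2xrfs -- cat /sys/fs/cgroup/memory.stat
anon 104857600
file 52428800
inactive_file 31457280
active_file 20971520
...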

4. Container Memory Metrics

A Kubernetes cluster exposes many container-specific memory metrics through the kubelet’s cAdvisor integration. These metrics provide insight into the stability of our cluster. Importantly, they provide diagnostic information when the containers in the cluster are facing memory-related problems.

4.1. container_memory_usage_bytes

The container_memory_usage_bytes metric represents the total memory used by the container. This metric encompasses the consumption of all types of memory. For example, it accounts for both the container’s anonymous memory and its page cache.

Despite its comprehensiveness, container_memory_usage_bytes isn’t a reliable indicator of a problem on its own. This is because the reading includes reclaimable memory, such as the page cache. Therefore, a high container_memory_usage_bytes reading isn’t necessarily a cause for concern.
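
To see the raw reading for a single container, we can scrape the kubelet’s cAdvisor endpoint through the API server; the node name is a placeholder, and the output below is trimmed and illustrative:

$ kubectl get --raw /api/v1/nodes/worker-node-1/proxy/metrics/cadvisor \
    | grep container_memory_usage_bytes | grep memory-demo
container_memory_usage_bytes{container="memory-demo",namespace="default",pod="memory-demo-7dc48cf8b4-2xrfs",...} 1.572864e+08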

4.2. container_memory_cache

The container_memory_cache metric measures the amount of memory the container uses for caching purposes. The container uses this portion of memory to speed up I/O by reducing the number of disk accesses.

A high reading on container_memory_cache is, again, informational rather than a source of concern, as the kernel can reclaim this memory should the container come under memory pressure.

4.3. container_memory_mapped_file

In Linux, a memory-mapped file refers to a mechanism that maps a file, or a portion of it, directly into a process’s virtual address space. This mapping allows the process to access the file as if it were part of its own memory. Crucially, we can use the mmap system call to create a memory-mapped file.

The container_memory_mapped_file metric shows the amount of memory the container uses for memory-mapped files.

4.4. container_memory_rss

The container_memory_rss metric measures the total amount of anonymous memory and swap cache memory used by the container. Importantly, we shouldn’t confuse container_memory_rss with the container’s true resident set size, which is the container_memory_rss value plus the memory used for memory-mapped files.

The memory measured by this metric is important for monitoring, as it contains memory that can’t be easily evicted. Unlike page caches, anonymous memory can’t be evicted without corrupting the application state.

4.5. container_memory_working_set_bytes

The container_memory_working_set_bytes metric, as the name implies, tells us how much memory the container needs to function. Specifically, it measures the container’s anonymous memory plus its active page cache. In other words, container_memory_working_set_bytes equals container_memory_usage_bytes minus the memory used for caching inactive files.

In contrast to container_memory_rss, the container_memory_working_set_bytes reading is closer to the memory the application actively uses. This is because it includes the active page cache, which the application constantly reads from.
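
As an illustration, querying the same cAdvisor endpoint for both metrics shows the working set sitting below the total usage by roughly the size of the inactive page cache; the values below are illustrative and the label sets are trimmed:

$ kubectl get --raw /api/v1/nodes/worker-node-1/proxy/metrics/cadvisor \
    | grep -E 'container_memory_(usage|working_set)_bytes' | grep memory-demo
container_memory_usage_bytes{container="memory-demo",pod="memory-demo-7dc48cf8b4-2xrfs",...} 1.572864e+08
container_memory_working_set_bytes{container="memory-demo",pod="memory-demo-7dc48cf8b4-2xrfs",...} 1.2582912e+08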

5. Metrics to Use for Monitoring Out of Memory (OOM) Events

When we specify a memory limit for our container, the kernel’s cgroup mechanism enforces the limit by constantly tracking the container’s memory consumption. When memory usage approaches the limit, the cgroup memory controller tries to reclaim memory by evicting the container’s page cache.
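
For instance, assuming a hypothetical deployment named memory-demo, we can attach such a limit with kubectl set resources:

$ kubectl set resources deployment memory-demo --limits=memory=256Mi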

If the memory pressure persists despite evicting the page cache, the kernel invokes the Out of Memory (OOM) killer on the container. For containers that are OOM killed, we’ll see the OOMKilled status in the output of kubectl get:

$ kubectl get pods
NAME                           READY   STATUS      RESTARTS   AGE
memory-demo-7dc48cf8b4-2xrfs   0/1     OOMKilled   0          3m
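
We can also confirm the reason in the container’s last state with kubectl describe; the output below is trimmed to the relevant lines:

$ kubectl describe pod memory-demo-7dc48cf8b4-2xrfs | grep -A 2 'Last State'
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137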

As a preventative measure, we can monitor the memory metrics of the containers in our cluster and act before an OOM kill happens. Specifically, we can rely on the container_memory_working_set_bytes metric as the indicator of a possible OOM kill event.

Concretely, when container_memory_working_set_bytes increases continuously despite the kernel’s attempts to reclaim the page cache, we can conclude that unreclaimable memory is growing. When that unreclaimable memory grows close to the limit, the OOM killer kills off the container to enforce the limit we’ve set.
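
As a rough sketch, assuming Prometheus scrapes the cAdvisor metrics and kube-state-metrics (which provides the kube_pod_container_resource_limits series), a query like the following flags containers whose working set exceeds 90% of their memory limit; the Prometheus host and the 90% threshold are arbitrary choices, and label names can vary between setups:

$ curl -s http://prometheus.example:9090/api/v1/query --data-urlencode \
    'query=container_memory_working_set_bytes{container!=""}
             / on(namespace, pod, container)
           kube_pod_container_resource_limits{resource="memory"} > 0.9'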

6. Conclusion

In this article, we first looked at how a Kubernetes cluster collects and aggregates resource usage statistics from all its nodes. Then, we learned about the various container memory metrics that cAdvisor collects and exposes. Finally, we showed that container_memory_working_set_bytes is the metric we can rely on to detect potential OOM kill events.