Why Does Container Memory Hover at 99%? Decoding RSS, PageCache, and cgroup Limits in K8s
This article explains why Kubernetes containers often show near‑full memory usage. It covers Linux process address spaces, the distinction between RSS and page cache, and how cgroup statistics are collected, and it provides step‑by‑step C and Go examples to reproduce and diagnose the behavior.
Linux Process Memory Allocation
A Linux process's virtual address space is split into kernel space and user space; user space holds the text, data, and BSS segments, the heap, memory‑mapped regions such as shared libraries, and the stack. Each region is represented by a vm_area_struct (VMA), and a process's VMAs are tracked by its mm_struct. A segmentation fault occurs when an access falls outside every VMA or violates the permissions of the VMA it hits; the layout can be inspected via /proc/<pid>/maps.
malloc and physical memory
malloc expands the heap VMA (or, for large requests, glibc creates a separate anonymous mapping with mmap) but does not allocate physical pages immediately. Physical pages are allocated only on first write, during page‑fault handling. The following C program demonstrates that RSS stays low after malloc and jumps by the size of the allocated region after memset, generating roughly one minor page fault per 4 KiB page.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

const int64_t GB = 1024*1024*1024;
const int64_t MB = 1024*1024;
const int64_t KB = 1024;

// Print peak RSS (ru_maxrss is in KiB on Linux) and the page-fault counters.
void max_rss(void){
    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    printf("Current max rss %ld kb, minor %ld, major %ld\n", r.ru_maxrss, r.ru_minflt, r.ru_majflt);
}

int main(void){
    printf("Pid %d\n", (int)getpid());
    int number = 128; // MB
    void *ptr = malloc(number*MB);
    if (!ptr){ perror("Out of memory"); exit(EXIT_FAILURE); }
    printf("Allocated %d MB, ptr %p\n", number, ptr);
    max_rss();                   // only virtual memory so far; RSS is still small
    sleep(60);
    memset(ptr, 0, number*MB);   // the first write faults in every page
    printf("Used %d MB via memset\n", number);
    max_rss();
    sleep(60);
    free(ptr);
    printf("Freed ptr %p\n", ptr);
    max_rss();                   // ru_maxrss is a peak value, so it does not drop after free
    sleep(60);
    return 0;
}
Typical output shows RSS increasing from a few megabytes to ~128 MiB after memset, corresponding to 32,768 minor page faults (128 MiB / 4 KiB).
container_memory_rss in Kubernetes
Process RSS is the sum of resident anonymous pages, file‑backed pages, and shared memory. In a container, the metric container_memory_rss is taken from the rss field of the cgroup file memory.stat, which counts **only anonymous pages** (MEMCG_RSS). File‑backed and shared pages are accounted in MEMCG_CACHE.
Example program allocating three kinds of memory:
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

#define FILE_SIZE (4*1024*1024)  // 4 MiB
#define ANON_SIZE (8*1024*1024)  // 8 MiB
#define SHM_SIZE (10*1024*1024)  // 10 MiB

// File-backed pages: a shared mapping of a regular file.
void allocate_filepages(void){
    int fd = open("tempfile", O_RDWR|O_CREAT|O_TRUNC, 0600);
    if (fd < 0 || ftruncate(fd, FILE_SIZE) != 0){ perror("tempfile"); exit(EXIT_FAILURE); }
    void *p = mmap(NULL, FILE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED){ perror("mmap"); exit(EXIT_FAILURE); }
    memset(p, 0, FILE_SIZE);
    printf("Allocated %d MiB file-mapped\n", FILE_SIZE/(1024*1024));
}

// Anonymous pages: heap memory that has been written to.
void allocate_anonpages(void){
    void *p = malloc(ANON_SIZE);
    if (!p){ perror("malloc"); exit(EXIT_FAILURE); }
    memset(p, 0, ANON_SIZE);
    printf("Allocated %d MiB anonymous\n", ANON_SIZE/(1024*1024));
}

// Shared memory: a System V segment (remove it afterwards with ipcrm).
void allocate_shm(void){
    int id = shmget(IPC_PRIVATE, SHM_SIZE, IPC_CREAT|0600);
    if (id < 0){ perror("shmget"); exit(EXIT_FAILURE); }
    void *p = shmat(id, NULL, 0);
    if (p == (void *)-1){ perror("shmat"); exit(EXIT_FAILURE); }
    memset(p, 0, SHM_SIZE);
    printf("Allocated %d MiB shared memory\n", SHM_SIZE/(1024*1024));
}

int main(void){
    printf("Process %d\n", (int)getpid());
    allocate_filepages();
    allocate_anonpages();
    allocate_shm();
    sleep(3600);  // keep the process alive for inspection
    return 0;
}
While the program sleeps, top -p $pid reports a RES larger than the explicit allocations because RSS also includes the resident pages of the stack, program text, and shared libraries. Inspecting /proc/<pid>/status reveals the split into RssAnon, RssFile, and RssShmem. The cgroup memory.stat reports rss (anonymous only) and cache (file‑backed + shared).
container_memory_cache
Page cache stores file data in RAM to accelerate I/O. Writing a 100 MiB file and checking free -m or /proc/meminfo shows the Cached field increase by roughly the file size. The first read creates Inactive(File) pages; a second read promotes them to Active(File). In Kubernetes, container_memory_cache mirrors the cgroup cache statistic (MEMCG_CACHE), which aggregates page cache and shared memory. Experiments show that cache can dominate container memory usage while RSS stays low.
container_memory_mapped_file
Memory‑mapped files contribute to container_memory_mapped_file, which comes from the mapped_file field in memory.stat (the kernel's NR_FILE_MAPPED counter). The following program creates a 100 MiB file, maps it, and overwrites its content:
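#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_PATH "/root/test.txt"
#define FILE_SIZE (100*1024*1024)

int main(void){
    // Step 1: create the file and fill it with 'A' through write().
    int fd = open(FILE_PATH, O_WRONLY|O_CREAT|O_TRUNC, 0600);
    if (fd < 0){ perror("open"); return 1; }
    char *buf = malloc(FILE_SIZE);
    if (!buf){ perror("malloc"); return 1; }
    memset(buf, 'A', FILE_SIZE);
    if (write(fd, buf, FILE_SIZE) != FILE_SIZE){ perror("write"); return 1; }
    free(buf);
    close(fd);

    // Step 2: map the file and overwrite it through the mapping.
    fd = open(FILE_PATH, O_RDWR);
    if (fd < 0){ perror("open"); return 1; }
    char *map = mmap(NULL, FILE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED){ perror("mmap"); return 1; }
    memset(map, 'B', FILE_SIZE);
    sleep(60);  // pause so VmRSS and mapped_file can be inspected while mapped
    munmap(map, FILE_SIZE);
    close(fd);
    return 0;
}
While the mapping is live (after the second memset and before munmap), VmRSS is ~100 MiB higher, confirming that mmap‑backed file pages are counted in the process RSS but **not** in the container‑level container_memory_rss (they appear in container_memory_mapped_file).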
tmpfs, emptyDir and System V IPC Shared Memory
An emptyDir volume with medium: Memory is backed by tmpfs; data resides in RAM and contributes to the pod’s memory cgroup. Without a size limit, such volumes can consume the entire node memory.
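System V IPC shared memory (shmget / shmat) is also backed by tmpfs. A simple writer/reader pair allocating 36 MiB demonstrates the allocation (visible with ipcs -m) and the need to release the segment with ipcrm after the processes exit, because System V segments persist until they are explicitly removed.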
Monitoring Practices
Self‑monitoring with getrusage
Programs can call syscall.Getrusage (Go) or getrusage(2) (C) to obtain max RSS, page‑fault counts, CPU time, and context switches, enabling internal logging of resource usage.
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	var u syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &u); err != nil {
		panic(err)
	}
	fmt.Printf("Max RSS: %v\n", u.Maxrss) // reported in KiB on Linux
	// simulate load
	for i := 0; i < 1e8; i++ {
		_ = i * i
	}
	time.Sleep(2 * time.Second)
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &u); err != nil {
		panic(err)
	}
	fmt.Printf("After sleep Max RSS: %v\n", u.Maxrss)
}
Top and PID namespaces
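Inside a container, the summary lines of top (total memory, load, CPU) show host‑wide statistics because files such as /proc/meminfo are not cgroup‑aware and reflect the whole node, while the process list is limited to the container's PID namespace. Sharing the PID namespace between containers (or using a privileged sidecar) allows a single top to see all container processes.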
Choosing an alert metric
Prometheus alerts typically use container_memory_working_set_bytes, which subtracts inactive_file (reclaimable page cache) from total usage, providing a more accurate view of memory pressure.
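The following expression reports working‑set usage as a percentage of the memory limit:
100 * container_memory_working_set_bytes{container="$container", pod="$pod", namespace="$ns"}
/ on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory", container="$container", pod="$pod", namespace="$ns"}
Conclusion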
The article explains how Linux handles page faults, how RSS is calculated, and how cgroup memory statistics (rss, cache, mapped_file) differ from process‑level metrics. It shows the impact of page cache, tmpfs, and shared memory on container memory accounting, and provides practical techniques (code examples, getrusage, cgroup inspection) for diagnosing unexpected memory usage and avoiding OOM kills in Kubernetes environments.
dbaplus Community