Why Does Container Memory Hover at 99%? Decoding RSS, PageCache, and cgroup Limits in K8s

This article explains why Kubernetes containers often show near‑full memory usage by exploring Linux process address spaces, the distinction between RSS and page cache, how cgroup statistics are collected, and provides step‑by‑step C and Go examples to reproduce and diagnose the behavior.

dbaplus Community

Linux Process Memory Allocation

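A Linux process's virtual address space is split into kernel space and user space; user space in turn holds the text, data, and BSS segments, the heap, the memory-mapped region (shared libraries and other mappings), and the stack. Each region is represented by a vm_area_struct (VMA), and all VMAs of a process are linked through its mm_struct. A segmentation fault occurs when an accessed address lies outside every VMA or the access violates the VMA's permissions; the layout can be inspected via /proc/<pid>/maps.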
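
As a rough illustration of the VMA layout described above, the following Go sketch parses the lines of /proc/self/maps into start address, end address, permissions, and backing path. The VMA type and the parseMapsLine helper are names invented for this example, and the program assumes a Linux /proc filesystem.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// VMA holds the fields of one /proc/<pid>/maps line that matter here.
// (Hypothetical type for this sketch, not a kernel or library type.)
type VMA struct {
	Start, End uint64
	Perms      string
	Path       string // empty for anonymous mappings
}

// parseMapsLine parses a line such as
// "00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/dbus-daemon".
func parseMapsLine(line string) (VMA, error) {
	fields := strings.Fields(line)
	if len(fields) < 5 {
		return VMA{}, fmt.Errorf("unexpected maps line: %q", line)
	}
	lo, hi, ok := strings.Cut(fields[0], "-")
	if !ok {
		return VMA{}, fmt.Errorf("bad address range: %q", fields[0])
	}
	start, err := strconv.ParseUint(lo, 16, 64)
	if err != nil {
		return VMA{}, err
	}
	end, err := strconv.ParseUint(hi, 16, 64)
	if err != nil {
		return VMA{}, err
	}
	v := VMA{Start: start, End: end, Perms: fields[1]}
	if len(fields) >= 6 {
		// The path is the sixth field; rejoin it in case it contains spaces.
		v.Path = strings.Join(fields[5:], " ")
	}
	return v, nil
}

func main() {
	f, err := os.Open("/proc/self/maps")
	if err != nil {
		fmt.Println("cannot read /proc/self/maps (not Linux?):", err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		v, err := parseMapsLine(sc.Text())
		if err != nil {
			continue // skip lines this sketch does not understand
		}
		fmt.Printf("%012x-%012x %s %8d KiB %s\n",
			v.Start, v.End, v.Perms, (v.End-v.Start)/1024, v.Path)
	}
}
```

Regions such as [heap] and [stack] appear in the path column, while anonymous mappings have an empty path.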

malloc and physical memory

malloc only reserves virtual address space: it grows the heap VMA or, with glibc and a large request such as the 128 MiB one below, creates a new anonymous mapping. It does not allocate physical pages immediately. Physical pages are allocated on first write, when the access triggers a page fault. The following C program demonstrates that RSS stays low after malloc and jumps by the size of the allocated region after memset, generating roughly one minor page fault per 4 KiB page.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

const int64_t MB = 1024*1024;

/* Print the peak RSS and the page-fault counters of this process. */
void max_rss(void){
    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    /* On Linux ru_maxrss is the peak RSS in KiB; it never decreases. */
    printf("Current max rss %ld kb, minor %ld, major %ld\n",
           r.ru_maxrss, r.ru_minflt, r.ru_majflt);
}

int main(void){
    printf("Pid %d\n", (int)getpid());
    int number = 128; // MB
    void *ptr = malloc(number*MB);            /* reserves address space only */
    if (!ptr){ perror("Out of memory"); exit(EXIT_FAILURE); }
    printf("Allocated %d MB, ptr %p\n", number, ptr);
    max_rss();
    sleep(60);
    memset(ptr, 0, number*MB);                /* first write faults the pages in */
    printf("Used %d MB via memset\n", number);
    max_rss();
    sleep(60);
    free(ptr);
    printf("Freed ptr %p\n", ptr);
    max_rss();                                /* peak value, so it stays high */
    sleep(60);
    return 0;
}

Typical output shows RSS increasing from a few megabytes to ~128 MiB after memset, corresponding to 32 768 minor page faults (128 MiB / 4 KiB).
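
The same first-touch behavior can be sketched in Go. The example below maps an anonymous region with syscall.Mmap, touches one byte per page, and compares the minor-fault counter from getrusage before and after. The helper names expectedMinorFaults and minorFaults are made up for this sketch, the 64 MiB size is arbitrary, and the observed count will differ from the estimate when transparent huge pages are in use or the Go runtime faults pages of its own.

```go
package main

import (
	"fmt"
	"syscall"
)

// expectedMinorFaults estimates the number of first-touch faults for a
// region of the given size when every page is faulted in individually.
func expectedMinorFaults(bytes, pageSize int64) int64 {
	if pageSize <= 0 {
		return 0
	}
	return (bytes + pageSize - 1) / pageSize
}

// minorFaults returns the minor page-fault counter of this process.
func minorFaults() (int64, error) {
	var u syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &u); err != nil {
		return 0, err
	}
	return int64(u.Minflt), nil
}

func main() {
	const size = 64 << 20 // 64 MiB, an arbitrary size for the demo
	pageSize := syscall.Getpagesize()

	// Anonymous private mapping: address space only, no physical pages yet.
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		fmt.Println("mmap failed:", err)
		return
	}
	defer syscall.Munmap(mem)

	before, err := minorFaults()
	if err != nil {
		fmt.Println("getrusage failed:", err)
		return
	}
	// Write one byte per page so that each page is faulted in.
	for off := 0; off < size; off += pageSize {
		mem[off] = 1
	}
	after, err := minorFaults()
	if err != nil {
		fmt.Println("getrusage failed:", err)
		return
	}

	fmt.Printf("estimated faults: %d, observed delta: %d\n",
		expectedMinorFaults(size, int64(pageSize)), after-before)
}
```

For the 128 MiB region used in the C program, the estimate is 128 MiB / 4 KiB = 32,768 faults, which matches the figure quoted above.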

container_memory_rss in Kubernetes

Process RSS is the sum of resident anonymous pages, file-backed pages, and shared memory pages. In a container, the metric container_memory_rss is taken from the rss field of the cgroup file memory.stat (cgroup v1), which counts **only anonymous pages** (MEMCG_RSS). File-backed and shared pages are accounted in MEMCG_CACHE.

Example program allocating three kinds of memory:

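#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

#define FILE_SIZE (4*1024*1024)   // 4 MiB
#define ANON_SIZE (8*1024*1024)   // 8 MiB
#define SHM_SIZE  (10*1024*1024)  // 10 MiB

/* Report a failed call and exit. */
static void die(const char *msg){ perror(msg); exit(EXIT_FAILURE); }

/* File-backed pages: RssFile in the process, cache in the cgroup. */
void allocate_filepages(void){
    int fd = open("tempfile", O_RDWR|O_CREAT|O_TRUNC, 0600);
    if (fd < 0) die("open");
    if (ftruncate(fd, FILE_SIZE) < 0) die("ftruncate");
    void *p = mmap(NULL, FILE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) die("mmap");
    memset(p, 0, FILE_SIZE);
    printf("Allocated %d MiB file-mapped\n", FILE_SIZE/(1024*1024));
}

/* Anonymous pages: RssAnon in the process, rss in the cgroup. */
void allocate_anonpages(void){
    void *p = malloc(ANON_SIZE);
    if (!p) die("malloc");
    memset(p, 0, ANON_SIZE);
    printf("Allocated %d MiB anonymous\n", ANON_SIZE/(1024*1024));
}

/* System V shared memory: RssShmem in the process, cache in the cgroup. */
void allocate_shm(void){
    int id = shmget(IPC_PRIVATE, SHM_SIZE, IPC_CREAT|0600);
    if (id < 0) die("shmget");
    void *p = shmat(id, NULL, 0);
    if (p == (void *)-1) die("shmat");
    memset(p, 0, SHM_SIZE);
    printf("Allocated %d MiB shared memory\n", SHM_SIZE/(1024*1024));
}

int main(void){
    printf("Process %d\n", (int)getpid());
    allocate_filepages();
    allocate_anonpages();
    allocate_shm();
    sleep(3600);   /* keep the process alive for inspection */
    return 0;
}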

Running the program shows top -p $pid reporting a RES larger than the explicit allocations, because RES also includes the resident pages of the program text, shared libraries, and stack. Inspecting /proc/<pid>/status reveals the split into RssAnon, RssFile, and RssShmem. The cgroup memory.stat reports rss (anonymous only) and cache (file-backed + shared).
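
The per-process split can also be read programmatically. The sketch below extracts VmRSS, RssAnon, RssFile, and RssShmem from the text of /proc/<pid>/status; rssBreakdown is a name chosen for this example, and the Rss* fields are assumed to be present (they are missing on older kernels).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// rssBreakdown extracts VmRSS, RssAnon, RssFile and RssShmem (all in kB)
// from the contents of /proc/<pid>/status. Missing fields are simply absent
// from the returned map.
func rssBreakdown(status string) map[string]int64 {
	out := map[string]int64{}
	for _, line := range strings.Split(status, "\n") {
		key, rest, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		switch key {
		case "VmRSS", "RssAnon", "RssFile", "RssShmem":
			fields := strings.Fields(rest) // e.g. ["8452", "kB"]
			if len(fields) == 0 {
				continue
			}
			if n, err := strconv.ParseInt(fields[0], 10, 64); err == nil {
				out[key] = n
			}
		}
	}
	return out
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status (not Linux?):", err)
		return
	}
	r := rssBreakdown(string(data))
	fmt.Printf("VmRSS=%d kB  RssAnon=%d kB  RssFile=%d kB  RssShmem=%d kB\n",
		r["VmRSS"], r["RssAnon"], r["RssFile"], r["RssShmem"])
}
```

Only the RssAnon part corresponds to what the cgroup reports as rss; RssFile and RssShmem end up under cache.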

container_memory_cache

Page cache stores file data in RAM to accelerate I/O. Writing a 100 MiB file and then checking free -m or /proc/meminfo shows the Cached field increase by roughly the file size. The first read places the pages on the Inactive(file) list; a second read promotes them to Active(file). In Kubernetes, container_memory_cache mirrors the cgroup cache statistic (MEMCG_CACHE), which aggregates page cache and shared memory. Experiments show that cache can dominate container memory usage while RSS stays low.
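
To see how much of a container's memory is cache rather than RSS, memory.stat can be parsed directly. The following sketch assumes the cgroup v1 field names rss and cache and the conventional in-container path /sys/fs/cgroup/memory/memory.stat; both are assumptions about the environment, and parseMemoryStat and cacheShare are helper names made up for the example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryStat parses the "key value" lines of a cgroup memory.stat file.
func parseMemoryStat(text string) map[string]uint64 {
	stats := map[string]uint64{}
	for _, line := range strings.Split(text, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		if n, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = n
		}
	}
	return stats
}

// cacheShare returns cache / (rss + cache), or 0 if both are zero.
func cacheShare(stats map[string]uint64) float64 {
	total := stats["rss"] + stats["cache"]
	if total == 0 {
		return 0
	}
	return float64(stats["cache"]) / float64(total)
}

func main() {
	// Assumed cgroup v1 location as seen from inside a container.
	const path = "/sys/fs/cgroup/memory/memory.stat"
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("cannot read", path, "(cgroup v2 or not in a container?):", err)
		return
	}
	stats := parseMemoryStat(string(data))
	fmt.Printf("rss=%d MiB cache=%d MiB mapped_file=%d MiB cache share=%.0f%%\n",
		stats["rss"]>>20, stats["cache"]>>20, stats["mapped_file"]>>20,
		100*cacheShare(stats))
}
```

A cache share close to 100% alongside a small rss is the pattern described above: the container looks full, but most of the memory is file data.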

container_memory_mapped_file

Memory-mapped files contribute to container_memory_mapped_file, which mirrors the mapped_file field in memory.stat (the kernel's NR_FILE_MAPPED counter). The following program creates a 100 MiB file, maps it, and overwrites its content:

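#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_PATH "/root/test.txt"
#define FILE_SIZE (100*1024*1024)

int main(void){
    /* Step 1: create a 100 MiB file filled with 'A' using write(2). */
    int fd = open(FILE_PATH, O_WRONLY|O_CREAT|O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }
    char *buf = malloc(FILE_SIZE);
    if (!buf) { perror("malloc"); return EXIT_FAILURE; }
    memset(buf, 'A', FILE_SIZE);
    if (write(fd, buf, FILE_SIZE) != FILE_SIZE) { perror("write"); return EXIT_FAILURE; }
    free(buf);
    close(fd);

    /* Step 2: map the file and overwrite it through the mapping. */
    fd = open(FILE_PATH, O_RDWR);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }
    char *map = mmap(NULL, FILE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }
    memset(map, 'B', FILE_SIZE);
    sleep(60);   /* pause so VmRSS and memory.stat can be inspected while mapped */
    munmap(map, FILE_SIZE);
    close(fd);
    return 0;
}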

Once the mapping has been written (after the memset and before munmap), VmRSS is ~100 MiB higher, confirming that mmap-backed file pages are counted in the process RSS but **not** in the container-level container_memory_rss (they appear in container_memory_cache and container_memory_mapped_file).

tmpfs, emptyDir and System V IPC Shared Memory

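An emptyDir volume with medium: Memory is backed by tmpfs; its data resides in RAM, is charged to the pod's memory cgroup, and counts against the memory limit of the container that writes it. Without a sizeLimit (and without container memory limits), such a volume can keep growing until it consumes the entire node memory.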

System V IPC shared memory (shmget / shmat) also uses tmpfs. A simple writer/reader pair allocating 36 MiB demonstrates this: the segment is visible in ipcs -m, and it stays allocated after the processes exit until it is removed with ipcrm.
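
One way to check how much RAM a tmpfs mount (for example a memory-backed emptyDir) currently holds is statfs(2). The sketch below is a minimal example under stated assumptions: the default path /dev/shm is only a placeholder for the volume's mount path, usedBytes is a helper invented here, and the program targets Linux. System V segments are not expected to show up under such a mount and are better checked with ipcs -m.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// tmpfsMagic is TMPFS_MAGIC from linux/magic.h.
const tmpfsMagic = 0x01021994

// usedBytes converts statfs block counts into bytes in use.
func usedBytes(blocks, bfree uint64, bsize int64) uint64 {
	if bfree > blocks || bsize <= 0 {
		return 0
	}
	return (blocks - bfree) * uint64(bsize)
}

func main() {
	// Placeholder path; pass the emptyDir mount path as the first argument.
	path := "/dev/shm"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}

	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	if int64(st.Type) != tmpfsMagic {
		fmt.Printf("%s is not tmpfs (type 0x%x)\n", path, int64(st.Type))
		return
	}
	used := usedBytes(uint64(st.Blocks), uint64(st.Bfree), int64(st.Bsize))
	fmt.Printf("%s: tmpfs, %d MiB in use (held in RAM)\n", path, used>>20)
}
```

Every byte reported here is memory charged to the pod, so it is worth comparing against the container's memory limit.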

Monitoring Practices

Self‑monitoring with getrusage

Programs can call syscall.Getrusage (Go) or getrusage(2) (C) to obtain max RSS, page‑fault counts, CPU time, and context switches, enabling internal logging of resource usage.

package main

import (
    "fmt"
    "log"
    "syscall"
    "time"
)

func main() {
    var u syscall.Rusage
    if err := syscall.Getrusage(syscall.RUSAGE_SELF, &u); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Max RSS: %v\n", u.Maxrss) // peak RSS, in KiB on Linux
    // simulate load
    for i := 0; i < 1e8; i++ {
        _ = i * i
    }
    time.Sleep(2 * time.Second)
    if err := syscall.Getrusage(syscall.RUSAGE_SELF, &u); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("After sleep Max RSS: %v\n", u.Maxrss)
}

Top and PID namespaces

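Inside a container, the summary lines of top (total memory, CPU) show host-wide statistics because /proc/meminfo and /proc/stat are not namespaced and reflect the whole node, while the process list is limited to the container's own PID namespace. Sharing the PID namespace between the containers of a pod (shareProcessNamespace: true), or using a privileged sidecar, allows a single top to see all container processes.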
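
The gap between what /proc reports and what the container may actually use can be made visible by printing both values. The sketch below compares MemTotal from /proc/meminfo with the cgroup memory limit; the two limit paths are the conventional cgroup v2 and v1 locations and are assumptions about the container's filesystem, and memTotalKB and parseCgroupLimit are helper names invented for the example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// memTotalKB extracts the MemTotal value (in kB) from /proc/meminfo text.
func memTotalKB(meminfo string) (uint64, bool) {
	for _, line := range strings.Split(meminfo, "\n") {
		if !strings.HasPrefix(line, "MemTotal:") {
			continue
		}
		fields := strings.Fields(line) // ["MemTotal:", "16384256", "kB"]
		if len(fields) < 2 {
			return 0, false
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		return n, err == nil
	}
	return 0, false
}

// parseCgroupLimit interprets the content of memory.max (v2) or
// memory.limit_in_bytes (v1). It reports false when no limit is set:
// "max" on v2, or a huge sentinel value on v1 (heuristic: >= 2^62).
func parseCgroupLimit(raw string) (uint64, bool) {
	s := strings.TrimSpace(raw)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil || n >= 1<<62 {
		return 0, false
	}
	return n, true
}

func main() {
	if data, err := os.ReadFile("/proc/meminfo"); err == nil {
		if kb, ok := memTotalKB(string(data)); ok {
			fmt.Printf("MemTotal from /proc/meminfo: %d MiB\n", kb/1024)
		}
	}
	// Assumed locations of the limit file for cgroup v2 and v1.
	paths := []string{
		"/sys/fs/cgroup/memory.max",
		"/sys/fs/cgroup/memory/memory.limit_in_bytes",
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		if limit, ok := parseCgroupLimit(string(data)); ok {
			fmt.Printf("cgroup limit from %s: %d MiB\n", p, limit>>20)
		} else {
			fmt.Printf("cgroup limit from %s: unlimited\n", p)
		}
		return
	}
	fmt.Println("no cgroup memory limit file found at the assumed paths")
}
```

In a limited container the first number is typically the node's RAM while the second is the pod's limit, which is why tools that only read /proc overstate the available memory.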

Choosing an alert metric

Prometheus alerts typically use container_memory_working_set_bytes, which subtracts inactive_file (reclaimable page cache) from total usage, providing a more accurate view of memory pressure.

100 * container_memory_working_set_bytes{container="$container", pod="$pod", namespace="$ns"}
/ on(namespace, pod, container)
kube_pod_container_resource_limits{resource="memory", container="$container", pod="$pod", namespace="$ns"}
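
The arithmetic behind the working-set metric and the percentage above can be sketched as follows. The subtraction and the clamp to zero follow the description in the text; the function names and the sample numbers in main are made up for illustration.

```go
package main

import "fmt"

// workingSet subtracts reclaimable inactive file cache from total usage,
// clamping at zero when the cache figure exceeds usage.
func workingSet(usage, inactiveFile uint64) uint64 {
	if inactiveFile >= usage {
		return 0
	}
	return usage - inactiveFile
}

// percentOfLimit expresses the working set as a percentage of the limit.
// A zero limit (no limit configured) yields 0.
func percentOfLimit(ws, limit uint64) float64 {
	if limit == 0 {
		return 0
	}
	return 100 * float64(ws) / float64(limit)
}

func main() {
	const MiB = 1 << 20
	// Illustrative values only: a container that looks almost full by raw
	// usage but holds mostly reclaimable page cache.
	usage := uint64(1000 * MiB)
	inactiveFile := uint64(600 * MiB)
	limit := uint64(1024 * MiB)

	ws := workingSet(usage, inactiveFile)
	fmt.Printf("raw usage:   %.1f%% of limit\n", percentOfLimit(usage, limit))
	fmt.Printf("working set: %.1f%% of limit\n", percentOfLimit(ws, limit))
}
```

With these sample numbers the raw usage sits near the limit while the working set is well below it, which is the reason alerts based on working set produce fewer false alarms than alerts on total usage.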

Conclusion

The article explains how Linux handles page faults, how RSS is calculated, and how cgroup memory statistics (rss, cache, mapped_file) differ from process-level metrics. It shows the impact of page cache, tmpfs, and shared memory on container memory accounting, and provides practical techniques (code examples, getrusage, cgroup inspection) for diagnosing unexpected memory usage and avoiding OOM kills in Kubernetes environments.
