Why Kubernetes Monitoring Differs from VM Metrics: CPU, Memory, Disk, Network
This article compares Kubernetes pod monitoring metrics with traditional KVM/VM metrics across CPU, memory, disk, and network, explaining the underlying reasons for the differences and offering guidance on interpreting the data for effective performance troubleshooting.
Kubernetes monitoring has been rolling out for some time, and the monitoring system now provides Kubernetes pod-level metrics and alert rules.
Because Kubernetes and traditional physical/virtual machines run in fundamentally different environments, the metrics they expose also differ; despite efforts to unify them, users still raise questions about the Kubernetes monitoring metrics.
This article explains the differences between physical/virtual machines (referred to as KVM) and Kubernetes from the perspectives of CPU, memory, disk, and network, helping users understand the underlying principles when using the monitoring product.
All data collected by the monitoring platform originates from Kubernetes native metrics, so it is inevitably constrained by the characteristics of Kubernetes native interfaces.
CPU differences are the most significant, dictated by Kubernetes' technical nature.
Memory differences exist but can largely be aligned with the KVM stack.
Disk and network differences are minimal, with little extra learning cost.
CPU
In KVM scenarios, users focus on CPU usage rate and CPU load:
High CPU load with low usage usually indicates a bottleneck in disk I/O.
High CPU usage with load far exceeding core count signals severe CPU resource shortage.
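The two KVM-side signals can be combined into a simple triage rule. A minimal sketch follows; the thresholds are illustrative assumptions for demonstration, not values taken from the monitoring product:

```python
def triage_kvm_cpu(usage_pct: float, load_avg: float, cores: int) -> str:
    """Classify a KVM host's CPU state from usage rate and load average.

    Thresholds are illustrative; tune them for your environment.
    """
    if usage_pct < 50 and load_avg > cores:
        # Many runnable/uninterruptible threads but mostly idle CPUs:
        # threads are likely stuck waiting on disk I/O.
        return "suspect disk I/O bottleneck"
    if usage_pct > 90 and load_avg > 2 * cores:
        # CPUs saturated and the run queue far exceeds the core count.
        return "severe CPU shortage"
    return "normal"

print(triage_kvm_cpu(usage_pct=30, load_avg=12, cores=8))
```

On an 8-core host, 30% usage with a load average of 12 falls into the first branch, matching the rule of thumb above.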
In Kubernetes, users monitor CPU usage rate and CPU throttling time:
High usage (near or slightly above 100%) combined with large throttling time indicates severe CPU shortage, requiring increased request or limit.
The differences arise because Kubernetes and KVM use different CPU isolation mechanisms and expose metrics differently.
The monitoring system provides two CPU‑related metrics and their corresponding data points:
The diagram below shows a throttled application where CPU usage exceeds 100%.
CPU Usage Rate
For a single CPU core, time is divided into three parts: user code execution, kernel code execution, and idle time.
In KVM, CPU usage rate is calculated as (user time + kernel time) / total time.
In Kubernetes, a pod does not own an exclusive core, so the formula changes; a pod with a CPU limit of 4 can use up to 4 seconds of CPU per second, and a pod using 0.5 seconds per second represents 50% of a core.
Kubernetes does not expose a native "usage rate" concept, but the monitoring system derives pod CPU usage rate as usage / limit.
Due to limited granularity of CPU limit implementation and measurement error, CPU usage rate may spike above 100% under extreme load.
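The usage / limit derivation can be sketched from raw cgroup counters. This assumes cgroup v1 conventions (`cpuacct.usage` is cumulative CPU time in nanoseconds; the core limit is `cpu.cfs_quota_us / cpu.cfs_period_us`); the sampling interval and numbers are made up:

```python
def pod_cpu_usage_rate(usage_ns_prev: int, usage_ns_now: int,
                       interval_s: float,
                       cfs_quota_us: int, cfs_period_us: int) -> float:
    """Pod CPU usage rate = cores consumed per wall second / limit in cores.

    cgroup v1: cpuacct.usage is cumulative nanoseconds of CPU time;
    the limit in cores is cfs_quota_us / cfs_period_us.
    """
    used_cores = (usage_ns_now - usage_ns_prev) / 1e9 / interval_s
    limit_cores = cfs_quota_us / cfs_period_us
    # Due to CFS granularity and measurement error this may briefly exceed 1.0.
    return used_cores / limit_cores

# A pod limited to 4 cores that burned 2 CPU-seconds in 1 wall second:
rate = pod_cpu_usage_rate(0, 2_000_000_000, 1.0, 400_000, 100_000)
print(f"{rate:.0%}")  # 50%
```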
CPU Load
CPU load measures system load using the number of runnable threads, including running threads and those in uninterruptible sleep (typically I/O).
Kubernetes provides a cpu_load metric that only includes running threads, losing the ability to detect I/O‑bound bottlenecks.
Because of differing CPU usage models, Kubernetes also offers a "CPU throttling time" metric, which captures the time a pod is limited by its CFS quota.
When throttling time is high, the pod’s CPU resources are insufficient and more resources should be allocated.
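Throttling data comes from the cgroup's `cpu.stat` file. A small sketch of parsing its cgroup v1 fields (`nr_periods`, `nr_throttled`, `throttled_time` in nanoseconds); the sample values are invented:

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse cgroup v1 cpu.stat into a dict of integer counters."""
    return {key: int(val)
            for key, val in (line.split() for line in text.splitlines() if line)}

def throttled_ratio(stat: dict) -> float:
    """Fraction of CFS scheduling periods in which the pod was throttled."""
    return stat["nr_throttled"] / stat["nr_periods"]

sample = """nr_periods 1000
nr_throttled 420
throttled_time 37000000000"""

stat = parse_cpu_stat(sample)
print(f"throttled in {throttled_ratio(stat):.0%} of periods, "
      f"{stat['throttled_time'] / 1e9:.1f}s total")
```

A pod throttled in 42% of its periods, as here, is a strong candidate for a higher CPU limit.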
Memory
KVM and Kubernetes both use memory usage rate, but they differ on what constitutes used memory.
KVM lacks a clear standard; monitoring uses total ‑ available, considering cache/buffer/slab as part of usage.
Kubernetes does not provide an available metric, so RSS (resident set size) is used as used memory.
The monitoring system offers several memory‑related metrics, illustrated below:
Linux
The free command shows the used, cache/buffer, and available columns.
used: includes process memory and cache/buffer.
cache/buffer: cached data.
available: estimated free memory considering cache/buffer and unreclaimable slab.
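The KVM-side convention (used = total - available) can be shown with a quick calculation; the host size and available figure below are hypothetical:

```python
def kvm_memory_used_pct(mem_total_kb: int, mem_available_kb: int) -> float:
    """KVM-side convention: used = total - available, so cache/buffer and
    unreclaimable slab count toward usage."""
    return (mem_total_kb - mem_available_kb) / mem_total_kb

# Hypothetical 16 GiB host where the kernel estimates 6 GiB as 'available':
print(f"{kvm_memory_used_pct(16 * 1024**2, 6 * 1024**2):.1%}")  # 62.5%
```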
Kubernetes exposes three memory values:
MemUsed: similar to Linux used, includes cache.
WorkingSet: excludes cold cache data, slightly smaller than MemUsed.
RSS: excludes cache, representing pure resident memory.
In practice, WorkingSet often appears high (around 90% for typical web apps), so the monitoring system chooses RSS as the primary memory usage metric, while recommending users consult other metrics when diagnosing performance issues.
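The relationship between the three values can be sketched in terms of cgroup v1 accounting (the kubelet derives the working set as usage minus inactive file cache; RSS comes from `total_rss` in `memory.stat`); the byte figures here are made up:

```python
def memory_views(usage_bytes: int, inactive_file_bytes: int,
                 rss_bytes: int) -> dict:
    """Relate the three Kubernetes memory values.

    cgroup v1: usage comes from memory.usage_in_bytes; working_set is
    usage minus total_inactive_file (cold page cache); rss is total_rss.
    """
    return {
        "mem_used": usage_bytes,
        "working_set": usage_bytes - inactive_file_bytes,
        "rss": rss_bytes,
    }

GiB = 1024**3
v = memory_views(usage_bytes=3 * GiB, inactive_file_bytes=1 * GiB,
                 rss_bytes=1 * GiB)
# mem_used >= working_set >= rss, matching the ordering described above.
print({name: f"{b / GiB:.0f} GiB" for name, b in v.items()})
```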
Disk / Network
Based on Linux cgroup isolation, disk and network metrics differ little between Kubernetes and KVM. Web applications usually care about disk usage, but Kubernetes clusters are typically disk‑less unless using persistent volumes, so disk space is not a concern.
For disk, the focus is on write performance metrics (illustrated below):
For network, similar to KVM, the focus is on traffic and packet loss metrics (illustrated below):
Source: https://tech.kujiale.com/jian-kong-pod-shi-wo-men-zai-jian-kong-shi-yao/