Avoiding CPU Throttling in a Containerized Environment with Cgroups and Cpusets
The article explains how Uber replaced CPU quota enforcement with cpusets (CPU pinning) for stateful workloads, reducing P99 latency, improving performance consistency, and saving about 11% of cores across the cluster by eliminating throttling.
Translated from the original article by Joakim Recht and Yury Vostrikov.
At Uber, all stateful workloads run on a shared, large‑scale container platform. These workloads include MySQL, Apache Cassandra, Elasticsearch, Apache Kafka, HDFS, Redis, Docstore, Schemaless, and many of them often share the same physical host.
With 65,000 physical machines, 2.4 million cores, and 200,000 containers, improving utilization to cut costs is a constant effort, but CPU throttling has recently hindered those gains.
The root cause is how the Linux kernel allocates CPU time to processes. This article describes switching from CPU quota to cpusets (also known as CPU pinning), which trades a slight increase in P50 latency for a significant reduction in P99 latency, ultimately allowing a cluster‑wide core reduction of about 11%.
Cgroups, Quotas, and Cpusets
CPU quotas and cpusets are scheduling features of the Linux kernel. Linux isolates resources via cgroups, which all container platforms rely on. Typically, a container maps to a single cgroup that controls the resources of every process inside the container.
There are two types of cgroup controllers for CPU isolation: the cpu controller (quota‑based) and the cpuset controller (pinning‑based). Both limit the CPUs a group of processes may use, but they do so in different ways.
CPU Quota
The cpu controller enforces isolation through a quota: for each container you specify the fraction of total CPU time it may consume within a scheduling period (typically 100 ms). The quota is calculated as:
quota = core_count × period
For example, a container that needs 2 cores is granted 200 ms of CPU time per 100 ms period.
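The arithmetic can be sketched as follows. The helper names are ours, not part of any kernel API; the second function renders the value in the `cpu.max` format used by cgroup v2, which takes quota and period in microseconds:

```python
# Hypothetical helpers illustrating the quota formula.
def cpu_quota_ms(core_count: float, period_ms: int = 100) -> float:
    """CPU time (ms) granted per scheduling period for a given core count."""
    return core_count * period_ms

def cpu_max_line(core_count: float, period_us: int = 100_000) -> str:
    """cgroup v2 cpu.max value: '<quota_us> <period_us>'."""
    return f"{int(core_count * period_us)} {period_us}"

print(cpu_quota_ms(2))       # a 2-core container: 200 ms per 100 ms period
print(cpu_max_line(2))       # what a platform would write to cpu.max
```

Note that the quota can exceed the period: a 2-core container is allowed 200 ms of CPU time per 100 ms of wall time, consumed across multiple cores in parallel.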
CPU Quota and Throttling
Because containers often run many threads, the quota‑based approach can cause them to exhaust their quota quickly, leading to throttling for the remainder of the period. The diagram below illustrates this effect:
For latency‑sensitive containers this is problematic: a request that normally completes in a few milliseconds can be delayed beyond 100 ms when throttling occurs.
A simple fix is to allocate more CPU time, but that is expensive at scale. Removing isolation altogether is also undesirable because one process could starve all others on the same host.
Using Cpusets to Avoid Throttling
The cpuset controller uses CPU pinning instead of quotas—it restricts a container to a specific set of cores. This makes it possible to distribute containers across different cores so that each core serves only one container, achieving full isolation without quotas or throttling. The following diagram shows the idea:
Two containers run on two distinct core groups; each may use its assigned cores fully but cannot use cores outside its set.
Enabling cpusets in a production database cluster eliminated all throttling, as shown in the next figure. As expected, the P99 latency became much more stable and overall latency dropped by roughly 50% because the severe throttling spikes disappeared.
It is worth noting that cpusets also have a minor downside: P50 latency typically increases slightly because the workload can no longer burst onto unassigned cores, bringing P50 and P99 latencies closer together—a generally desirable trade‑off.
Allocating CPUs
To use cpusets, a container must be bound to specific cores. Correct core allocation requires knowledge of modern CPU topology; a poor allocation can cause severe performance degradation.
CPU topology is typically built as follows:
A physical machine can have multiple CPU sockets.
Each socket has its own L3 cache.
Each socket contains multiple cores.
Each core has its own L1 and L2 caches.
Each core may support hyper‑threading.
Hyper‑threads are usually counted as cores, but allocating two hyper‑threads instead of one yields only about a 1.3× performance gain.
Because core numbers are not always contiguous or deterministic, the scheduler must read the exact hardware topology (e.g., from /proc/cpuinfo) and allocate physically close cores. An example of a non‑contiguous topology is shown below:
Using this information we can allocate cores that are physically close to each other:
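One way to recover this structure is to group logical CPUs by their socket and core IDs. The sketch below parses a truncated, hypothetical /proc/cpuinfo sample (only the three relevant fields) and shows that hyper-thread siblings need not be numbered consecutively:

```python
from collections import defaultdict

# Hypothetical 2-socket sample: logical CPUs 0 and 2 are siblings on
# socket 0, while 1 and 3 are siblings on socket 1.
SAMPLE_CPUINFO = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 1
core id : 0

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 1
core id : 0
"""

def topology(cpuinfo: str) -> dict:
    """Map (socket, core) -> sorted list of logical CPU numbers."""
    topo = defaultdict(list)
    for block in cpuinfo.strip().split("\n\n"):
        fields = dict(
            (k.strip(), v.strip())
            for k, v in (line.split(":", 1) for line in block.splitlines())
        )
        key = (int(fields["physical id"]), int(fields["core id"]))
        topo[key].append(int(fields["processor"]))
    return {k: sorted(v) for k, v in topo.items()}

# Non-contiguous numbering: siblings are {0, 2} and {1, 3}.
print(topology(SAMPLE_CPUINFO))
```

An allocator built on such a map can hand out sibling hyper-threads together and prefer cores sharing an L3 cache, rather than trusting the numbering.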
Drawbacks and Limitations
While cpusets solve most latency issues, they have several trade‑offs:
Cannot allocate fractional cores. This is not a problem for large database processes, but it limits the number of containers to the number of physical cores.
System‑wide processes can still steal CPU time. Services running directly on the host (systemd, kernel workers, etc.) need CPU cycles; they could be pinned to a limited core set, but that is complex.
Fragmentation requires periodic defragmentation. Over time, free cores become scattered, requiring migration of processes to create contiguous blocks. Moving a process between sockets can increase memory latency; a follow‑up article discusses mitigation.
No burst limits. In some cases you may want a container to temporarily use unassigned cores for a burst; this can be achieved by combining cpusets with quotas or by allowing multiple containers to share a core.
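Whatever set of cores an allocator ultimately picks, it is written to the cgroup's cpuset.cpus file in the kernel's cpu-list syntax (e.g. "0-3,8-11"). A small formatter for that syntax, with a hypothetical helper name, might look like:

```python
def cpulist(cores: set) -> str:
    """Render a set of core IDs in kernel cpu-list syntax, e.g. '0-3,8-11'."""
    runs = []
    sorted_cores = sorted(cores)
    start = prev = sorted_cores[0]
    for c in sorted_cores[1:]:
        if c != prev + 1:          # gap found: close the current run
            runs.append((start, prev))
            start = c
        prev = c
    runs.append((start, prev))
    return ",".join(f"{a}-{b}" if a != b else f"{a}" for a, b in runs)

# Two contiguous blocks of four cores each, separated by a gap:
print(cpulist({0, 1, 2, 3, 8, 9, 10, 11}))
```

A fragmented cluster produces many short runs in this string; defragmentation is essentially the process of migrating containers until each one's set collapses back into a single run of physically close cores.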
Conclusion
Switching stateful workloads to cpusets was a major improvement for Uber. It delivered more stable database‑level latency, saved roughly 11% of cores by eliminating over‑provisioning for throttling spikes, and ensured consistent performance across hosts. Kubernetes also supports cpusets via its static CPU management policy.
Details of Uber’s testing methodology for quotas and cpusets can be found in the appendix.
Reference Links
[1] Original article: https://eng.uber.com/avoiding-cpu-throttling-in-a-containerized-environment/
[2] Linux kernel NUMA documentation: https://www.kernel.org/doc/html/latest/vm/numa.html
[3] Kubernetes static CPU policy: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy
[4] Appendix: https://gist.github.com/ubermunck/2f116b7817812ae6255d19a4e10242f4