Eliminating Container CPU Throttling: Uber’s Switch to cpusets Saves 11% of Cores
Uber reduced cluster-wide CPU allocation by 11% and stabilized P99 latency by replacing CPU quota throttling with cpusets, a CPU‑pinning technique that isolates containers on specific cores, eliminating throttling while only slightly increasing P50 latency.
Background
Uber runs stateful services such as MySQL, Cassandra, Elasticsearch, Kafka, HDFS, Redis, Docstore, and Schemaless on a shared container platform that spans 65,000 hosts, 2.4 million CPU cores and 200,000 containers. High utilization is required to keep costs low, but CPU‑quota throttling limited further gains.
Cgroups, CPU Quota, and Cpusets
Linux isolates resources with cgroups. Each container is placed in its own cgroup. Two CPU‑related controllers exist:
cpu – enforces a time‑based quota (quota/period) for the group.
cpuset – pins the group to a specific set of logical CPUs.
The quota is calculated as quota = core_count × period. The default period is 100 ms, so a container that needs two full cores receives a quota of 200 ms per period.
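As a small worked example of the formula above, the sketch below converts a core count into a quota in microseconds, the unit used by the cgroup v1 files cpu.cfs_period_us and cpu.cfs_quota_us. The function name cfs_quota_us is illustrative and not part of any tool.

```shell
#!/bin/sh
# quota = core_count x period, expressed in microseconds.
# cfs_quota_us is a hypothetical helper name used only for this sketch.
cfs_quota_us() (
    cores=$1
    period_us=${2:-100000}   # default period of 100 ms
    echo $(( cores * period_us ))
)

cfs_quota_us 2   # prints 200000, i.e. 200 ms of CPU time per 100 ms period
```

The resulting value is what would be written to cpu.cfs_quota_us for that container's cgroup.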
Impact of CPU Quota Throttling
Multithreaded workloads quickly consume their quota. When the quota is exhausted, the kernel throttles the cgroup for the remainder of the period, causing latency spikes. Requests that normally finish in a few milliseconds can exceed 100 ms. Raising the quota per container eliminates throttling but increases the overall CPU allocation, which is expensive at Uber’s scale.
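The arithmetic behind those spikes can be sketched as follows. The helper assumes a simplified worst case in which every thread is CPU‑bound and runs on its own logical CPU from the start of the period; real workloads are burstier. The function name is hypothetical.

```shell
#!/bin/sh
# Estimate how long a cgroup is stalled in each period once its quota is gone.
# Simplifying assumption: all threads burn CPU in parallel from the period start.
throttle_stall_ms() (
    threads=$1
    quota_ms=$2
    period_ms=${3:-100}
    if [ $(( threads * period_ms )) -le "$quota_ms" ]; then
        echo 0   # the threads cannot exhaust the quota within one period
    else
        # quota is used up after quota/threads ms of wall-clock time
        echo $(( period_ms - quota_ms / threads ))
    fi
)

throttle_stall_ms 8 200   # prints 75
```

With a two‑core quota (200 ms) and eight busy threads, the quota lasts 25 ms and the container is then stalled for the remaining 75 ms of the period, which is how a request of a few milliseconds can end up taking close to 100 ms.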
Using Cpusets for Isolation
Cpusets assign a container to a fixed list of logical CPUs. The container can use the full capacity of those CPUs without any quota enforcement, eliminating throttling. In production, enabling cpusets removed all throttling events for a database cluster and reduced the 99th‑percentile (P99) latency by roughly 50 % while keeping the same throughput.
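One way to check whether a container is still being throttled is to read the nr_periods and nr_throttled counters that the kernel exposes in the cgroup's cpu.stat file. The sketch below computes the share of periods in which throttling occurred; the helper name and the sample numbers are made up for illustration, and the real file location depends on the cgroup version and layout of the host.

```shell
#!/bin/sh
# Print the percentage of enforcement periods in which the cgroup was throttled.
# Expects the path of a cpu.stat file; throttled_pct is a hypothetical helper.
throttled_pct() {
    awk '$1 == "nr_periods"   { p = $2 }
         $1 == "nr_throttled" { t = $2 }
         END { if (p > 0) printf "%d\n", t * 100 / p; else print 0 }' "$1"
}

# Demo on sample data (illustrative values, not measurements).
sample=$(mktemp)
printf 'nr_periods 200\nnr_throttled 50\nthrottled_time 123456789\n' > "$sample"
throttled_pct "$sample"   # prints 25
rm -f "$sample"
```

A container that has no quota applied and is isolated purely with cpusets should show nr_throttled staying at zero over time.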
CPU Topology and Core Assignment
Correct core selection requires awareness of modern CPU topology:
Multiple physical CPU sockets per host, each with its own L3 cache.
Each socket contains several cores.
Each core has private L2/L1 caches.
Hyper‑threading exposes logical CPUs; two threads on the same core share most caches.
Assigning non‑contiguous or arbitrarily chosen core IDs can degrade performance. Uber’s scheduler parses /proc/cpuinfo to obtain the exact hardware layout and prefers physically adjacent cores to minimise cross‑socket traffic.
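A minimal sketch of extracting that layout from /proc/cpuinfo is shown below. It is not Uber's scheduler code; it assumes the x86 field names "processor", "physical id" and "core id" (other architectures expose different fields), and the function name is hypothetical.

```shell
#!/bin/sh
# Print one line per logical CPU as "socket core cpu", sorted so that
# hyper-thread siblings and cores of the same socket end up next to each other.
# Assumes x86-style /proc/cpuinfo fields; cpu_topology is a hypothetical helper.
cpu_topology() {
    awk -F: '
        { gsub(/^[ \t]+|[ \t]+$/, "", $1); gsub(/^[ \t]+|[ \t]+$/, "", $2) }
        $1 == "processor"   { cpu = $2 }
        $1 == "physical id" { socket = $2 }
        $1 == "core id"     { print socket, $2, cpu }
    ' "${1:-/proc/cpuinfo}" | sort -n -k1,1 -k2,2 -k3,3
}

# Show the first few entries on hosts that expose /proc/cpuinfo.
if [ -r /proc/cpuinfo ]; then
    cpu_topology | head -n 4
fi
```

Two lines that share the same socket and core values are hyper‑thread siblings, so a scheduler can read adjacent lines of this output to pick logical CPUs that share caches.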
Implementation Steps
Identify the set of logical CPUs to allocate (e.g., 0-3,8-11; which physical cores and sockets these IDs map to depends on the host’s topology).
Create a cpuset cgroup for the container (the paths below follow the cgroup v1 layout), e.g.:
mkdir -p /sys/fs/cgroup/cpuset/db1
echo 0-3 > /sys/fs/cgroup/cpuset/db1/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/db1/cpuset.mems
Move the container’s processes into the cgroup. With Docker, the runtime sets up the container’s cpuset itself when given the equivalent flags:
docker run --cpuset-cpus="0-3" --cpuset-mems="0" mydbimage
Verify the assignment with cat /sys/fs/cgroup/cpuset/db1/cpuset.cpus and monitor cgroup.procs for the expected PIDs.
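Before writing a CPU list to cpuset.cpus it can help to expand it and confirm it covers the intended number of logical CPUs. The sketch below expands the list syntax used in the steps above (comma‑separated IDs and ranges); the function name is hypothetical and it does no validation of malformed input.

```shell
#!/bin/sh
# Expand a cpuset list such as "0-3,8-11" into one CPU ID per line.
# expand_cpu_list is a hypothetical helper; input is assumed to be well formed.
expand_cpu_list() (
    IFS=,
    for part in $1; do
        case $part in
            *-*)
                lo=${part%-*}
                hi=${part#*-}
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    echo "$i"
                    i=$(( i + 1 ))
                done
                ;;
            *)
                echo "$part"
                ;;
        esac
    done
)

expand_cpu_list "0-3,8-11" | tr '\n' ' '   # prints: 0 1 2 3 8 9 10 11
echo
```

Counting the output lines gives the number of logical CPUs the container would receive, eight in this example.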
Limitations and Trade‑offs
Only whole logical CPUs can be allocated; fractional cores are not possible, which caps the number of exclusively pinned containers on a host at the number of available logical CPUs.
System‑wide services (systemd, kernel workers) still run on the default cpuset and can contend for CPU time unless they are explicitly pinned or scheduled with real‑time policies.
Over time the pool of free CPUs can become fragmented, requiring periodic defragmentation or live migration of containers to maintain contiguous core blocks.
Exclusive cpusets eliminate burst capacity: unused cores cannot be borrowed by other containers unless a hybrid approach (CPU quota + cpusets) is used.
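The fragmentation problem can be illustrated with a small search for a contiguous block among the free CPUs. This is a simplified sketch, not the allocator described in the source: it treats consecutive logical CPU IDs as adjacent, which only holds if the IDs have first been ordered by physical topology. The function name and the sample free list are hypothetical.

```shell
#!/bin/sh
# Print the first run of N consecutive CPU IDs from a sorted list of free IDs,
# or return non-zero if the free pool is too fragmented to provide one.
# Usage: first_contiguous_block N id1 id2 ...
first_contiguous_block() (
    want=$1
    shift
    run=""
    count=0
    prev=""
    for cpu in "$@"; do
        if [ -n "$prev" ] && [ "$cpu" -eq $(( prev + 1 )) ]; then
            run="$run,$cpu"
            count=$(( count + 1 ))
        else
            run=$cpu       # gap found: start a new run at this CPU
            count=1
        fi
        prev=$cpu
        if [ "$count" -eq "$want" ]; then
            echo "$run"
            exit 0
        fi
    done
    exit 1
)

first_contiguous_block 4 0 1 5 6 7 8 12   # prints 5,6,7,8
```

In this example seven CPUs are free, but only one block of four adjacent IDs exists; a second four‑CPU container could not be placed without migrating something first.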
Results
After migrating stateful services to cpusets:
P99 latency became stable and dropped by ~50 % compared with quota‑based isolation.
Overall CPU provisioning decreased by ~11 % because the need to over‑provision for throttling spikes disappeared.
Performance variance across hosts was reduced, yielding more predictable latency.
References
Original article: https://eng.uber.com/avoiding-cpu-throttling-in-a-containerized-environment/
NUMA documentation: https://www.kernel.org/doc/html/latest/vm/numa.html
Kubernetes static cpuset policy: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy
Testing details (GitHub Gist): https://gist.github.com/ubermunck/2f116b7817812ae6255d19a4e10242f4
dbaplus Community
