Eliminating Container CPU Throttling: Uber’s Switch to cpusets Saves 11% of Cores

Uber reduced cluster-wide CPU allocation by 11% and stabilized P99 latency by replacing CPU quota throttling with cpusets, a CPU‑pinning technique that isolates containers on specific cores, eliminating throttling while only slightly increasing P50 latency.

dbaplus Community

Background

Uber runs stateful services such as MySQL, Cassandra, Elasticsearch, Kafka, HDFS, Redis, Docstore, and Schemaless on a shared container platform that spans 65,000 hosts, 2.4 million CPU cores and 200,000 containers. High utilization is required to keep costs low, but CPU‑quota throttling limited further gains.

Cgroups, CPU Quota, and Cpusets

Linux isolates resources with control groups (cgroups). Each container is placed in its own cgroup. Two CPU-related controllers matter here:

cpu – enforces a time‑based quota (quota/period) for the group.

cpuset – pins the group to a specific set of logical CPUs.

The quota is the CPU time the cgroup may consume in each period, calculated as quota = core_count × period. With the default period of 100 ms, a container that needs two full cores receives a quota of 200 ms of CPU time per period.
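The formula can be sketched as a small shell helper. This is an illustration, not Uber's tooling: the function name cfs_quota_us is hypothetical, and the microsecond unit is an assumption taken from the cgroup v1 files cpu.cfs_period_us and cpu.cfs_quota_us.

```shell
# Compute a CFS quota in microseconds: quota = core_count x period.
# The period defaults to 100000 us (100 ms); fractional core counts are allowed.
cfs_quota_us() {
    cores="$1"
    period_us="${2:-100000}"
    # Adding 0.5 before truncation rounds to the nearest microsecond.
    awk -v c="$cores" -v p="$period_us" 'BEGIN { printf "%d\n", c * p + 0.5 }'
}

cfs_quota_us 2      # two full cores -> 200000
cfs_quota_us 0.5    # half a core    -> 50000
```

On a cgroup v1 host, the resulting value is what would be written to the container's cpu.cfs_quota_us file.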

Impact of CPU Quota Throttling

Multithreaded workloads can consume their quota quickly, because every running thread draws on the same quota at the same time. When the quota is exhausted, the kernel throttles the cgroup for the remainder of the period, causing latency spikes: requests that normally finish in a few milliseconds can take more than 100 ms. Raising the quota per container eliminates throttling but increases the overall CPU allocation, which is expensive at Uber’s scale.
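A rough worked example shows how quickly this happens. It assumes a simplified model in which a fixed number of CPU-bound threads run continuously from the start of each period; the function name throttled_ms_per_period is hypothetical.

```shell
# Wall-clock time (ms) a cgroup spends throttled in each period when
# `threads` CPU-bound threads run continuously (simplified model).
throttled_ms_per_period() {
    quota_ms="$1"; period_ms="$2"; threads="$3"
    awk -v q="$quota_ms" -v p="$period_ms" -v t="$threads" 'BEGIN {
        run = q / t               # wall time until the quota is used up
        if (run > p) run = p      # quota is not exhausted within the period
        printf "%g\n", p - run
    }'
}

throttled_ms_per_period 200 100 10   # 10 threads, 200 ms quota -> 80
throttled_ms_per_period 200 100 2    # 2 threads never exceed 2 cores -> 0
```

With ten busy threads, the 200 ms quota is gone after 20 ms of wall-clock time and the container is paused for the remaining 80 ms. On a cgroup v1 host, the nr_throttled and throttled_time counters in the cgroup's cpu.stat file record how often this occurs in practice.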

Using Cpusets for Isolation

Cpusets assign a container to a fixed list of logical CPUs. The container can use the full capacity of those CPUs without any quota enforcement, eliminating throttling. In production, enabling cpusets removed all throttling events for a database cluster and reduced the 99th‑percentile (P99) latency by roughly 50 % while keeping the same throughput.

CPU Topology and Core Assignment

Correct core selection requires awareness of modern CPU topology:

Multiple physical CPU sockets per host, each with its own L3 cache.

Each socket contains several cores.

Each core has private L2/L1 caches.

Hyper-threading exposes each physical core as two logical CPUs; the two threads on the same core share that core’s L1 and L2 caches.

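Logical CPU IDs are not guaranteed to be contiguous or to follow the physical layout, so picking cores by ID alone can spread a container’s threads across sockets and degrade performance. Uber’s scheduler parses /proc/cpuinfo to obtain the exact hardware layout and prefers physically adjacent cores to minimise cross-socket traffic.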
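As a minimal sketch of that kind of parsing (not Uber's scheduler code), the helper below groups logical CPUs by socket and physical core. It assumes the x86 field names processor, physical id and core id; the function names and the sample host (one socket, two cores, four threads) are hypothetical.

```shell
# Sample input in /proc/cpuinfo format, trimmed to the three fields used.
# Hypothetical host: 1 socket, 2 cores, 2 hardware threads per core.
sample_cpuinfo() {
    cat <<'EOF'
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 0
core id     : 0

processor   : 3
physical id : 0
core id     : 1
EOF
}

# Print one line per physical core with the logical CPUs that share it.
core_map() {
    awk -F': *' '
        /^processor/   { cpu = $2 }
        /^physical id/ { sock = $2 }
        /^core id/ {
            key = "socket=" sock " core=" $2
            if (key in cpus) { cpus[key] = cpus[key] "," cpu }
            else             { cpus[key] = cpu }
        }
        END { for (k in cpus) print k " cpus=" cpus[k] }
    ' | sort
}

sample_cpuinfo | core_map
# socket=0 core=0 cpus=0,2
# socket=0 core=1 cpus=1,3
```

In this sample, logical CPUs 0 and 2 are hyper-thread siblings on the same core even though their IDs are not adjacent, which is why allocation by ID alone is unreliable. On a real x86 Linux host the same function can be fed with core_map < /proc/cpuinfo.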

Implementation Steps

Identify the set of logical CPUs to allocate (e.g., 0-3,8-11; which physical cores and sockets these IDs correspond to depends on the host’s topology).

Create a cpuset cgroup for the container, e.g.:

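# cgroup v1 layout; run as root
mkdir -p /sys/fs/cgroup/cpuset/db1
echo 0-3 > /sys/fs/cgroup/cpuset/db1/cpuset.cpus   # logical CPUs the group may use
echo 0 > /sys/fs/cgroup/cpuset/db1/cpuset.mems     # NUMA memory node(s) the group may use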

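Move the container’s processes into the cgroup by writing their PIDs to its cgroup.procs file. With Docker, the runtime creates and populates a cpuset cgroup itself when given the equivalent flags: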

docker run --cpuset-cpus="0-3" --cpuset-mems="0" mydbimage

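Verify the assignment: cat /sys/fs/cgroup/cpuset/db1/cpuset.cpus should print the CPU list, and cgroup.procs in the same directory should list the expected PIDs. For any process, the Cpus_allowed_list field in /proc/<pid>/status shows the CPUs the kernel actually allows.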
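The verification step can be partly automated. The helpers below are a sketch with hypothetical names: they expand a cpuset-style list so that two lists can be compared CPU by CPU, and read the allowed CPU list for a PID from /proc (Linux only).

```shell
# Expand a cpuset-style list such as "0-3,8-11" into one CPU ID per line.
expand_cpulist() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F'-' '
        NF == 1 { print $1 + 0 }
        NF == 2 { for (i = $1 + 0; i <= $2 + 0; i++) print i }'
}

# Number of logical CPUs named by a list.
count_cpus() {
    expand_cpulist "$1" | wc -l | tr -d ' '
}

# CPU list the kernel allows for a PID, read from /proc/<pid>/status.
allowed_cpus_of_pid() {
    awk '/^Cpus_allowed_list/ { print $2 }' "/proc/$1/status"
}

count_cpus "0-3,8-11"    # prints 8
```

Comparing expand_cpulist output for the requested list with the output for allowed_cpus_of_pid avoids false mismatches when the same set is written in two different ways.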

Limitations and Trade‑offs

Only whole logical CPUs can be allocated; fractional cores are not possible, which caps the number of exclusively pinned containers on a host at the number of available CPUs.

System-wide services (systemd, kernel workers) still need somewhere to run and can contend for CPU time on a container’s cores unless they are explicitly pinned elsewhere or the container’s processes are given priority, for example through a real-time scheduling policy.

Over time the pool of free CPUs can become fragmented, requiring periodic defragmentation or live migration of containers to maintain contiguous core blocks.

Exclusive cpusets eliminate burst capacity: unused cores cannot be borrowed by other containers unless a hybrid approach (CPU quotas combined with cpusets) is used.
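The whole-CPU limitation is easy to quantify. The sketch below assumes exclusive cpusets and a hypothetical 32-CPU host; both function names are illustrative.

```shell
# Round a fractional core request up to whole logical CPUs.
whole_cpus() {
    awk -v r="$1" 'BEGIN { n = int(r); if (n < r) n++; print n }'
}

# Number of containers of a given size that fit on a host when every
# container needs its own exclusive set of whole CPUs.
max_containers() {
    host_cpus="$1"
    per_container="$(whole_cpus "$2")"
    echo $(( host_cpus / per_container ))
}

whole_cpus 2.5           # prints 3
max_containers 32 2.5    # prints 10
max_containers 32 0.5    # prints 32
```

A container that would be given 0.5 cores under quotas occupies a full logical CPU under exclusive cpusets, so a 32-CPU host holds at most 32 such containers.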

Results

After migrating stateful services to cpusets:

P99 latency became stable and dropped by ~50 % compared with quota‑based isolation.

Overall CPU provisioning decreased by ~11 % because the need to over‑provision for throttling spikes disappeared.

Performance variance across hosts was reduced, yielding more predictable latency.

References

Original article: https://eng.uber.com/avoiding-cpu-throttling-in-a-containerized-environment/

NUMA documentation: https://www.kernel.org/doc/html/latest/vm/numa.html

Kubernetes static cpuset policy: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy

Testing details (GitHub Gist): https://gist.github.com/ubermunck/2f116b7817812ae6255d19a4e10242f4

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
