How CPU Architecture Bottlenecks Cripple Netflix’s Container Scaling
Netflix discovered that scaling hundreds of containers per host on modern CPUs hit severe contention on mount-related kernel locks, with performance varying across AWS instance types, NUMA topologies, and hyper-threading configurations. Redesigning containerd's mounting workflow and adopting hardware-aware scheduling restored efficient scaling.
Problem
When Netflix expands its fleet of containers to serve millions of users, the rapid provisioning of new AWS instances and the assignment of Pods to those nodes can cause a node to transition from idle to fully saturated in seconds. During the migration from a legacy kubelet + Docker stack to a modern kubelet + containerd runtime, a critical bottleneck emerged: CPU architecture‑related lock contention.
Challenge
On nodes running the r5.metal instance type, container creation triggered dramatic growth of the mount table, causing health-check timeouts and frequent kubelet-to-containerd communication failures. The root cause was the massive number of mount and unmount operations containerd performs while assembling the overlayfs root filesystem from each image layer, leading to contention on several global VFS locks.
Diagnosis
Investigation revealed the following mount sequence for each layer when using user namespaces:
Call open_tree() to obtain a reference to the layer directory.
Call mount_setattr() to set an ID‑map matching the container’s user namespace.
Call move_mount() to create a bind mount on the host.
These bind mounts are removed again after the overlayfs root is assembled, but when many containers start concurrently, each CPU core spends most of its time acquiring the global mount locks. For example, launching 100 containers, each with a 50-layer image, requires roughly 20,200 mount operations, each contending for the same kernel locks.
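For illustration, the three-step sequence above looks roughly like this at the syscall level. This is a hedged sketch, not containerd's actual code: the helper name and error handling are hypothetical, and raw syscall(2) is used because glibc provides no wrappers for these calls (Linux 5.12+ assumed for mount_setattr()):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/mount.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Hypothetical helper: create an ID-mapped bind mount of one
     * image layer, mirroring the three steps listed above. */
    static int idmap_bind_layer(const char *layer_dir,
                                const char *bind_target, int userns_fd)
    {
        /* 1. open_tree(): detached reference to the layer directory. */
        int tree_fd = syscall(SYS_open_tree, AT_FDCWD, layer_dir,
                              OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
        if (tree_fd < 0)
            return -1;

        /* 2. mount_setattr(): apply the ID mapping of the container's
         *    user namespace to the detached tree. */
        struct mount_attr attr = {
            .attr_set  = MOUNT_ATTR_IDMAP,
            .userns_fd = userns_fd,
        };
        if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
                    &attr, sizeof(attr)) < 0) {
            close(tree_fd);
            return -1;
        }

        /* 3. move_mount(): attach it as a bind mount on the host --
         *    the step that takes the global mount lock. */
        if (syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, bind_target,
                    MOVE_MOUNT_F_EMPTY_PATH) < 0) {
            close(tree_fd);
            return -1;
        }
        close(tree_fd);
        return 0;
    }

Step 3, together with the matching unmount once the overlayfs root is assembled, is where every concurrently starting container serializes on the same kernel locks.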
The mount-related code path dominates the containerd flame graph (Fig 1).
Hardware Benchmark
To understand why the issue manifested primarily on r5.metal, Netflix benchmarked container startup across three AWS instance families:
r5.metal – dual‑socket Intel Xeon, multiple NUMA domains.
m7i.metal‑24xl – single‑socket Intel, single NUMA domain.
m7a.24xlarge – single‑socket AMD, single NUMA domain.
Baseline results (Fig 3) showed that at low concurrency (< 20 containers) all platforms performed similarly, but as concurrency grew, r5.metal began to fail around 100 containers, while the newer instances maintained lower latency and higher success rates. The m7a series exhibited the most stable scaling.
Deep Dive into the Bottleneck
Performance profiling identified the VFS path_init() function as the hot spot, where threads spin on a sequence lock while waiting for the global mount lock (see assembly snippet below).
path_init():
    mov   mount_lock, %eax      # load the global mount_lock sequence count
    test  $0x1, %al             # odd count => a writer currently holds the lock
    je    7c                    # even => safe to proceed with the path lookup
    pause                       # spin-wait hint, then loop and re-read
    ...

Further analysis using Intel's Top-Down Microarchitecture Analysis (TMA) methodology showed that 95.5% of pipeline slots were stalled on contested accesses, with 57% attributable to false sharing. The root cause was cache-line contention on the global lock structures.
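To make false sharing concrete, here is a small self-contained C demonstration (a hypothetical illustration, unrelated to Netflix's or the kernel's code): two threads increment counters that share one cache line, then run the same workload with each counter padded onto its own 64-byte line. The shared-line run generates exactly the cross-core invalidation traffic that TMA classifies as contested accesses:

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 50000000UL

    /* Two counters on one cache line vs. one 64-byte line each. */
    static struct { unsigned long a, b; } same_line
        __attribute__((aligned(64)));
    static struct {
        unsigned long a __attribute__((aligned(64)));
        unsigned long b __attribute__((aligned(64)));
    } padded;

    static void *bump(void *arg)            /* hammer one counter */
    {
        unsigned long *ctr = arg;
        for (unsigned long i = 0; i < ITERS; i++)
            __atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
        return NULL;
    }

    static double run_pair(unsigned long *x, unsigned long *y)
    {
        struct timespec t0, t1;
        pthread_t ta, tb;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&ta, NULL, bump, x);
        pthread_create(&tb, NULL, bump, y);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        /* The first run false-shares; the second does not. On a
         * dual-socket machine the gap widens further when the two
         * threads land on different sockets. */
        printf("same cache line: %.2fs\n", run_pair(&same_line.a, &same_line.b));
        printf("padded lines:    %.2fs\n", run_pair(&padded.a, &padded.b));
        return 0;
    }

Compile with gcc -O2 -pthread; the two threads in the first run never touch each other's data, yet each increment still forces the other core's cache line to be invalidated.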
NUMA effects were examined: dual‑socket systems with separate memory domains (r5.metal) suffered from remote memory latency when threads accessed locks located on the opposite socket. Disabling hyper‑threading on the m7i.metal‑24xl reduced latency by 20‑30% because logical CPUs no longer shared execution resources that amplified lock contention.
Two cache‑architecture models were compared:
Centralized cache architecture – a single request queue (TOR, Table of Requests) becomes a bottleneck when many threads contend for the same lock.
Distributed cache architecture – lock traffic is spread across multiple chiplet domains, reducing contention.
A custom micro-benchmark that spawns many threads contending on a global lock confirmed these observations: eliminating cross-socket NUMA traffic (single socket) and disabling hyper-threading both improved latency, and the distributed cache design of the m7a instance delivered the best scaling.
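A stand-in for such a program might look like the following (an illustrative sketch, not Netflix's actual tool): N threads serialize on one global spinlock, standing in for the kernel's mount_lock, and the reported ns/op shows how latency degrades as the thread count rises. Running it under numactl to confine it to one socket, or on a machine with SMT disabled, reproduces the NUMA and hyper-threading comparisons:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define OPS_PER_THREAD 1000000UL

    static pthread_spinlock_t lock;     /* single global lock */
    static volatile unsigned long counter;

    static void *worker(void *unused)
    {
        (void)unused;
        for (unsigned long i = 0; i < OPS_PER_THREAD; i++) {
            pthread_spin_lock(&lock);   /* every thread serializes here */
            counter++;
            pthread_spin_unlock(&lock);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = argc > 1 ? atoi(argv[1]) : 8;
        pthread_t *t = calloc(nthreads, sizeof(*t));
        struct timespec t0, t1;

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < nthreads; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(t[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d threads: %.2fs (%.0f ns/op)\n", nthreads, secs,
               secs * 1e9 / (nthreads * OPS_PER_THREAD));
        free(t);
        return 0;
    }

Sweeping the thread count on different instance types makes the cache-architecture differences visible without involving containers at all.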
Software Improvements
Collaboration with the upstream containerd team produced two mitigation strategies:
Adopt the newer kernel mount API fsconfig() with lowerdir+ support, passing the ID‑mapped lower directory as a file descriptor instead of a path, thereby avoiding the costly move_mount() system call.
Mount all layers under a shared parent directory, reducing the number of mount operations from O(n) to O(1), where n is the number of image layers.
The second approach was chosen because it works on existing kernels and immediately removes the mount‑related hot path from the containerd flame graph (Fig 11).
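For context, here is a sketch of what the first, fsconfig()-based strategy looks like at the syscall level. This illustrates the API shape rather than containerd's implementation; it assumes Linux 6.8+, where overlayfs accepts lower layers as file descriptors via FSCONFIG_SET_FD, and the helper name and abbreviated error handling are hypothetical:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/mount.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Hypothetical helper: build the overlayfs root directly from
     * detached, ID-mapped layer fds, with a single final attach. */
    static int overlay_from_fds(const int *layer_fds, int nlayers,
                                const char *rootfs)
    {
        int fs_fd = syscall(SYS_fsopen, "overlay", FSOPEN_CLOEXEC);
        if (fs_fd < 0)
            return -1;

        /* Each fd comes from open_tree()+mount_setattr() and never
         * appears in the host mount table, so the per-layer
         * move_mount()/umount() pair disappears entirely. */
        for (int i = 0; i < nlayers; i++)
            syscall(SYS_fsconfig, fs_fd, FSCONFIG_SET_FD, "lowerdir+",
                    NULL, layer_fds[i]);

        syscall(SYS_fsconfig, fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

        /* One mount-table update per container instead of per layer. */
        int mnt_fd = syscall(SYS_fsmount, fs_fd, FSMOUNT_CLOEXEC, 0);
        int ret = syscall(SYS_move_mount, mnt_fd, "", AT_FDCWD, rootfs,
                          MOVE_MOUNT_F_EMPTY_PATH);
        close(mnt_fd);
        close(fs_fd);
        return ret;
    }

The shared-parent-directory approach that Netflix shipped achieves a comparable reduction in mount-table traffic on existing kernels, which is why it was preferred despite the cleaner fd-based API above.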
Conclusion
The migration to a modern kubelet + containerd stack exposed a deep coupling between container startup patterns and CPU hardware characteristics. While user‑namespace isolation improves security, it also amplifies mount‑related lock contention on multi‑NUMA, hyper‑threaded CPUs. By selecting hardware with distributed cache architectures, disabling hyper‑threading where appropriate, and simplifying the mount workflow, Netflix achieved stable, high‑throughput container scaling without relying on any specific CPU model.
