Solving CPU Performance Layering in Heterogeneous Data Centers: A Practical Guide
This article explains why heterogeneous servers cause CPU performance layering, describes how to detect the issue using metrics such as NUMA hit/miss rates, cache miss ratios and frequency states, and provides step‑by‑step remediation techniques—including NUMA binding, cache isolation, recompilation and frequency locking—to improve resource pooling efficiency in modern data centers.
1. Background
Since Moore's law was proposed in 1965, semiconductor innovation has driven increasingly diverse data-center (IDC) architectures (Intel, AMD, ARM). Early deployments managed small clusters separately, but as hardware generations and architectures multiplied, the resulting silos, with no way to share compute across clusters, raised costs and lowered utilization. Resource pooling via virtualization, containers, and intelligent scheduling abstracts heterogeneous hardware into a unified pool, enabling global sharing of CPU, memory, and network resources, higher utilization, and elastic scaling.
2. Investigation Methodology
The CPU layering phenomenon appears as multiple distinct tiers of CPU utilization for the same workload, making capacity planning difficult. High-tier machines waste resources, while low-tier machines cannot meet their SLAs.
Root causes are divided into two categories:
Hardware differences (CPU, memory, disk, network).
Software differences (drivers, libraries, toolchains).
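A small inventory script can make both categories visible per host before deeper analysis. The sketch below assumes a Linux host; the exact fields collected are illustrative and should be extended to match your fleet-management needs:

```shell
#!/bin/sh
# Print a one-line hardware/software fingerprint for this host, useful
# for grouping machines into comparable tiers before benchmarking.
model=$(grep -m1 'model name' /proc/cpuinfo | cut -d: -f2- | sed 's/^ *//')
cores=$(nproc)
kernel=$(uname -r)
arch=$(uname -m)
echo "${arch} | ${model:-unknown} | ${cores} cores | kernel ${kernel}"
```

Running this across the fleet and sorting the output quickly reveals how many distinct hardware/software combinations are actually in service.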
Benchmarking the baseline compute capability of each machine type helps normalize the observed differences. Example benchmark values:
Machine                 Baseline Score
Xeon Silver 4110 x2     320
Xeon E5-2630 v4 x2      448
AMD EPYC 7502P x1       883
AMD EPYC 7402P x1       797
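One way to use these baselines is to normalize them into relative capacity weights, for example as input to scheduler capacity settings. A sketch using the table's values (the file path and machine labels are illustrative):

```shell
# Derive a relative capacity weight for each machine type by dividing
# its baseline score by the weakest type's score.
cat <<'EOF' > /tmp/baselines.txt
Xeon_Silver_4110_x2 320
Xeon_E5-2630_v4_x2 448
AMD_EPYC_7502P_x1 883
AMD_EPYC_7402P_x1 797
EOF
# First pass finds the minimum score, second pass prints the weights.
awk 'NR==FNR { if (min == "" || $2 < min) min = $2; next }
     { printf "%-20s weight=%.2f\n", $1, $2 / min }' \
    /tmp/baselines.txt /tmp/baselines.txt
```

For the values above this prints weights of 1.00, 1.40, 2.76, and 2.49 relative to the Silver 4110 pair.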
3. NUMA Issue Scenario
3.1 Problem Description
NUMA creates CPU layering mainly for memory-intensive workloads, because accesses to a remote node's memory incur higher latency (expressed as node distance). For a dual-node Intel Xeon E5-2630 v4, the intra-node distance is 10 while the inter-node distance is 21, so performance drops noticeably when a process migrates across nodes or its memory is allocated remotely.
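The penalty implied by those distance values can be estimated directly. A sketch using the 10 and 21 figures above (on a live host the distance matrix comes from numactl --hardware):

```shell
# Node distances are relative latency costs (10 = local baseline), so
# their ratio gives a rough upper bound on the slowdown of a fully
# remote access pattern.
local_dist=10
remote_dist=21
awk -v l="$local_dist" -v r="$remote_dist" \
    'BEGIN { printf "remote access costs ~%.1fx local\n", r / l }'
# -> remote access costs ~2.1x local
```

Real workloads mix local and remote accesses, so the observed slowdown is usually smaller but still enough to produce a visible utilization tier.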
3.2 Metric Feedback
Use numastat to view NUMA hit/miss statistics. The three most relevant metrics are numa_hit, numa_miss, and numa_foreign.
# numastat
                    node0        node1
numa_hit      12730959061  10270948448
numa_miss               0       571202
numa_foreign       571202            0
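The raw counters are easier to reason about as a per-node miss ratio. A sketch that post-processes a captured numastat sample (the file path is illustrative) so the arithmetic is reproducible:

```shell
# Turn raw NUMA counters into a miss ratio per node:
#   miss ratio = numa_miss / (numa_hit + numa_miss)
cat <<'EOF' > /tmp/numastat.txt
numa_hit 12730959061 10270948448
numa_miss 0 571202
numa_foreign 571202 0
EOF
awk '/^numa_hit/  { h0 = $2; h1 = $3 }
     /^numa_miss/ { m0 = $2; m1 = $3 }
     END {
         printf "node0 miss ratio: %.6f\n", m0 / (h0 + m0)
         printf "node1 miss ratio: %.6f\n", m1 / (h1 + m1)
     }' /tmp/numastat.txt
```

For the sample above, node1's ratio is roughly 5.6e-5, which is negligible; sustained ratios orders of magnitude higher indicate cross-node allocation pressure worth binding away.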
3.3 Solution Approach
Bind a process and its memory to a single NUMA node with numactl or cgroup cpusets, enable the kernel's automatic NUMA balancing, or spread allocations evenly across nodes with interleave mode.
# numactl --cpunodebind=0 --membind=0 ./your_program
# echo 1 | sudo tee /proc/sys/kernel/numa_balancing

4. Cache Issue Scenario
4.1 Problem Description
Different CPU cache sizes and hierarchies lead to different L3 cache miss rates for the same workload, which directly contributes to performance layering. For example, an Intel i7-6700 has an 8 MB L3 cache, while an AMD Ryzen 7 5800X provides 32 MB, allowing the latter to hold a 10 MB working set entirely in L3.
4.2 Metric Feedback
Measure L3 cache miss rate with perf or llcstat:
# perf stat -a -e LLC-load-misses,LLC-loads -- sleep 10
2,439,039 LLC-load-misses # 28.50% of all LLC accesses
8,558,008 LLC-loads
Alternatively, use llcstat for per-process details.
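The percentage perf prints is simply misses divided by loads; when aggregating raw counter logs collected across a fleet, the same ratio can be recomputed offline. A sketch over the sample counts above:

```shell
# Recompute the LLC miss rate from raw counter values, as when
# post-processing perf stat logs gathered from many hosts.
misses=2439039
loads=8558008
awk -v m="$misses" -v l="$loads" \
    'BEGIN { printf "LLC miss rate: %.2f%%\n", 100 * m / l }'
# -> LLC miss rate: 28.50%
```

Comparing this ratio across machine types for the same workload makes cache-capacity tiers visible even when absolute throughput numbers are noisy.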
4.3 Solution Approach
Re‑compile binaries for each CPU architecture, align data structures to cache‑line boundaries, and optionally isolate L3 cache via resctrl:
# cat /sys/fs/resctrl/schemata
L3:0=fffff;1=fffff
# mkdir /sys/fs/resctrl/mygroup
# echo "L3:0=0x0000f;" | tee /sys/fs/resctrl/mygroup/schemata
# echo <PID> | tee /sys/fs/resctrl/mygroup/tasks

5. Frequency Throttling Issue Scenario
5.1 Problem Description
Dynamic frequency scaling (Turbo Boost, AVX‑induced throttling) can lower CPU frequency under high load or heavy AVX usage, causing a noticeable drop in compute throughput and contributing to CPU layering.
5.2 Metric Feedback
Inspect P‑state and C‑state via /proc/cpuinfo, cpupower, or monitoring panels.
# cat /proc/cpuinfo | grep MHz
cpu MHz : 2397.279
...
# cpupower -c all frequency-info
hardware limits: 800 MHz - 3.80 GHz
available governors: performance powersave
current policy: 800 MHz - 3.80 GHz
current CPU frequency: 2.90 GHz

5.3 Solution Approach
Lock the frequency in the OS by setting the same min and max frequency for each CPU, or configure a fixed frequency in BIOS.
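Before locking, it can help to quantify how far core frequencies currently diverge under load. A rough sketch, fed a captured sample here so the output is reproducible (on a live host, pipe /proc/cpuinfo in instead):

```shell
# Summarize the min/max core frequency spread; a wide spread under a
# steady load suggests throttling or per-core governor drift.
cat <<'EOF' | awk -F: '/MHz/ {
    f = $2 + 0
    if (n++ == 0 || f < min) min = f
    if (f > max) max = f
} END { printf "min=%.0f MHz max=%.0f MHz spread=%.0f MHz\n", min, max, max - min }'
cpu MHz : 2397.279
cpu MHz : 3492.110
cpu MHz : 2201.500
EOF
```

A spread of more than a few hundred MHz across cores running identical work is a strong hint that fixing the frequency will flatten the tiers.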
# for cpu in /sys/devices/system/cpu/cpu*/cpufreq; do
echo "3500000" | sudo tee $cpu/scaling_min_freq
echo "3500000" | sudo tee $cpu/scaling_max_freq
done
# watch -n 1 "cat /proc/cpuinfo | grep MHz"

6. Summary and Outlook
Mitigating CPU layering in heterogeneous data‑center environments requires addressing NUMA latency, cache‑miss disparities, and frequency throttling. While these techniques improve resource‑pool efficiency and economic returns, the problem remains complex due to diverse hardware and limited tooling. Ongoing collaboration and data‑driven refinements will be essential to approach an optimal solution.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.