
Solving CPU Performance Layering in Heterogeneous Data Centers: A Practical Guide

This article explains why heterogeneous servers cause CPU performance layering, shows how to detect the issue using metrics such as NUMA hit/miss rates, cache miss ratios, and frequency states, and walks through step-by-step remediation techniques (NUMA binding, cache isolation, recompilation, and frequency locking) that improve resource pooling efficiency in modern data centers.

Bilibili Tech

1. Background

Since Moore's law was proposed in 1965, semiconductor innovation has driven increasingly diverse IDC architectures (Intel, AMD, ARM). Early on, small clusters were managed separately, but as hardware generations and architectures multiplied, resource silos and the inability to share compute across clusters raised costs and lowered utilization. Resource pooling via virtualization, containers, and intelligent scheduling abstracts heterogeneous hardware into a unified pool, enabling global sharing of CPU, memory, and network resources, higher utilization, and elastic scaling.

Resource pooling diagram

2. Investigation Methodology

The CPU layering phenomenon appears as multiple distinct tiers of CPU utilization for the same workload, which makes capacity planning difficult: high-tier machines waste resources, while low-tier machines cannot meet their SLAs.

CPU utilization layering

Root causes are divided into two categories:

Hardware differences (CPU, memory, disk, network).

Software differences (drivers, libraries, toolchains).

Benchmarking the baseline compute capability of each machine type helps normalize the observed differences. Example benchmark values (a sketch of how such scores can be collected follows the table):

Machine               Baseline Score
Xeon Silver 4110*2    320
Xeon E5-2630 v4*2     448
AMD EPYC 7502P*1      883
AMD EPYC 7402P*1      797
...
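One lightweight way to produce such scores is to run the same fixed CPU benchmark on every machine type. The sysbench invocation below is a minimal sketch of this approach; the tool choice and parameters are assumptions, not necessarily how the scores above were generated.

# sysbench cpu --threads=$(nproc) --time=30 run

Record the reported events-per-second figure for each machine type and normalize it against a reference host to obtain relative baseline scores.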

3. NUMA Issue Scenario

3.1 Problem Description

NUMA creates CPU layering mainly for memory‑intensive workloads because remote node memory accesses incur higher latency (Node Distance). For a dual‑node Intel Xeon E5‑2630 v4, intra‑node distance is 10 units while inter‑node distance is 21 units, leading to noticeable performance drops when a process migrates across nodes.

Intel Node Distance table
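The distance matrix for a given machine can be read directly with numactl; on this dual-node system the output (abbreviated) matches the 10/21 figures above:

# numactl --hardware
...
node distances:
node   0   1
  0:  10  21
  1:  21  10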

3.2 Metric Feedback

Use numastat to view NUMA hit/miss statistics. The three most relevant metrics are numa_hit (allocations served from the intended node), numa_miss (allocations that fell back to this node although another node was preferred), and numa_foreign (allocations intended for this node that were served elsewhere).

# numastat
          node0          node1
numa_hit   12730959061    10270948448
numa_miss  0              571202
numa_foreign 571202       0
...
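numastat can also break these counters down for a single workload, which helps attribute remote accesses to a specific process (the PID is a placeholder):

# numastat -p <PID>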

3.3 Solution Approach

Bind a process to a single NUMA node with numactl or cgroups, enable kernel NUMA balancing, or spread allocations evenly across nodes with interleave mode.

# numactl --cpunodebind=0 --membind=0 ./your_program
# echo 1 | sudo tee /proc/sys/kernel/numa_balancing
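A minimal sketch of the cgroup alternative, assuming cgroup v2 is mounted at /sys/fs/cgroup and that CPUs 0-9 belong to node 0 (both depend on your distribution and topology), with interleave allocation shown last for comparison:

# echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
# mkdir /sys/fs/cgroup/numa0
# echo 0-9 > /sys/fs/cgroup/numa0/cpuset.cpus   # CPUs of node 0
# echo 0 > /sys/fs/cgroup/numa0/cpuset.mems     # allocate memory from node 0 only
# echo <PID> > /sys/fs/cgroup/numa0/cgroup.procs
# numactl --interleave=all ./your_program       # spread pages evenly instead of binding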

4. Cache Issue Scenario

4.1 Problem Description

Different CPU cache sizes and architectures produce different L3 cache miss rates, which directly drive performance layering. For example, an Intel i7-6700 has an 8 MB L3 cache, while an AMD Ryzen 7 5800X provides 32 MB, allowing the latter to hold a 10 MB working set entirely in cache.

Cache latency differences
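The relevant cache geometry is easy to confirm per machine with standard tools:

# getconf LEVEL3_CACHE_SIZE
# lscpu | grep -i "L3"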

4.2 Metric Feedback

Measure L3 cache miss rate with perf or llcstat:

# perf stat -a -e LLC-load-misses,LLC-loads -- sleep 10
  2,439,039 LLC-load-misses   # 28.50% of all LLC accesses
  8,558,008 LLC-loads

Alternatively, llcstat from the BCC tool suite reports per-process cache hit/miss details.
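To attribute misses to one workload rather than the whole system, the same counters can be scoped to a process (the PID is a placeholder):

# perf stat -p <PID> -e LLC-load-misses,LLC-loads -- sleep 10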

4.3 Solution Approach

Re-compile binaries for each CPU architecture (a sketch follows the resctrl example below), align data structures to cache-line boundaries, and optionally isolate a slice of L3 cache via resctrl:

# mount -t resctrl resctrl /sys/fs/resctrl
# cat /sys/fs/resctrl/schemata
L3:0=fffff;1=fffff
# mkdir /sys/fs/resctrl/mygroup
# echo "L3:0=0x0000f" | tee /sys/fs/resctrl/mygroup/schemata
# echo <PID> | tee /sys/fs/resctrl/mygroup/tasks
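For the recompilation fix, a minimal sketch assuming GCC and a hypothetical source file app.c: -march=native tunes the instruction set and cache parameters to the build host, so binaries must be built (or selected) per machine type.

# gcc -O2 -march=native -o app app.c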

5. Frequency Throttling Issue Scenario

5.1 Problem Description

Dynamic frequency scaling (Turbo Boost, AVX‑induced throttling) can lower CPU frequency under high load or heavy AVX usage, causing a noticeable drop in compute throughput and contributing to CPU layering.

5.2 Metric Feedback

Inspect P‑state and C‑state via /proc/cpuinfo, cpupower, or monitoring panels.

# cat /proc/cpuinfo | grep MHz
cpu MHz : 2397.279
...
# cpupower -c all frequency-info
hardware limits: 800 MHz - 3.80 GHz
available governors: performance powersave
current policy: 800 MHz - 3.80 GHz
current CPU frequency: 2.90 GHz
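Where available, turbostat (shipped in the kernel's linux-tools packages) reports per-core effective frequency over time, which makes throttling under sustained or AVX-heavy load easy to spot:

# turbostat --interval 5 --quiet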

5.3 Solution Approach

Lock the frequency in the OS by setting the same min and max frequency for each CPU, or configure a fixed frequency in BIOS.

# for cpu in /sys/devices/system/cpu/cpu*/cpufreq; do
    echo 3500000 | sudo tee $cpu/scaling_min_freq
    echo 3500000 | sudo tee $cpu/scaling_max_freq
  done
# watch -n 1 "cat /proc/cpuinfo | grep MHz"
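Where the cpupower utility is installed, the same lock can be applied in one command (3.5 GHz is an example value within this machine's hardware limits):

# cpupower -c all frequency-set -d 3500MHz -u 3500MHz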

6. Summary and Outlook

Mitigating CPU layering in heterogeneous data‑center environments requires addressing NUMA latency, cache‑miss disparities, and frequency throttling. While these techniques improve resource‑pool efficiency and economic returns, the problem remains complex due to diverse hardware and limited tooling. Ongoing collaboration and data‑driven refinements will be essential to approach an optimal solution.
