
Why NUMA Slows Multithreaded Apps and How to Optimize It

This article explains the NUMA architecture, the overheads it imposes on multithreaded programs (remote memory access, cache synchronization, context and mode switches, interrupt handling, TLB misses, and memory copies), and optimization techniques such as NUMA and CPU affinity, IRQ tuning, and large pages.

Introduction

NOTE: In this article, "thread" refers to a kernel thread.

NUMA Architecture

NUMA (Non-Uniform Memory Access) partitions CPUs and main memory into nodes. Each node accesses its own memory locally and other nodes' memory remotely, which removes the bottleneck of every CPU contending for a single shared memory bus.

Each node has roughly equal resources: local memory is reached over the node's own bus, remote memory over the interconnect shared between nodes. Local and remote access latencies therefore differ, and unless thread and memory placement is controlled, remote accesses happen frequently.

NUMA gives servers better scalability than a single shared memory bus, but because memory is not fully isolated between nodes it still only scales to a few hundred CPUs/cores.
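
A quick way to see how a host is divided into nodes is numactl's hardware summary (assuming the numactl package is installed): it lists each node's CPUs and memory size plus a node-distance matrix that reflects the relative cost of remote access.

$ numactl --hardware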

Basic Object Concepts

Node: a NUMA node, containing one or more sockets.

Socket: a physical processor package.

Core: a physical processor core within a socket.

Hyper-Thread: a logical processor exposed by a core; usually two per core.

Processor: the logical processor object seen by the OS.

Siblings: the logical processors that share the same physical core.

Example topology: a server with 2 nodes, 1 socket per node, 6 cores per socket, and 2 threads per core, for 24 logical processors in total.

Viewing Host NUMA Topology

#!/usr/bin/env python
# SPDX-License-Identifier: BSD-3-Clause
# Copyright(c) 2010-2014 Intel Corporation
# Copyright(c) 2017 Cavium, Inc. All rights reserved.

from __future__ import print_function
import sys
try:
    xrange  # Python 2
except NameError:
    xrange = range  # Python 3

# Walk sysfs and build core_map: (socket, core) -> list of logical CPU ids.
sockets = []
cores = []
core_map = {}
base_path = "/sys/devices/system/cpu"
# kernel_max is the highest CPU index the kernel can support on this host.
fd = open("{}/kernel_max".format(base_path))
max_cpus = int(fd.read())
fd.close()
for cpu in xrange(max_cpus + 1):
    try:
        fd = open("{}/cpu{}/topology/core_id".format(base_path, cpu))
    except IOError:
        # This CPU index is not populated on this host; skip it.
        continue
    except:
        # Any other error: stop scanning.
        break
    core = int(fd.read())
    fd.close()
    fd = open("{}/cpu{}/topology/physical_package_id".format(base_path, cpu))
    socket = int(fd.read())
    fd.close()
    if core not in cores:
        cores.append(core)
    if socket not in sockets:
        sockets.append(socket)
    key = (socket, core)
    if key not in core_map:
        core_map[key] = []
    core_map[key].append(cpu)

print("=" * (47 + len(base_path)))
print("Core and Socket Information (as reported by '{}')".format(base_path))
print("=" * (47 + len(base_path)))
print("cores = ", cores)
print("sockets = ", sockets)
print("")

max_processor_len = len(str(len(cores) * len(sockets) * 2 - 1))
max_thread_count = len(list(core_map.values())[0])
max_core_map_len = (max_processor_len * max_thread_count)  \
                  + len(", ") * (max_thread_count - 1)  \
                  + len('[]') + len('Socket ')
max_core_id_len = len(str(max(cores)))

output = " ".ljust(max_core_id_len + len('Core '))
for s in sockets:
    output += " Socket %s" % str(s).ljust(max_core_map_len - len('Socket '))
print(output)

output = " ".ljust(max_core_id_len + len('Core '))
for s in sockets:
    output += " --------".ljust(max_core_map_len)
    output += " "
print(output)

for c in cores:
    output = "Core %s" % str(c).ljust(max_core_id_len)
    for s in sockets:
        if (s,c) in core_map:
            output += " " + str(core_map[(s, c)]).ljust(max_core_map_len)
        else:
            output += " " * (max_core_map_len + 1)
    print(output)

The sample output below comes from a host with hyper-threading disabled: two sockets, six cores per socket, and one logical processor per core.

$ python cpu_topo.py
======================================================================
Core and Socket Information (as reported by '/sys/devices/system/cpu')
======================================================================

cores =  [0, 1, 2, 3, 4, 5]
sockets =  [0, 1]

       Socket 0    Socket 1
       --------    --------
Core 0 [0]         [6]
Core 1 [1]         [7]
Core 2 [2]         [8]
Core 3 [3]         [9]
Core 4 [4]         [10]
Core 5 [5]         [11]

Multithreaded Performance Overheads in NUMA

1. Remote Memory Access Cost

Local and remote memory have different access times. Two CPU binding strategies are available: cpu-node-bind (constrain threads to the CPUs of specific NUMA nodes) and phys-cpu-bind (constrain threads to specific CPU cores). Four memory allocation policies are available: local-alloc, preferred, mem-bind, and interleave.
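
As a rough sketch, these strategies map onto numactl options as follows (the node and core numbers are only illustrative, and ./app stands in for your program):

$ numactl --cpunodebind=0 --localalloc ./app     # cpu-node-bind + local-alloc
$ numactl --physcpubind=0-5 --membind=0 ./app    # phys-cpu-bind + mem-bind
$ numactl --preferred=0 ./app                    # prefer node 0, allow fallback
$ numactl --interleave=all ./app                 # spread pages across all nodes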

2. Cache Synchronization Across Cores

NUMA Domain Scheduler balances load across cores, but moving a process’s threads across cores can invalidate caches and degrade performance.

Cache visibility : data cached in one core’s L1/L2 is invisible to another core until written back to main memory.

Cache coherence : ensuring shared data stays consistent requires write‑back and invalidation traffic.

Cache invalidation : thread migration forces write‑back and may incur remote memory access.

In other words, the load balancing performed by the NUMA Domain Scheduler can work against the performance of highly concurrent workloads.

3. Context‑Switch Overhead

Three context types exist: User Level Context (program counter, registers, stack), Register Context (general registers, PC, status registers), and Kernel Level Context (task_struct, registers, virtual address space). User‑level switches are lightweight; kernel‑level switches involve larger state and cross‑core synchronization, leading to higher cost.
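
To get a feel for how often a given process is being switched, the kernel exposes per-process counters (the PID below is a placeholder):

$ grep ctxt /proc/<pid>/status    # voluntary_ctxt_switches / nonvoluntary_ctxt_switches
$ vmstat 1                        # the "cs" column is system-wide context switches per second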

4. CPU Mode Switch Overhead

Mode switches (system calls, soft interrupts, exception handling, kernel thread switches) also consume cycles, though less than full context switches.
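
System calls are the most common trigger for mode switches. A rough way to see which calls a process makes, and how often, is strace's summary mode (attaching to a placeholder PID; note that tracing itself adds overhead):

$ strace -c -p <pid>    # per-syscall counts and time; Ctrl-C prints the summary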

5. Interrupt Handling Cost

Hardware interrupts require saving CPU registers, executing the ISR, and restoring state, costing at least 300 CPU cycles and generating cache‑coherence traffic across cores.
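
Per-CPU interrupt counts are visible in /proc/interrupts, which makes it easy to spot a single core absorbing most of a device's interrupts:

$ cat /proc/interrupts    # one row per IRQ, one column per CPU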

6. TLB Miss Cost

Frequent kernel‑thread switches can evict TLB entries, causing many misses.
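
perf can count TLB misses directly; the event names below are the generic ones and may not be available on every CPU model (the PID is a placeholder):

$ perf stat -e dTLB-load-misses,iTLB-load-misses -p <pid> -- sleep 10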

7. Memory Copy Cost

In network processing, copying a packet from the NIC driver into the kernel stack and then again into user space can consume about 57% of total processing time.

Performance Optimizations: Prefer Multicore Programming Over Multithreading

Bind kernel threads to specific NUMA nodes or cores to avoid costly scheduling.

NUMA Affinity

Use numastat to view memory distribution and numactl to bind processes or threads to nodes or cores.
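
For example, numastat's counters show whether allocations are landing on the intended node, and numactl can pin a process and its memory to one node (the node number, PID, and program are illustrative):

$ numastat                # numa_hit / numa_miss / numa_foreign per node
$ numastat -p <pid>       # per-process memory breakdown by node
$ numactl --cpunodebind=0 --membind=0 ./app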

CPU Affinity

CPU affinity keeps a thread on the same core, improving cache hit rate and reducing remote memory accesses. Soft affinity is the default scheduler behavior; hard affinity can be set with taskset and modifies the cpus_allowed mask in the task_struct.
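
A minimal sketch of taskset usage (core numbers and PID are illustrative):

$ taskset -cp <pid>       # show the CPU list a running process is allowed on
$ taskset -c 2 ./app      # launch a program pinned to core 2
$ taskset -cp 2 <pid>     # move an existing process onto core 2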

IRQ Affinity

Tools like irqbalance distribute interrupts across CPUs (performance mode) or concentrate them on one CPU (power‑save mode). Manual IRQ affinity can be set via smp_affinity after disabling irqbalance.
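
For example, after stopping irqbalance a specific IRQ can be steered to one CPU by writing a bitmask (or a CPU list) under /proc/irq; the device name, IRQ number, and CPU below are illustrative, and the writes need root:

$ systemctl stop irqbalance
$ grep eth0 /proc/interrupts                    # find the NIC's IRQ number
$ echo 4 > /proc/irq/<irq>/smp_affinity         # bitmask: 0x4 = CPU 2
$ echo 2 > /proc/irq/<irq>/smp_affinity_list    # the same thing as a CPU list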

Use Large Pages

Large pages reduce TLB pressure; see the referenced article on Linux large‑page implementation.
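
As a minimal sketch, static huge pages can be reserved via procfs and verified in /proc/meminfo (the page count is illustrative, the write needs root, and 2 MiB is the default huge page size on x86_64):

$ echo 1024 > /proc/sys/vm/nr_hugepages    # reserve 1024 huge pages
$ grep Huge /proc/meminfo                  # HugePages_Total / HugePages_Free / Hugepagesize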
