How Linux Implements Thread Stacks: Inside NPTL and glibc
This article explains why Linux gives each thread its own stack, describes the historical evolution from LinuxThreads to NPTL, and walks through the glibc implementation that allocates, sizes, and releases thread‑stack memory using mmap and related kernel calls.
Hello, I'm Fei! After a break to study GPU topics, I'm back to share the low‑level principles of Linux memory, focusing on thread stacks.
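A process's main stack is set up when the program starts. Threads, however, run concurrently inside one address space, so each thread needs its own independent stack to keep call frames and local variables from colliding. Linux therefore implements thread stacks separately from the process stack.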
1. The NPTL Story
Linux originally had no concept of threads; the kernel’s clone system call only creates lightweight processes sharing an address space. Early user‑space projects like LinuxThreads attempted to emulate threads but lacked POSIX compliance and had many shortcomings.
Two major efforts followed: IBM’s NGPT and Red Hat’s NPTL. NGPT was abandoned, leaving NPTL as the standard. Modern pthread_create calls use NPTL, and the source files you’ll see are under the nptl directory.
2. Linux Thread Implementation
Thread creation involves two parts:
User‑space glibc library where pthread_create is implemented.
Kernel‑space clone system call that creates a lightweight process sharing the address space.
The thread stack lives in the user‑space part. The call chain is
pthread_create → __pthread_create_2_1 → ALLOCATE_STACK → create_thread (clone).
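The effect of this call chain can be observed from application code. The following is a minimal sketch, not glibc code: the helper names record_local_address and stack_distance_between_two_threads are made up for illustration. Two threads each record the address of one of their own local variables, and because every thread gets its own stack, the two addresses should differ.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helper: stores the address of a local variable, which
 * lives on the calling thread's own stack. */
static void *record_local_address(void *out)
{
    int local = 0;
    *(uintptr_t *)out = (uintptr_t)&local;
    return NULL;
}

/* Starts two threads before joining either one, then returns the distance
 * in bytes between their local variables. Returns 0 if a thread could not
 * be created. */
static uintptr_t stack_distance_between_two_threads(void)
{
    pthread_t t1, t2;
    uintptr_t a1 = 0, a2 = 0;

    if (pthread_create(&t1, NULL, record_local_address, &a1) != 0)
        return 0;
    if (pthread_create(&t2, NULL, record_local_address, &a2) != 0) {
        pthread_join(t1, NULL);
        return 0;
    }

    int rc1 = pthread_join(t1, NULL);
    int rc2 = pthread_join(t2, NULL);
    assert(rc1 == 0 && rc2 == 0);
    (void)rc1;
    (void)rc2;

    return a1 > a2 ? a1 - a2 : a2 - a1;
}
```

Calling stack_distance_between_two_threads() from main and printing the result should give a nonzero distance, since neither stack is released before both threads have been created. Older glibc versions need the program to be linked with -pthread.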
2.1 struct pthread
The core data structure is struct pthread, which stores the thread ID, a pointer to the stack memory (stackblock), and its size (stackblock_size).
struct pthread {
pid_t tid;
...
void *stackblock;
size_t stackblock_size;
};
2.2 Determining Stack Size
ALLOCATE_STACK eventually calls allocate_stack, which takes the stack size from attr->stacksize or falls back to __default_stacksize. The default is computed in init.c from the stack ulimit (RLIMIT_STACK): when the limit cannot be read or is unlimited, glibc uses ARCH_STACK_DEFAULT_SIZE (architecture-specific; 2 MiB on x86-64), and when the limit is smaller than PTHREAD_STACK_MIN (16 KiB), glibc raises it to that minimum.
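Both paths are visible through the public attribute API. The sketch below assumes glibc's behavior of reporting the process-wide default for a freshly initialized attribute object; the two function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Returns the stack size carried by a default-initialized attribute
 * object. With glibc this is expected to be the process-wide default. */
static size_t default_thread_stack_size(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    int rc = pthread_attr_init(&attr);
    assert(rc == 0);
    (void)rc;

    pthread_attr_getstacksize(&attr, &size);
    pthread_attr_destroy(&attr);
    return size;
}

/* Requests an explicit size (the attr->stacksize path) and reads it back.
 * Returns 0 if the library rejected the request, for example because it
 * is below PTHREAD_STACK_MIN. */
static size_t requested_thread_stack_size(size_t wanted)
{
    pthread_attr_t attr;
    size_t size = 0;

    int rc = pthread_attr_init(&attr);
    assert(rc == 0);
    (void)rc;

    if (pthread_attr_setstacksize(&attr, wanted) == 0)
        pthread_attr_getstacksize(&attr, &size);
    pthread_attr_destroy(&attr);
    return size;
}
```

On a typical system, default_thread_stack_size() should track the value shown by ulimit -s, while requested_thread_stack_size(1024 * 1024) should return the requested 1 MiB. The glibc source for the same decision follows.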
static int allocate_stack(const struct pthread_attr *attr, struct pthread **pdp, ...){
size = attr->stacksize ?: __default_stacksize;
...
}
2.3 Allocating the User Stack
If no cached stack is available, glibc uses mmap to reserve an anonymous memory region, aligns it, and places the struct pthread at the high end. The allocated region becomes the thread’s stack.
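The same idea can be tried by hand. This is a sketch under stated assumptions, not what glibc does internally: it maps an anonymous region itself, hands it to the thread through pthread_attr_setstack, and checks that the thread's local variable really lands inside the mapping. DEMO_STACK_SIZE, both function names, and the use of MAP_STACK in place of ARCH_MAP_FLAGS are choices made for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/mman.h>

#define DEMO_STACK_SIZE (1024 * 1024)   /* hypothetical size: 1 MiB */

/* Hypothetical helper: stores the address of a local variable. */
static void *record_local_address(void *out)
{
    int local = 0;
    *(uintptr_t *)out = (uintptr_t)&local;
    return NULL;
}

/* Returns 1 if the thread ran on the mmap'ed block, 0 if it did not,
 * and -1 if setup failed. */
static int thread_runs_on_mmaped_stack(void)
{
    void *mem = mmap(NULL, DEMO_STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    pthread_attr_t attr;
    pthread_t tid;
    uintptr_t addr = 0;
    int result = -1;

    if (pthread_attr_init(&attr) == 0) {
        /* mem is the lowest address of the block; the stack grows down
         * from the high end on common architectures. */
        if (pthread_attr_setstack(&attr, mem, DEMO_STACK_SIZE) == 0 &&
            pthread_create(&tid, &attr, record_local_address, &addr) == 0) {
            int rc = pthread_join(tid, NULL);
            assert(rc == 0);
            (void)rc;

            uintptr_t lo = (uintptr_t)mem;
            result = (addr >= lo && addr < lo + DEMO_STACK_SIZE);
        }
        pthread_attr_destroy(&attr);
    }

    /* The caller owns a user-supplied stack and must free it. */
    munmap(mem, DEMO_STACK_SIZE);
    return result;
}
```

A return value of 1 confirms that a plain anonymous mapping is all a thread stack needs. The corresponding lines in glibc look like this: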
mem = mmap(NULL, size, prot, MAP_PRIVATE|MAP_ANONYMOUS|ARCH_MAP_FLAGS, -1, 0);
pd = (struct pthread *)(((uintptr_t)mem + size - coloring - __static_tls_size) & ~__static_tls_align_m1) - TLS_PRE_TCB_SIZE;
pd->stackblock = mem;
pd->stackblock_size = size;
2.4 Releasing the Stack
When a thread exits, glibc removes its stack from the stack_used list and places it into a cache (stack_cache) for future reuse. Stacks that are not kept in the cache are eventually freed with munmap.
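The cache is internal to glibc, but its effect can sometimes be seen from outside. The sketch below uses hypothetical helper names and runs two threads one after the other, comparing where their local variables lived. Reuse is an implementation detail, so the result is an observation rather than a guarantee.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helper: stores the address of a local variable. */
static void *record_local_address(void *out)
{
    int local = 0;
    *(uintptr_t *)out = (uintptr_t)&local;
    return NULL;
}

/* Runs one thread to completion and returns the address of its local
 * variable, or 0 if the thread could not be created. */
static uintptr_t run_thread_and_get_stack_address(void)
{
    pthread_t tid;
    uintptr_t addr = 0;

    if (pthread_create(&tid, NULL, record_local_address, &addr) != 0)
        return 0;

    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    return addr;
}

/* Returns 1 if two back-to-back threads saw the same stack address,
 * which suggests the second thread received the first one's stack. */
static int second_thread_reused_stack(void)
{
    uintptr_t first = run_thread_and_get_stack_address();
    uintptr_t second = run_thread_and_get_stack_address();
    return first != 0 && first == second;
}
```

Because the first thread is joined before the second is created, its stack is free to be handed out again, so second_thread_reused_stack() will often return 1 with glibc. A return value of 0 is equally valid.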
3. Summary
Each thread requires an independent stack. glibc's NPTL library allocates it in user space with mmap, separately from the process's default stack, and tracks it in struct pthread. The kernel side of the thread, its task_struct, is created through the clone system call. The following diagram illustrates the relationship between process and thread stacks.
With the stack memory in place, local variables in your code have a proper storage area. Note that this article covers only the virtual address space. Physical memory is allocated on demand: the first access to a page triggers a page fault, and the kernel then takes a physical page from the buddy system.
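On-demand allocation can be made visible with mincore, which reports which pages of a mapping are resident in physical memory. This is a rough sketch with hypothetical function names. It assumes a Linux kernel where untouched anonymous pages are reported as non-resident; sandboxes and transparent huge pages can change the numbers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Counts the pages in [mem, mem + npages * page) that mincore reports
 * as resident. Returns -1 on error. */
static long resident_pages(void *mem, size_t npages, size_t page)
{
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    long count = -1;
    if (mincore(mem, npages * page, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;   /* bit 0 set means resident */
    }
    free(vec);
    return count;
}

/* Maps npages anonymous pages, writes to the first `touched` of them,
 * and returns how many pages are resident afterwards (-1 on error). */
static long resident_pages_after_touching(size_t npages, size_t touched)
{
    assert(npages > 0 && touched <= npages);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *mem = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    for (size_t i = 0; i < touched; i++)
        mem[i * page] = 1;         /* first write faults the page in */

    long count = resident_pages(mem, npages, page);
    munmap(mem, npages * page);
    return count;
}
```

Comparing resident_pages_after_touching(16, 0) with resident_pages_after_touching(16, 1) shows the difference a single write makes. A thread stack behaves the same way: the full mapping exists from the start, but only the pages the thread actually touches consume physical memory.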
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.
