
Load Balancing Between Worker Threads in Multithreaded Server Programs

The article explains how kernel‑level load balancing works, why naïve thread‑pool wake‑up can cause CPU imbalance, and proposes using thread affinity, priority layers and a custom layered condition variable to achieve better performance in multithreaded server applications on Linux.

Art of Distributed System Architecture Design

Load Balance Overview

Typical load‑balance discussions focus on distributing traffic among service replicas or the kernel’s own load‑balancing across CPUs. The former is a simple entry‑point traffic split, while the latter continuously migrates running processes to keep each CPU fully utilized.

This article examines a different kind of load balance: the distribution of work among worker threads (or processes) inside a multithreaded server. A common server model has a receiver thread that accepts requests and a thread‑pool of workers that process them, communicating via pthread_cond and a request queue. Normally the receiver pushes a request onto the queue and signals the condition variable, leaving the kernel to decide which waiting worker to wake.
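As a sketch of this hand-off (the names, such as run_server, are illustrative rather than taken from the article's program), the receiver pushes a request onto a mutex-guarded queue and signals a condition variable, leaving the choice of which waiting worker to wake entirely to the kernel:

```cpp
#include <pthread.h>
#include <queue>
#include <vector>
#include <cassert>

// Shared request queue guarded by a mutex + condition variable.
// The receiver pushes requests; the kernel decides which waiting
// worker wakes on pthread_cond_signal.
struct RequestQueue {
    std::queue<int> q;
    pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    bool done = false;
};

static void* worker(void* arg) {
    RequestQueue* rq = static_cast<RequestQueue*>(arg);
    long sum = 0;
    for (;;) {
        pthread_mutex_lock(&rq->mtx);
        while (rq->q.empty() && !rq->done)
            pthread_cond_wait(&rq->cond, &rq->mtx);  // kernel picks which waiter runs
        if (rq->q.empty() && rq->done) { pthread_mutex_unlock(&rq->mtx); break; }
        int req = rq->q.front(); rq->q.pop();
        pthread_mutex_unlock(&rq->mtx);
        sum += req;                                  // "process" the request
    }
    return reinterpret_cast<void*>(sum);
}

// Receiver: enqueue n requests, signal once per request, then drain workers.
long run_server(int n, int nworkers) {
    RequestQueue rq;
    std::vector<pthread_t> tids(nworkers);
    for (auto& t : tids) pthread_create(&t, nullptr, worker, &rq);
    for (int i = 1; i <= n; ++i) {
        pthread_mutex_lock(&rq.mtx);
        rq.q.push(i);
        pthread_cond_signal(&rq.cond);               // wake "some" worker
        pthread_mutex_unlock(&rq.mtx);
    }
    pthread_mutex_lock(&rq.mtx);
    rq.done = true;
    pthread_cond_broadcast(&rq.cond);                // let everyone drain and exit
    pthread_mutex_unlock(&rq.mtx);
    long total = 0;
    for (auto& t : tids) {
        void* r;
        pthread_join(t, &r);
        total += reinterpret_cast<long>(r);
    }
    return total;                                    // sum of 1..n if nothing was lost
}
```

Note that pthread_cond_signal gives the application no say in which waiter wakes; that opacity is exactly what the rest of the article works around.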

Kernel Load Balance Overview

The kernel’s load‑balancing goal is to spread RUNNING processes evenly across all scheduling domains (socket, core, hyper‑thread). For example, with four runnable processes on a system of 2 sockets, each with 2 cores of 2 hyper‑threads (8 logical CPUs), true balance places two processes on each socket and one on each core; a placement that merely uses any four of the eight logical CPUs could load both hyper‑threads of one core while another core sits idle.

This strict balance reduces cache contention and pipeline competition, improving fairness and performance. The kernel performs this balancing asynchronously, not in real time, to avoid excessive overhead.

Server Load‑Balance Considerations

Given the kernel’s balancing, the receiver can improve performance by limiting the number of workers to roughly the number of logical CPUs and by using thread affinity to pin workers to specific CPUs. Over‑provisioning workers (e.g., 80 workers on an 8‑CPU machine) leads to random wake‑ups and severe imbalance, as illustrated by a probability calculation showing only a 0.34% chance that eight simultaneous requests land on eight different CPUs.
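The 0.34% figure can be reproduced under one plausible model (an assumption on my part, since the article's derivation is not shown here): 80 workers spread 10 per CPU across 8 CPUs, with the 8 woken workers chosen uniformly at random. The favourable picks are 10^8 (one worker from each CPU) out of C(80,8) total picks:

```cpp
#include <cmath>
#include <cassert>

// Probability that 8 randomly chosen workers (out of 80, 10 per CPU on
// 8 CPUs) all land on distinct CPUs: pick one of the 10 workers on each
// of the 8 CPUs (10^8 favourable picks) out of C(80,8) total picks.
double prob_all_distinct() {
    double favourable = std::pow(10.0, 8);     // 10^8 ways, one worker per CPU
    double total = 1.0;                        // C(80,8) via the product form
    for (int i = 0; i < 8; ++i)
        total *= (80.0 - i) / (i + 1.0);       // (80/1)*(79/2)*...*(73/8)
    return favourable / total;                 // roughly 0.0034, i.e. about 0.34%
}
```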

Better strategies include using a LIFO stack instead of a FIFO queue, or more effectively, fixing the worker count to the CPU count and employing sched_affinity so each worker stays on a designated CPU.
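A minimal sketch of such pinning, using the GNU pthread_setaffinity_np extension (the helper name is illustrative; the article's own program wraps the same idea in its force_cpu function):

```cpp
#include <pthread.h>
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET (GNU extensions)
#include <cassert>

// Pin the calling thread to a single logical CPU, the same idea the
// article applies to each worker so the kernel never migrates it.
// Returns the CPU the thread is running on afterwards, or -1 on error.
int pin_self_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    // sched_setaffinity migrates the thread before returning, so this
    // should now report `cpu`.
    return sched_getcpu();
}
```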

To further improve balance, the receiver should know each worker’s CPU placement and assign priorities (e.g., first hyper‑thread of each core gets higher priority). This can be implemented with a layered futex, where each layer represents a priority level, allowing the receiver to wake workers on different cores preferentially.
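A simplified two-layer sketch of this idea, built on pthread condition variables rather than raw futexes (all names are illustrative, and a production version would also guard against missed wake-ups with a pending-work count):

```cpp
#include <pthread.h>
#include <cassert>

// Two-layer wake policy: layer 0 holds high-priority waiters (e.g. the
// first hyper-thread of each core), layer 1 the rest. wake() prefers
// layer 0 and only falls through when layer 0 has no waiters.
// Given waiter counts per layer, choose which layer to signal
// (-1 if nobody is waiting anywhere).
int pick_layer(int waiters0, int waiters1) {
    if (waiters0 > 0) return 0;
    if (waiters1 > 0) return 1;
    return -1;
}

class LayeredCond {
public:
    LayeredCond() {
        pthread_mutex_init(&mtx_, nullptr);
        for (int l = 0; l < 2; ++l) {
            pthread_cond_init(&cond_[l], nullptr);
            waiters_[l] = 0;
        }
    }
    void wait(int layer) {               // worker declares its priority layer
        pthread_mutex_lock(&mtx_);
        ++waiters_[layer];
        pthread_cond_wait(&cond_[layer], &mtx_);
        --waiters_[layer];
        pthread_mutex_unlock(&mtx_);
    }
    void wake() {                        // receiver wakes the best layer
        pthread_mutex_lock(&mtx_);
        int l = pick_layer(waiters_[0], waiters_[1]);
        if (l >= 0) pthread_cond_signal(&cond_[l]);
        pthread_mutex_unlock(&mtx_);
    }
private:
    pthread_mutex_t mtx_;
    pthread_cond_t  cond_[2];
    int             waiters_[2];
};
```

The article's LayeredCond achieves the same preference ordering with FUTEX_WAIT_BITSET/FUTEX_WAKE_BITSET, which avoids one condition variable per layer at the cost of talking to the futex syscall directly.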

Example Program

A simple producer‑consumer test program demonstrates the concepts. It creates a producer thread and multiple consumer (worker) threads, passing tasks via condition variables and queues. Tasks are either CPU‑intensive calculations or memory‑intensive mmap operations, and the program measures total execution time under various configurations.

Key command‑line options:

-j selects job type: shm (memory‑mapped file) or calc (arithmetic).

-t sets the number of worker threads.

-o sets task load per worker.

-c sets how many tasks each worker processes.

-a enables CPU affinity; the value determines the stride of CPU IDs assigned to workers.

-l enables the layered condition variable with two priority layers.

Experimental results show that using too many workers degrades performance, while appropriate affinity and layered conditions can improve throughput, especially for cache‑friendly shm jobs. However, for pure CPU‑bound calc jobs, improper affinity may halve performance due to hyper‑thread contention.

Source Code

// Header names were lost in extraction (angle brackets stripped);
// a plausible reconstruction for what this program uses:
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cassert>
#include <vector>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#define CPUS    24
#define FUTEX_WAIT_BITSET   9
#define FUTEX_WAKE_BITSET   10

struct Job { long _input; long _output; };

class JobRunner { public: virtual void run(Job* job) = 0; };

class ShmJobRunner : public JobRunner { /* ... */ };
class CalcJobRunner : public JobRunner { /* ... */ };

class JobRunnerCreator { public: static JobRunner* create(const char* name, const char* filepath, size_t filelength) { /* ... */ } };

class Cond { public: virtual void lock() = 0; virtual void unlock() = 0; virtual void wait(size_t) = 0; virtual void wake() = 0; };

class NormalCond : public Cond { /* ... */ };
class LayeredCond : public Cond { /* ... */ };

template <typename T>
class Stack { /* ... */ };

// Wall-clock timing helpers: cost_begin() records a start time,
// cost_end() returns the elapsed milliseconds since that start.
inline struct timeval cost_begin() { struct timeval tv; gettimeofday(&tv, NULL); return tv; }
inline long cost_end(struct timeval &tv) { struct timeval tv2; gettimeofday(&tv2, NULL); tv2.tv_sec -= tv.tv_sec; tv2.tv_usec -= tv.tv_usec; return tv2.tv_sec*1000 + tv2.tv_usec/1000; }

struct ThreadParam { size_t layer; Stack<Job*>* inputQ; Stack<Job*>* outputQ; JobRunner* runner; };

void* thread_func(void *data) { /* ... */ }
void force_cpu(pthread_t t, int n) { /* ... */ }
void usage(const char* bin) { /* ... */ }

int main(int argc, char* const* argv) { /* ... */ }

The article concludes that while kernel load‑balancing generally helps, specific workloads (especially those benefiting from cache locality) may perform better with tailored affinity and priority schemes, and that developers should analyze each case individually.

Tags: load balancing, performance tuning, thread affinity, Linux, multithreading, kernel scheduling
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
