Operations 45 min read

Why the Thundering Herd Problem Slows Your Linux Server and How to Fix It

The thundering herd problem in Linux occurs when multiple processes or threads wake up simultaneously for a single event. The result is wasted CPU cycles, excessive context switches, lock contention, and degraded performance. It can be mitigated with kernel-level exclusive wake-ups, the EPOLLEXCLUSIVE flag, SO_REUSEPORT, thread pools, and other strategies.

Deepin Linux

Part 1 – What Is the Thundering Herd Problem?

In the complex world of Linux servers, performance optimization is an endless marathon, and among the many hidden "reefs" that affect performance, the thundering herd effect is both subtle and far‑reaching. Imagine several processes or threads like a flock of birds waiting in a nest for the same event; when the event occurs, all birds are startled and rush forward, but only one bird can actually obtain and handle the event while the others return to wait again. This vivid analogy illustrates the Linux thundering herd problem.

The effect is far from harmless. It triggers frequent, useless scheduling of processes or threads, wasting CPU time on context switches and dragging down system performance. To ensure that only one process acquires the resource, developers often add locking, which brings further overhead. The problem has evolved along with the kernel: the accept() herd has been solved in the kernel since Linux 2.6, but newer variants such as the epoll herd continue to challenge developers. The following sections dissect the problem and explore effective strategies to eliminate the wasted work.

Part 2 – Harmful Effects

2.1 System Performance Loss

When the thundering herd occurs, the Linux kernel repeatedly schedules and context-switches processes or threads that end up with nothing to do, which can sharply reduce system performance. At high context‑switch rates the CPU spends much of its time saving and restoring registers and updating run‑queue state instead of doing useful work.

Consider a simple web server using a multi‑process model where each process blocks on accept(). When a new connection arrives, all blocked processes are awakened, but only one can successfully accept the connection; the others return to sleep, incurring additional scheduling, register saving/loading, and cache invalidation costs.

2.2 Resource Contention and Lock Overhead

To guarantee that only one process or thread obtains the resource during a herd, developers often serialize the operation with a lock, which introduces a new bottleneck. For example, Nginx provides the accept_mutex directive so that worker processes take turns accepting new connections. It was enabled by default before version 1.11.3 and has been off by default since; the Nginx documentation notes that it is unnecessary on systems that support EPOLLEXCLUSIVE or when reuseport is used. The lock avoids herd‑induced wake‑ups, but every acquisition and release adds overhead of its own.
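The lock-around-accept idea can be sketched as follows. This is only an illustration under stated assumptions: it uses an fcntl() record lock on a descriptor opened before fork(), the helper names accept_lock, accept_unlock, and serialized_accept are hypothetical, and it is not Nginx's actual implementation.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Block until this process holds an exclusive lock on the whole file.
 * fcntl() locks are owned per process, so children that inherit lock_fd
 * across fork() still exclude one another. */
int accept_lock(int lock_fd) {
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    return fcntl(lock_fd, F_SETLKW, &fl);
}

/* Release the lock taken by accept_lock(). */
int accept_unlock(int lock_fd) {
    struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    return fcntl(lock_fd, F_SETLK, &fl);
}

/* Accept one connection while holding the lock, so that only one worker
 * at a time is waiting inside accept(). Returns the connection fd or -1. */
int serialized_accept(int lock_fd, int listen_fd) {
    if (accept_lock(lock_fd) == -1)
        return -1;
    int conn = accept(listen_fd, NULL, NULL);
    accept_unlock(lock_fd);
    return conn;
}
```

Each worker would call serialized_accept() in its loop. The two extra fcntl() calls per connection are exactly the kind of locking overhead this section describes.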

Part 3 – Common Herd Scenarios

3.1 accept Herd

In network programming, the accept herd is common. A parent process creates a listening socket, then forks multiple child processes that inherit the socket and call accept() in a loop. When a new connection arrives, all children blocked on accept() are awakened, but only one succeeds; the others find nothing left to accept and go back to sleep (or, on a non‑blocking socket, get EAGAIN), wasting CPU cycles.

This issue existed in kernels before Linux 2.6. The kernel then introduced an exclusive wait‑queue flag (WQ_FLAG_EXCLUSIVE) so that only the first process waiting in accept() is woken, which effectively solved the problem for blocking accept().

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <string.h>
#include <netinet/in.h>
#include <unistd.h>

#define PROCESS_NUM 10

int main() {
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    int connfd;
    int pid;
    char sendbuff[1024];
    struct sockaddr_in serveraddr;
    serveraddr.sin_family = AF_INET;
    serveraddr.sin_addr.s_addr = htonl(INADDR_ANY);
    serveraddr.sin_port = htons(1234);
    bind(fd, (struct sockaddr *)&serveraddr, sizeof(serveraddr));
    listen(fd, 1024);

    for (int i = 0; i < PROCESS_NUM; ++i) {
        pid = fork();
        if (pid == 0) {
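            while (1) {
                connfd = accept(fd, NULL, NULL);
                if (connfd < 0) continue;  /* failed or interrupted accept: retry */
                snprintf(sendbuff, sizeof(sendbuff), "process PID = %d\n", getpid());
                send(connfd, sendbuff, strlen(sendbuff), 0);  /* text only, no trailing NUL */
                printf("process %d accept success\n", getpid());
                close(connfd);
            }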
        }
    }
    wait(0);
    return 0;
}

Compile and run the code, then connect with telnet 127.0.0.1 1234. Each connection produces a single "accept success" line from one process, which is consistent with only one process being woken and shows that the accept herd has been resolved in modern kernels.

3.2 epoll Herd

epoll is a high‑performance I/O notification mechanism, but improper use can also cause a herd. Two typical scenarios are:

epollfd created before fork: The parent creates an epoll instance, then forks children that inherit the same epollfd. When a new connection arrives, several children blocked in epoll_wait() on that shared instance can be awakened, leading to a herd.

epollfd created after fork: Each child creates its own epoll instance after forking and registers the inherited listening socket in it. Each epoll instance is notified independently, so the kernel wakes all children, and the herd persists.

Below is a verification program for the second scenario:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/wait.h>

#define IP   "127.0.0.1"
#define PORT  8888
#define PROCESS_NUM 4
#define MAXEVENTS 64

static int create_and_bind() {
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    struct sockaddr_in serveraddr;
    serveraddr.sin_family = AF_INET;
    inet_pton(AF_INET, IP, &serveraddr.sin_addr);
    serveraddr.sin_port = htons(PORT);
    bind(fd, (struct sockaddr *)&serveraddr, sizeof(serveraddr));
    return fd;
}

static int make_socket_non_blocking(int sfd) {
    int flags = fcntl(sfd, F_GETFL, 0);
    if (flags == -1) { perror("fcntl"); return -1; }
    flags |= O_NONBLOCK;
    if (fcntl(sfd, F_SETFL, flags) == -1) { perror("fcntl"); return -1; }
    return 0;
}

int worker(int sfd, int efd, struct epoll_event *events, int k) {
    while (1) {
        int n = epoll_wait(efd, events, MAXEVENTS, -1);
        sleep(1);
        printf("worker %d return from epoll_wait!\n", k);
        for (int i = 0; i < n; i++) {
            if ((events[i].events & EPOLLERR) || (events[i].events & EPOLLHUP) || !(events[i].events & EPOLLIN)) {
                fprintf(stderr, "epoll error\n");
                close(events[i].data.fd);
                continue;
            } else if (sfd == events[i].data.fd) {
                while (1) {
                    struct sockaddr in_addr;
                    socklen_t in_len = sizeof(in_addr);
                    int infd = accept(sfd, &in_addr, &in_len);
                    if (infd == -1) {
                        if (errno == EAGAIN || errno == EWOULDBLOCK) break;
                        perror("accept"); break;
                    }
                    char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];
                    if (getnameinfo(&in_addr, in_len, hbuf, sizeof(hbuf), sbuf, sizeof(sbuf), NI_NUMERICHOST | NI_NUMERICSERV) == 0)
                        printf("Accepted connection on %d (host=%s, port=%s)\n", infd, hbuf, sbuf);
                    make_socket_non_blocking(infd);
                    struct epoll_event ev = { .data.fd = infd, .events = EPOLLIN | EPOLLET };
                    epoll_ctl(efd, EPOLL_CTL_ADD, infd, &ev);
                }
                continue;
            }
        }
    }
    return 0;
}

int main() {
    int sfd = create_and_bind();
    make_socket_non_blocking(sfd);
    listen(sfd, SOMAXCONN);
    for (int i = 0; i < PROCESS_NUM; ++i) {
        pid_t pid = fork();
        if (pid == 0) {
            int efd = epoll_create1(0);
            struct epoll_event ev = { .data.fd = sfd, .events = EPOLLIN | EPOLLET };
            epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &ev);
            struct epoll_event *events = calloc(MAXEVENTS, sizeof *events);
            worker(sfd, efd, events, i);
            free(events);
            close(efd);
            exit(0);
        }
    }
    for (int i = 0; i < PROCESS_NUM; ++i) wait(NULL);
    close(sfd);
    return 0;
}

3.3 poll/select Herd

Both poll() and select() suffer from the herd problem. The following simple poll‑based server demonstrates that multiple processes are awakened, but only one can accept the new connection.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <string.h>
#include <netinet/in.h>
#include <unistd.h>
#include <errno.h>
#include <poll.h>

#define PROCESS_NUM 10

int main() {
    int fd = socket(PF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    bind(fd, ...);
    listen(fd, 1024);
    for (int i = 0; i < PROCESS_NUM; ++i) {
        pid_t pid = fork();
        if (pid == 0) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            while (1) {
                int ret = poll(&pfd, 1, -1);
                if (ret > 0 && (pfd.revents & POLLIN)) {
                    int new_fd = accept(fd, NULL, NULL);
                    if (new_fd >= 0) {
                        printf("process %d accepted\n", getpid());
                        close(new_fd);
                    }
                }
            }
        }
    }
    wait(0);
    return 0;
}

Part 4 – Solutions to the Thundering Herd

4.1 epoll + EPOLLEXCLUSIVE

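Linux 4.5 introduced the EPOLLEXCLUSIVE flag. When several epoll instances watch the same listening socket and each registers it with this flag, the kernel wakes one or more of them for a new connection rather than all of them, which largely eliminates the herd. The flag can only be set with EPOLL_CTL_ADD. The listing below shows the registration in a single process; in a multi‑worker server each worker would create its own epoll instance and register the shared listening socket the same way.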

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PORT 8080
#define MAX_EVENTS 10

int main() {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in address = { .sin_family = AF_INET, .sin_addr.s_addr = INADDR_ANY, .sin_port = htons(PORT) };
    bind(server_fd, (struct sockaddr *)&address, sizeof(address));
    listen(server_fd, 10);
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .data.fd = server_fd, .events = EPOLLIN | EPOLLEXCLUSIVE };
    epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev);
    struct epoll_event events[MAX_EVENTS];
    while (1) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == server_fd) {
                int client = accept(server_fd, NULL, NULL);
                if (client != -1) {
                    char buf[1024];
                    ssize_t len = read(client, buf, sizeof(buf));
                    if (len > 0) send(client, buf, len, 0);
                    close(client);
                }
            }
        }
    }
    close(server_fd);
    close(epfd);
    return 0;
}
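Because EPOLLEXCLUSIVE only has an effect when several epoll instances are attached to the same socket, each worker needs to perform the registration itself. A minimal sketch of such a registration helper follows; the name add_listener_exclusive is hypothetical, and the compile-time fallback to a plain EPOLLIN registration is an assumption about how one might handle headers that predate the flag.

```c
#include <sys/epoll.h>

/* Register fd for read events on epfd, requesting exclusive wake-ups
 * when the headers define EPOLLEXCLUSIVE (Linux 4.5 and later).
 * Returns 0 on success, -1 on failure with errno set by epoll_ctl(). */
int add_listener_exclusive(int epfd, int fd) {
    struct epoll_event ev = { 0 };
    ev.data.fd = fd;
    ev.events = EPOLLIN;
#ifdef EPOLLEXCLUSIVE
    ev.events |= EPOLLEXCLUSIVE;  /* only valid with EPOLL_CTL_ADD */
#endif
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

In a forked-worker design, each child would call epoll_create1(0) after fork() and then add_listener_exclusive(efd, listen_fd) on the inherited listening socket.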

4.2 Load Balancing with SO_REUSEPORT

Since Linux 3.9, the SO_REUSEPORT socket option allows multiple sockets to bind to the same port. The kernel then distributes incoming connections among the sockets, providing natural load balancing and eliminating the herd.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <string.h>
#include <sys/wait.h>

#define PORT 8888
#define WORKER 4

int worker(int i) {
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    int val = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &val, sizeof(val));
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_addr.s_addr = inet_addr("127.0.0.1"), .sin_port = htons(PORT) };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 5);
    while (1) {
        struct sockaddr_in client;
        socklen_t clilen = sizeof(client);
        int conn = accept(fd, (struct sockaddr *)&client, &clilen);
        if (conn != -1) {
            char buf[1024];
            ssize_t n = recv(conn, buf, sizeof(buf), 0);
            if (n > 0) {
                send(conn, buf, n, 0);
                printf("Worker %d received: %.*s\n", i, (int)n, buf);
            }
            close(conn);
        }
    }
    close(fd);
    return 0;
}

int main() {
    for (int i = 0; i < WORKER; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            worker(i);
            exit(0);
        }
    }
    for (int i = 0; i < WORKER; i++) wait(NULL);
    return 0;
}

4.3 Thread‑Pool Model

A thread pool creates a fixed number of worker threads that process incoming connections from a task queue, avoiding the cost of repeatedly creating threads and reducing context‑switch overhead.

Main thread listens on the socket and enqueues accepted connections.

Worker threads dequeue tasks and handle the I/O.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

#define PORT 8080
#define THREAD_NUM 4
#define QUEUE_SIZE 100

typedef struct { int fd; } Task;

typedef struct {
    Task queue[QUEUE_SIZE];
    int front, rear;
    pthread_mutex_t mutex;
    pthread_cond_t cond;
} TaskQueue;

TaskQueue taskQueue;

void initTaskQueue() {
    taskQueue.front = taskQueue.rear = 0;
    pthread_mutex_init(&taskQueue.mutex, NULL);
    pthread_cond_init(&taskQueue.cond, NULL);
}

void enqueueTask(int fd) {
    pthread_mutex_lock(&taskQueue.mutex);
    while ((taskQueue.rear + 1) % QUEUE_SIZE == taskQueue.front)
        pthread_cond_wait(&taskQueue.cond, &taskQueue.mutex);
    taskQueue.queue[taskQueue.rear].fd = fd;
    taskQueue.rear = (taskQueue.rear + 1) % QUEUE_SIZE;
    pthread_cond_signal(&taskQueue.cond);
    pthread_mutex_unlock(&taskQueue.mutex);
}

int dequeueTask() {
    pthread_mutex_lock(&taskQueue.mutex);
    while (taskQueue.front == taskQueue.rear)
        pthread_cond_wait(&taskQueue.cond, &taskQueue.mutex);
    int fd = taskQueue.queue[taskQueue.front].fd;
    taskQueue.front = (taskQueue.front + 1) % QUEUE_SIZE;
    pthread_cond_signal(&taskQueue.cond);
    pthread_mutex_unlock(&taskQueue.mutex);
    return fd;
}

void *worker(void *arg) {
    while (1) {
        int fd = dequeueTask();
        char buf[1024] = {0};
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            send(fd, buf, n, 0);
            printf("Message from client: %.*s\n", (int)n, buf);  /* buf may not be NUL-terminated */
        }
        close(fd);
    }
    return NULL;
}

int main() {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_addr.s_addr = INADDR_ANY, .sin_port = htons(PORT) };
    bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(server_fd, 10);
    initTaskQueue();
    pthread_t threads[THREAD_NUM];
    for (int i = 0; i < THREAD_NUM; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    while (1) {
        struct sockaddr_in client;
        socklen_t clilen = sizeof(client);
        int conn = accept(server_fd, (struct sockaddr *)&client, &clilen);
        if (conn != -1) enqueueTask(conn);
    }
    return 0;
}

4.4 Spin‑Lock + Wake‑Up Optimization

In highly contended scenarios with very short critical sections, a spin‑lock avoids putting threads to sleep just to update shared state. POSIX condition variables only work with a pthread_mutex_t, so a spin‑lock cannot be passed to pthread_cond_wait(); to control wake‑ups precisely, pair the spin‑lock with a semaphore instead. Each sem_post() releases at most one waiter, so one thread is woken per available item rather than the whole group, further mitigating the herd effect.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>

#define THREAD_NUM 10

pthread_spinlock_t spinlock;  /* protects shared_resource; held only briefly */
sem_t available;              /* counts items; each sem_post() wakes at most one waiter */
int shared_resource = 0;

void *thread_function(void *arg) {
    int id = *(int *)arg;
    sem_wait(&available);             /* sleep until one item has been posted */
    pthread_spin_lock(&spinlock);
    shared_resource--;                /* claim the item */
    pthread_spin_unlock(&spinlock);
    printf("Thread %d processing shared resource\n", id);
    return NULL;
}

int main() {
    pthread_t threads[THREAD_NUM];
    int ids[THREAD_NUM];
    pthread_spin_init(&spinlock, PTHREAD_PROCESS_PRIVATE);
    sem_init(&available, 0, 0);
    for (int i = 0; i < THREAD_NUM; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, thread_function, &ids[i]);
    }
    /* Publish one item per thread so that every thread eventually runs. */
    for (int i = 0; i < THREAD_NUM; i++) {
        pthread_spin_lock(&spinlock);
        shared_resource++;
        pthread_spin_unlock(&spinlock);
        sem_post(&available);         /* wake a single waiting thread */
    }
    for (int i = 0; i < THREAD_NUM; i++) pthread_join(threads[i], NULL);
    pthread_spin_destroy(&spinlock);
    sem_destroy(&available);
    return 0;
}

Part 5 – epoll Herd Case Study

Conclusion: in this test, where all processes share one epoll instance, Edge‑Triggered (ET) mode shows no herd, while Level‑Triggered (LT) mode does.

The following program demonstrates the behavior in both modes. When EPOLLET is set (ET), only one process repeatedly handles new connections. When EPOLLET is omitted (LT), multiple processes may be awakened, and the number of awakened processes depends on how long each process holds the event.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <string.h>
#include <netinet/in.h>
#include <unistd.h>
#include <errno.h>
#include <sys/epoll.h>

#define MAXEVENTS 64
#define PROCESS_NUM 10

int main() {
    int fd = socket(PF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    struct sockaddr_in serveraddr = { .sin_family = AF_INET, .sin_addr.s_addr = htonl(INADDR_ANY), .sin_port = htons(2222) };
    bind(fd, (struct sockaddr *)&serveraddr, sizeof(serveraddr));
    listen(fd, 1024);
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .data.fd = fd, .events = EPOLLIN | EPOLLET }; // change to EPOLLIN for LT mode
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    struct epoll_event *events = calloc(MAXEVENTS, sizeof *events);
    for (int i = 0; i < PROCESS_NUM; ++i) {
        pid_t pid = fork();
        if (pid == 0) {
            while (1) {
                printf("I'm pid: %d, epoll on : %d\n", getpid(), fd);
                int n = epoll_wait(epfd, events, MAXEVENTS, -1);
                if (n < 0) { perror("epoll_wait"); continue; }
                for (int j = 0; j < n; ++j) {
                    if (events[j].data.fd == fd) {
                        int cfd = accept(fd, NULL, NULL);
                        if (cfd >= 0) {
                            printf("new read event: accept new_fd: %d on pid: %d\n", cfd, getpid());
                            close(cfd);
                        }
                    }
                }
            }
        }
    }
    wait(0);
    return 0;
}

Running the program in ET mode shows one process handling each connection, often the same process repeatedly, because waiters on the shared epoll instance are woken exclusively and the first one keeps being chosen. In LT mode, adding a short sleep() after epoll_wait(), so that the connection stays pending while other waiters are notified, reveals that many processes are awakened, confirming the herd effect.
