Mastering C10K: Modern Techniques to Scale Server Concurrency

This article reviews the historical C10K challenge, explains IO model improvements like epoll, kqueue and IOCP, and details practical Linux performance optimizations such as CPU and memory affinity, RSS/RPS/RFS/XPS, IRQ handling, kernel tuning, and hardware utilization for high‑concurrency servers.

ITFLY8 Architecture Home

Apr 25, 2018

C10K Era Problems and Optimization Techniques

First, we revisit the original C10K problem (serving 10,000 concurrent connections on a single machine) and the improvements made to raise single‑machine concurrency. Although the early Chinese Internet was not heavily affected, the global community recognized the need to optimize IO models, which led to solutions such as epoll on Linux, kqueue on BSD, and IOCP on Windows.

These IO models gave rise to libraries and frameworks like libevent and libev, and enabled high‑performance servers such as Nginx, HAProxy and Squid, which remain essential for handling massive concurrent traffic today.
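To make the epoll model concrete, here is a minimal sketch of a readiness-driven echo loop using Python's `select.epoll` (Linux only). The function name `serve_echo` and the `max_events` cut-off are assumptions made for illustration, so the loop can be exercised and then return. A production server would run indefinitely and would buffer writes instead of calling `sendall` on a non-blocking socket.

```python
import select
import socket


def serve_echo(listener: socket.socket, max_events: int) -> int:
    """Echo incoming data on all connections; return after `max_events` events.

    `listener` must already be bound and listening. One thread watches every
    socket through a single epoll instance instead of one thread per client.
    """
    listener.setblocking(False)
    ep = select.epoll()
    ep.register(listener.fileno(), select.EPOLLIN)
    conns = {}  # fd -> socket object
    handled = 0
    try:
        while handled < max_events:
            events = ep.poll(1.0)  # wait up to one second for readiness
            if not events:
                break  # idle: stop so the sketch always terminates
            for fd, _mask in events:
                handled += 1
                if fd == listener.fileno():
                    # The listening socket is readable: a client is waiting.
                    conn, _addr = listener.accept()
                    conn.setblocking(False)
                    conns[conn.fileno()] = conn
                    ep.register(conn.fileno(), select.EPOLLIN)
                    continue
                conn = conns[fd]
                data = conn.recv(4096)
                if data:
                    # Fine for small payloads; real servers queue on EAGAIN.
                    conn.sendall(data)
                else:
                    # Empty read means the peer closed the connection.
                    ep.unregister(fd)
                    conns.pop(fd).close()
    finally:
        for conn in conns.values():
            conn.close()
        ep.close()
    return handled
```

The key property is that the cost of each `poll` call depends on the number of ready sockets, not on the total number registered, which is what lets one thread track many thousands of mostly idle connections.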

CPU Affinity & Memory Locality

Modern x86 servers are multi‑core and often use NUMA architectures. Without explicit binding, the OS may migrate a task between cores or NUMA nodes, which discards warm caches and forces slower remote memory accesses. Using sched_setaffinity, taskset or numactl can keep a task and its data on the same core and memory node, reducing this overhead.
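As a rough illustration of CPU binding from inside a program, the sketch below uses Python's `os.sched_setaffinity`, a thin wrapper over the Linux `sched_setaffinity` system call. The helper name `pin_to_cpu` is hypothetical. It only handles CPU placement; memory placement would still need numactl or a NUMA-aware allocator.

```python
import os


def pin_to_cpu(cpu: int) -> set:
    """Restrict the caller (pid 0) to one CPU and return the resulting set."""
    allowed = os.sched_getaffinity(0)
    if cpu not in allowed:
        # Containers and cgroups can hide CPUs, so check before binding.
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

A worker process would typically call this once at start-up, before it allocates its main data structures, so that memory touched afterwards tends to be allocated on the local node. The command-line equivalent for an unmodified program is `taskset -c <cpu> <command>`.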

RSS, RPS, RFS, XPS

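These Linux networking features distribute packet processing across multiple CPU cores. RSS (Receive Side Scaling) requires hardware support in the form of a multi‑queue NIC. RPS and RFS (Receive Packet Steering and Receive Flow Steering) provide software‑based steering for NICs that lack enough hardware queues, and XPS (Transmit Packet Steering) maps CPUs to transmit queues for outbound traffic. Together they improve network throughput on multi‑core systems.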
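RPS is configured per receive queue by writing a CPU bitmask to sysfs. The following sketch builds that mask and writes it. The helper names (`cpu_mask`, `rps_path`, `set_rps`) are hypothetical, and the write needs root privileges on a real system.

```python
def cpu_mask(cpus) -> str:
    """Encode CPU numbers as the kernel's hex bitmap string.

    The kernel reads the mask as comma-separated groups of up to eight hex
    digits (32 bits each), most significant group first.
    """
    bits = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        bits |= 1 << cpu
    digits = f"{bits:x}"
    groups = []
    while digits:
        groups.append(digits[-8:])  # peel off the lowest 32 bits
        digits = digits[:-8]
    return ",".join(reversed(groups))


def rps_path(dev: str, rx_queue: int) -> str:
    """sysfs file that holds the RPS CPU mask for one receive queue."""
    return f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus"


def set_rps(dev: str, rx_queue: int, cpus) -> None:
    """Steer packets from one receive queue to the given CPUs (root only)."""
    with open(rps_path(dev, rx_queue), "w") as f:
        f.write(cpu_mask(cpus))
```

For example, `cpu_mask([0, 1, 2, 3])` yields `f`, which steers a queue's packets to the first four CPUs. XPS uses the same mask format in the `tx-<n>/xps_cpus` files of the same directory tree.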

IRQ Optimization

Interrupt coalescing and NAPI reduce the number of interrupts generated by high‑rate network traffic. IRQ affinity binds NIC queue interrupts to specific CPUs, much as CPU affinity binds tasks. Offload features lower the per‑packet CPU cost further: TSO and LRO move segmentation and aggregation to the NIC, while GSO and GRO are their software counterparts in the kernel, which delay segmentation and merge received packets so the stack handles fewer, larger units.
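IRQ affinity can be sketched in three steps: find the IRQ numbers that belong to a NIC in `/proc/interrupts`, spread them over a chosen set of CPUs, and write each assignment to `/proc/irq/<n>/smp_affinity_list`. The function names below are hypothetical, the round-robin policy is just one simple choice, and the write step requires root.

```python
def nic_irqs(interrupts_text: str, name: str) -> list:
    """Return IRQ numbers whose /proc/interrupts line mentions `name`."""
    irqs = []
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        # Skip the CPU header row and non-numeric rows such as NMI or LOC.
        if sep and head.strip().isdigit() and name in rest:
            irqs.append(int(head.strip()))
    return irqs


def plan_irq_affinity(irqs, cpus) -> dict:
    """Assign IRQs to CPUs round-robin and return {irq: cpu}."""
    if not cpus:
        raise ValueError("need at least one CPU")
    return {irq: cpus[i % len(cpus)] for i, irq in enumerate(sorted(irqs))}


def apply_irq_affinity(plan: dict) -> None:
    """Write the plan to procfs (root only)."""
    for irq, cpu in plan.items():
        with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
            f.write(str(cpu))
```

If the irqbalance daemon is running it may rewrite these settings, so manual pinning is usually combined with stopping it or excluding the affected IRQs.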

Kernel Tuning

Key kernel parameters under net.ipv4.* and net.core.* control timeouts and buffer sizes. Adjusting them can improve performance, but each change should be checked against the kernel documentation and tested under realistic load to avoid side effects.
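Before changing anything, it helps to see how the running values differ from the intended ones. The sketch below reads sysctl values through `/proc/sys` and reports only the differences. The helper names and the `root` parameter are assumptions for illustration, and any target values passed in are the caller's own choices, not recommendations.

```python
from pathlib import Path


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name such as net.core.somaxconn to its procfs file."""
    return Path(root, *name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> str:
    """Return the current value of one sysctl as a stripped string."""
    return sysctl_path(name, root).read_text().strip()


def diff_sysctls(wanted: dict, root: str = "/proc/sys") -> dict:
    """Return {name: (current, wanted)} for every setting that differs."""
    changes = {}
    for name, want in wanted.items():
        current = read_sysctl(name, root)
        # Multi-value entries (for example tcp_rmem) are tab-separated in
        # procfs, so compare token lists rather than raw strings.
        if current.split() != str(want).split():
            changes[name] = (current, str(want))
    return changes
```

Calling `diff_sysctls({"net.core.somaxconn": 1024, "net.ipv4.tcp_fin_timeout": 30})` returns only the entries that would change, which makes it easier to review and test one parameter at a time.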

Deeper Exploration and Practice

Beyond software, understanding hardware is crucial. At the time of writing, Intel Xeon CPUs offer up to 22 cores per socket, a large last‑level cache (LLC), and high memory bandwidth. Multi‑queue NICs (e.g., Intel X710) provide RSS, SR‑IOV virtual functions, and various offload capabilities.

Effective use of hardware involves three key areas: packet processing, task scheduling, and memory handling.

By mapping received packets directly into user‑space memory (e.g., via UIO), applications can bypass the kernel network stack. The cost is that the application must then supply its own protocol handling, plus a compatibility layer if code written against the existing socket APIs is to keep working.

Linux kernel improvements such as SO_REUSEPORT allow multiple processes to listen on the same port, each with its own socket and accept queue, which reduces contention on a single shared listening socket. However, global connection tables can still become bottlenecks under massive TCP loads.
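A minimal sketch of the SO_REUSEPORT pattern follows, assuming a Linux kernel that supports the option. The helper name `reuseport_listener` is hypothetical. Each worker process would call it with the same address and then run its own accept loop.

```python
import socket


def reuseport_listener(host: str, port: int, backlog: int = 128) -> socket.socket:
    """Create a TCP listener that can share its port with other listeners."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind() on every socket that shares the port.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(backlog)
    return s
```

With this option the kernel distributes incoming connections across the listening sockets, so workers no longer compete for one accept queue. Without it, the second `bind` to the same address and port would fail with "Address already in use".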

Optimizing task placement, so that all processing for a given connection stays on a single core, improves cache utilization and reduces cross‑core traffic. Techniques like lock‑free data structures and flattening pointer hierarchies further minimize cache misses.
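One simple way to keep a connection on one core in user space is to derive the worker index from the connection's address tuple, so every packet or event for that connection reaches the same worker. The sketch below uses CRC32 purely as a stand-in for whatever hash the NIC or kernel applies, and `worker_for` is a hypothetical name.

```python
import zlib


def worker_for(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
               n_workers: int) -> int:
    """Map a connection's 4-tuple to a stable worker index."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    # CRC32 is deterministic across processes, unlike Python's built-in
    # hash() for strings, which is randomized per process by default.
    return zlib.crc32(key) % n_workers
```

If each worker is also pinned to its own core, the state for a connection is only ever touched by one core, which removes the need for locks around per-connection data.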

Using huge pages mitigates TLB misses for large memory footprints, while pre‑allocating memory pools reduces allocation overhead for high‑throughput services.
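The memory pool idea can be sketched as follows: allocate one large slab up front and hand out fixed-size slices of it, so the hot path never calls the allocator. The `BufferPool` class is a hypothetical illustration. In a native server the slab would usually come from a huge-page mapping, which this Python sketch does not attempt.

```python
class BufferPool:
    """Fixed-size buffers carved out of a single pre-allocated slab."""

    def __init__(self, count: int, size: int):
        self._slab = bytearray(count * size)  # the only large allocation
        view = memoryview(self._slab)
        # Each entry is a zero-copy window onto the slab.
        self._free = [view[i * size:(i + 1) * size] for i in range(count)]

    @property
    def available(self) -> int:
        """Number of buffers currently free."""
        return len(self._free)

    def acquire(self) -> memoryview:
        """Take a buffer from the pool; fail fast when none are left."""
        if not self._free:
            raise MemoryError("buffer pool exhausted")
        return self._free.pop()

    def release(self, buf: memoryview) -> None:
        """Return a buffer so a later acquire() can reuse it."""
        self._free.append(buf)
```

A buffer obtained with `acquire()` can be passed straight to `socket.recv_into()`, and releasing it afterwards makes the same memory available to the next request without any new allocation.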

Overall, combining kernel‑level tuning, hardware‑aware programming, and careful resource scheduling enables servers to handle the C10K (and beyond) workload efficiently.

Source: http://geek.csdn.net/news/detail/57010
