Building a High‑Performance HTTP Server with Multi‑Threaded epoll and UDP Load Balancing

This article details a competition‑level solution that implements a high‑throughput HTTP server using a custom multi‑threaded epoll model, UDP‑based protocol conversion, sparse indexing, and optimized write‑read schemes to achieve top performance on limited‑resource hardware.

Architects' Tech Alliance
Architects' Tech Alliance
Architects' Tech Alliance
Building a High‑Performance HTTP Server with Multi‑Threaded epoll and UDP Load Balancing

The fourth Alibaba Middleware Performance Challenge, co‑hosted by Alibaba Cloud Tianchi, required participants to design a high‑performance HTTP server and a massive‑scale queue storage system. The author, a 1990s‑born developer from Nanjing University of Science and Technology, shares the winning approach.

Preliminary Round – Problem Restatement

The task was to implement a high‑performance HTTP server with an efficient load‑balancing algorithm and simple protocol conversion. Standard HTTP frameworks were unsuitable due to shared Docker resources; a custom solution was needed.

Network Programming Model Choice

epoll was selected for its scalability under high connection counts, avoiding the linear overhead of select/poll. Existing async libraries (libuv, libevent) use a one‑loop‑per‑thread model, which can cause load imbalance. The author designed a multi‑threaded epoll where multiple threads share a single epoll instance, allowing the OS scheduler to balance requests.

Overall Architecture

The Consumer Agent parses HTTP requests, converts them to Dubbo requests, and sends them via UDP to the Provider Agent. The Provider Agent forwards the UDP payload to a Dubbo service over TCP, receives the response, and returns it to the Consumer Agent, which finally replies to the client in HTTP.

Middle‑Protocol Design

UDP was chosen for its low latency and high throughput. Each request is a single‑packet Dubbo request; responses are single‑packet with multiple replies. Because packets are atomic, ordering issues are ignored, and the system can reassemble UDP packets directly into TCP streams.

Load‑Balancing Algorithm

A counter‑based algorithm sends each request to the Provider whose current load deviates most from the configured ratio. If all three providers exceed a load threshold (200), requests are queued and prioritized when a provider becomes available.

Innovations and Engineering Value

Consumer Agent implements load balancing and request throttling to prevent overload propagation.

UDP transport eliminates connection overhead; single‑packet semantics tolerate loss without protocol corruption.

Request IDs embed caller info and timestamps for precise performance metrics.

Multi‑threaded epoll in edge‑triggered mode maximizes NIC utilization and eliminates per‑thread queues.

Appropriate TCP buffer sizing ensures non‑blocking writes.

Semifinal Round – Problem Restatement

The second stage required storing and retrieving one million queues within a single process on a 4‑CPU, 8‑GB machine, demanding efficient on‑disk layout, sparse indexing, and high‑throughput sequential writes and random reads.

System Architecture

The design separates queue objects, a name‑to‑queue map, a file manager, and a control module that switches read/write states.

Message Storage Scheme

Messages are stored in fixed‑size blocks (4 KB). Each message consists of a variable‑length integer indicating the remaining payload size followed by the payload, allowing cross‑block messages and arbitrary lengths.

Write Schemes

1. mmap (Java) – Discarded : Used reference counting for mmap blocks, avoiding flushes but suffered from page‑cache back‑pressure, limiting block size and performance.

2. Whole‑Memory Reference (Java) – Discarded : Similar to mmap but with an auxiliary thread flushing buffers; achieved 1.67 M writes but still constrained by memory.

3. Fragmented Memory Pool (C++) – Best Result : Blocks are split into slices; slices are gathered via iovec and written in larger parts. Slices are returned to the pool after write, enabling fine‑grained allocation and high throughput.

The ring‑buffer implementation reuses objects without allocation overhead.

Index Scheme

A sparse index records only block offset, message count per block, and whether the last message crosses the block. Locating a message involves accumulating counts to find the block, reading the block (4 KB), and skipping within‑block messages based on the cross‑block flag.

Random and Sequential Reads

Random reads often fit within a single block, allowing one I/O operation. Sequential reads trigger read‑ahead for the next block, improving throughput by ~40% because each 4 KB block contains only relevant data.

Map Container Optimization

Queue names are mapped to objects via a suffix‑aware hash that reduces collisions. The hash space is split into 32 sharded unordered_map instances protected by spin locks, achieving hundreds of milliseconds faster insertion and tens of milliseconds faster lookup for one million entries.

Read/Write State Switching

Initially the system kept large in‑memory buffers, but this hurt read performance. The solution flushes all buffers to disk before the first read, releasing ~5.5 GB of RAM, then switches back to write mode when needed.

Key Engineering Contributions

Fragmented memory pool + iovec enables simultaneous memory and disk writes under tight memory constraints.

Suffix‑aware hash map drastically reduces collision probability for numerically suffixed queue names.

Batching writes into large blocks leverages SSD sequential write performance.

Variable‑length integer + cross‑block handling supports arbitrarily long messages.

Sparse index occupies only ~166 MB for one million blocks, fitting comfortably in RAM.

Ring‑buffer based object reuse maximizes CPU cache efficiency.

All parameters (buffer sizes, pool sizes, etc.) are configurable via macros for different deployment scenarios.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

load balancingbackend optimizationepollUDPmemory poolhigh-performance serversparse indexing
Architects' Tech Alliance
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.