How Polar‑TCP Breaks Kernel Network Bottlenecks for Million‑IOPS Cloud Services
This article explains why traditional kernel network stacks struggle with modern cloud data-center workloads and introduces Baidu Intelligent Cloud's Polar solution: Polar-TCP, which combines user-space DPDK drivers, a lightweight TCP stack, and an industrial-grade RPC framework to achieve near-RDMA performance while preserving ecosystem compatibility, and Polar-RDMA, a hardware-accelerated path for controlled network environments.
Traditional kernel network stacks were designed for generality, compatibility, and isolation, exposing BSD sockets to user space and isolating resources via system calls and memory separation. While this design ensured broad compatibility, it now creates severe performance bottlenecks in cloud data centers that demand high concurrency, bandwidth, and low latency.
Frequent system calls, context switches, multiple data copies, lock contention, and cache invalidation dramatically increase latency. The stack’s universal design also hinders deep optimization for specific high‑performance workloads such as distributed storage and databases.
To address these challenges, Baidu Intelligent Cloud introduced Polar, a solution with two core components. Polar-TCP is a full user-space link built on DPDK, a user-space protocol stack, and the BRPC RPC framework; it delivers near-RDMA performance while remaining compatible with the existing TCP ecosystem. Polar-RDMA is an optional hardware-accelerated path for controlled network environments.
Key solution concepts:
Data-plane kernel bypass (user-space stack + DPDK driver) plus ecosystem compatibility (non-TCP traffic still uses the kernel).
A polling-driven Run-To-Completion (RTC) per-thread architecture for lock-free parallel processing.
Zero-copy between the RPC layer and the protocol stack.
Deep BRPC performance tuning.
An RDMA extension for selective path acceleration.
1. Industry Background and Challenges
In the cloud‑native era, network performance has become the ceiling for business workloads. High concurrency, bandwidth, and low‑latency demands amplify the latency caused by system calls, context switches, data copies, lock contention, and TLB misses.
To achieve high throughput and low latency, the data path must be shortened, context switches eliminated, and copies removed while preserving engineering usability.
2. Polar’s Birth: A Scalable Path Between Generality and Extreme Performance
2.1 Design Philosophy
Polar‑TCP builds a full user‑space chain: DPDK (user‑space driver) + user‑space TCP stack + BRPC. This achieves near‑RDMA throughput while maintaining compatibility with the existing TCP ecosystem.
DPDK accesses NICs directly from user space, bypassing interrupt handling and kernel data copies, and leverages hardware offloads such as checksum, TSO, and GRO.
A lightweight TCP stack uses a polling‑based RTC model, per‑thread resource ownership, and lock‑free processing.
BRPC is deeply optimized for zero‑copy, efficient serialization, and high‑performance RPC.
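For reference, a plain synchronous BRPC call looks like the sketch below, which follows the open-source brpc project's echo example; the EchoService protobuf types and the endpoint address are assumptions for illustration, and nothing here is Polar-specific.

```cpp
// A minimal synchronous brpc call, modeled on the open-source project's
// echo example. The EchoService/EchoRequest/EchoResponse types come from
// a protobuf file (assumed here); the endpoint is illustrative.
#include <brpc/channel.h>
#include "echo.pb.h"  // assumed protobuf-generated header

int main() {
    brpc::Channel channel;
    brpc::ChannelOptions options;
    options.timeout_ms = 100;  // fail fast on a congested path
    if (channel.Init("10.0.0.2:8000", &options) != 0) return -1;

    example::EchoService_Stub stub(&channel);
    brpc::Controller cntl;
    example::EchoRequest request;
    example::EchoResponse response;
    request.set_message("hello");
    // Passing nullptr as the closure makes the call synchronous.
    stub.Echo(&cntl, &request, &response, nullptr);
    return cntl.Failed() ? 1 : 0;
}
```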
2.2.1 DPDK – Direct Hardware Access
DPDK eliminates kernel‑user data copies and reduces interrupt overhead, allowing packets to be processed entirely in user space.
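The sketch below shows the shape of a DPDK poll-mode receive loop using standard DPDK APIs (rte_eal_init, rte_eth_rx_burst); port and queue setup are elided and process_packet is a hypothetical handler, so treat it as an illustration of the technique rather than Polar's code.

```cpp
// Sketch of a DPDK poll-mode receive loop using standard DPDK APIs.
// Device/queue setup (rte_eth_dev_configure, rte_eth_rx_queue_setup,
// rte_eth_dev_start) is elided; this is not Polar's actual code.
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void rx_loop(uint16_t port_id, uint16_t queue_id) {
    struct rte_mbuf *pkts[32];
    for (;;) {  // busy-poll: no interrupts, no syscalls on the data path
        const uint16_t n = rte_eth_rx_burst(port_id, queue_id, pkts, 32);
        for (uint16_t i = 0; i < n; ++i) {
            // The frame already sits in user-space memory mapped for DMA,
            // so it can be handed to the user-space TCP stack with no copy.
            // process_packet(rte_pktmbuf_mtod(pkts[i], void *),
            //                pkts[i]->data_len);  // hypothetical handler
            rte_pktmbuf_free(pkts[i]);
        }
    }
}

int main(int argc, char **argv) {
    if (rte_eal_init(argc, argv) < 0) return -1;
    // ... NIC port and RX queue setup elided ...
    rx_loop(0 /* port */, 0 /* queue */);
}
```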
2.2.2 Polar‑TCP – Breaking the Protocol Stack Bottleneck
Polar‑TCP ports FreeBSD TCP core logic to user space, applies polling + RTC, lock‑free processing, and end‑to‑end zero‑copy. It remains compatible with kernel TCP, enabling seamless migration.
Thread Model (Polling + RTC)
Multiple polling threads each run a set of non-blocking poller functions; all packets belonging to a given five-tuple are handled on the same thread, eliminating cross-core contention.
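A minimal sketch of this dispatch rule, with names and a hash of our own choosing rather than Polar's: the five-tuple hash pins a flow to one owner thread, and each thread cycles through its non-blocking pollers to completion.

```cpp
// Illustrative sketch of "one five-tuple, one thread" dispatch; the names
// and hash are ours, not Polar's. Because a flow's packets always land on
// the owning thread, its connection state is thread-local and lock-free.
#include <cstdint>
#include <functional>
#include <vector>

struct FiveTuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

// Map a flow to the polling thread that owns all of its state.
inline unsigned owner_thread(const FiveTuple &t, unsigned num_threads) {
    uint64_t h = (uint64_t)t.src_ip * 0x9E3779B97F4A7C15ULL;
    h ^= ((uint64_t)t.dst_ip << 16) ^ ((uint64_t)t.src_port << 8)
         ^ t.dst_port ^ t.proto;
    return (unsigned)(h % num_threads);
}

// Run-to-completion: each thread loops over its non-blocking pollers
// (NIC RX, TCP timers, TX, RPC dispatch) and never sleeps or blocks.
void polling_thread(std::vector<std::function<void()>> &pollers) {
    for (;;)
        for (auto &poll : pollers) poll();
}
```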
Zero‑Copy
Send side: the application registers memory with Polar-TCP, establishing virtual-to-DMA mappings; only pointers and lengths are passed down the stack, avoiding data copies.
Receive side: the NIC delivers packets directly to BRPC as IOBuf objects, preserving zero-copy through the stack.
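The pointer-plus-length idea can be illustrated with brpc's butil::IOBuf, whose append_user_data call attaches application memory by reference; the buffer management around it is an assumption, not Polar's implementation.

```cpp
// Sketch of the pointer-plus-length idea using brpc's real butil::IOBuf
// API; the buffer management around it is illustrative, not Polar's code.
#include <butil/iobuf.h>
#include <cstdlib>

int main() {
    // Application-owned memory; in Polar-TCP this region would be
    // registered with the stack so the NIC can DMA from it directly.
    const size_t len = 4096;
    void *payload = malloc(len);

    butil::IOBuf buf;
    // Attach the memory by reference: only a pointer and a length flow
    // through the RPC and stack layers, never a byte-by-byte copy. The
    // deleter runs when the last IOBuf reference is gone.
    buf.append_user_data(payload, len, [](void *p) { free(p); });
    // buf can now be handed to the RPC layer or protocol stack as-is.
    return 0;
}
```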
Compatibility
Remains fully compatible with kernel TCP in single-node deployments, allowing gradual migration.
Provides polar_socket, polar_send, and polar_recv APIs for minimal code changes (see the sketch below).
Retains only essential FreeBSD TCP logic, delegating non-TCP traffic to the kernel.
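To make "minimal code changes" concrete, here is a hypothetical before/after: the polar_socket/polar_send/polar_recv names come from the article, while polar_connect and all signatures (mirroring BSD sockets) are our assumption.

```cpp
// Hypothetical usage of the polar_* APIs named above. The function names
// are from the article; the BSD-socket-shaped signatures are an assumption.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

extern "C" {  // assumed declarations; the real header is not shown
int     polar_socket(int domain, int type, int protocol);
int     polar_connect(int fd, const struct sockaddr *addr, socklen_t len);
ssize_t polar_send(int fd, const void *buf, size_t len, int flags);
ssize_t polar_recv(int fd, void *buf, size_t len, int flags);
}

int main() {
    // Kernel TCP would be: int fd = socket(AF_INET, SOCK_STREAM, 0);
    int fd = polar_socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in peer{};
    peer.sin_family      = AF_INET;
    peer.sin_port        = htons(8000);
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    polar_connect(fd, (const sockaddr *)&peer, sizeof(peer));

    const char msg[] = "ping";
    polar_send(fd, msg, sizeof(msg), 0);
    char reply[128];
    polar_recv(fd, reply, sizeof(reply), 0);
    return 0;
}
```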
3. Engineering Deployment and Performance Results
3.1 Benchmarks
Tests were run on AMD EPYC 7W83 CPUs with Linux 6.6 and Mellanox ConnectX-6 Dx NICs. With Polar-TCP on both ends (p2p), 4 KB random-write IOPS is 4.9× and random-read IOPS 3.4× that of kernel TCP on both ends (k2k). Latency is higher at small iodepth but acceptable for high-throughput workloads.
3.2 Real‑World Use Cases
Cloud Disk Service (CDS): Polar‑TCP boosts 4 KB random write IOPS >2×, achieving million‑level IOPS on 8 cores.
Cloud File Storage (CFS): Polar‑RDMA raises QPS 16% and cuts latency >50%.
Distributed KV/Compute platform: containerized Polar‑TCP improves single‑core QPS >60%.
GaiaDB: Polar‑TCP increases QPS >2.5× and reduces latency >50%.
4. Future Outlook
Polar demonstrates that workload-centric customization of the software stack can unlock the full potential of modern hardware. By offering both TCP and RDMA paths, it lets engineers choose the right trade-off between compatibility and extreme performance, paving the way for scalable, high-performance cloud-native networking.
Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.
