How Polar‑TCP Breaks Kernel Network Bottlenecks for Cloud‑Native High‑Performance Services
This article explains how traditional kernel network stacks struggle with high-concurrency, low-latency cloud data-center workloads and introduces Baidu Intelligent Cloud's Polar solution (Polar-TCP and Polar-RDMA), which combines user-space DPDK drivers, a lightweight TCP stack, and an industrial RPC framework to achieve near-RDMA performance while preserving compatibility with the existing TCP ecosystem.
Industry Background and Challenges
The traditional kernel network protocol stack was designed for generality, compatibility, and isolation, exposing APIs like BSD sockets to user space. While this design ensures broad compatibility, it introduces frequent system calls, context switches, multiple data copies, lock contention, and cache invalidations, which become severe performance bottlenecks in modern cloud data‑center scenarios that demand high concurrency, high bandwidth, and low latency.
In cloud‑native environments, these bottlenecks prevent the kernel stack from being optimized for specific workloads such as distributed storage, databases, and AI services.
Solution keywords: data-plane kernel bypass (a user-space stack on a DPDK driver) paired with ecosystem compatibility (non-TCP traffic still traverses the kernel) → a performance breakthrough without breaking the existing ecosystem.
Polar’s Birth: A Scalable Path Between Generality and Extreme Performance
Design Idea
Polar provides two core components:
Polar-TCP: builds a fully user-space data path (DPDK driver + user-space protocol stack + Baidu's RPC framework, BRPC) to achieve near-RDMA performance while remaining compatible with the existing TCP ecosystem.
Polar‑RDMA: offers a hardware‑accelerated path for controlled network environments that demand extreme performance.
Polar‑TCP is the primary focus, delivering kernel bypass, zero‑copy, and high‑throughput, low‑latency networking for cloud services.
Underlying Architecture
Polar‑TCP consists of three layers:
DPDK: direct NIC access, hardware offloads (TSO, GRO, checksum), and high‑speed packet I/O.
Polar‑TCP stack: a lightweight, user‑space TCP implementation based on FreeBSD, using a polling‑based Run‑To‑Completion (RTC) model, lock‑free processing, and end‑to‑end zero‑copy.
BRPC: an industrial RPC framework that integrates with the user‑space stack, providing zero‑copy I/O buffers (IOBuf) and optimized serialization.
DPDK – Direct Hardware Access
DPDK eliminates kernel interrupts and copies, allowing packets to flow directly between the NIC and user space, reducing latency and CPU load.
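To make the bypass concrete, here is a minimal poll-mode receive loop of the kind DPDK enables, assuming port 0 / queue 0 were already configured elsewhere via the standard rte_eth_dev_configure / rte_eth_rx_queue_setup / rte_eth_dev_start sequence; it illustrates the mechanism, not Polar-TCP's actual source:

```cpp
#include <rte_ethdev.h>
#include <rte_mbuf.h>

// Minimal DPDK poll-mode receive loop (illustrative sketch, not Polar-TCP code).
// Assumes the EAL is initialized and port 0 / queue 0 are configured and started.
void rx_poll_loop() {
    constexpr uint16_t kBurst = 32;
    rte_mbuf* pkts[kBurst];
    for (;;) {
        // Pull up to kBurst frames straight off the NIC RX ring: no interrupt,
        // no syscall, no copy into kernel buffers.
        const uint16_t n = rte_eth_rx_burst(/*port_id=*/0, /*queue_id=*/0, pkts, kBurst);
        for (uint16_t i = 0; i < n; ++i) {
            // Hand the raw frame to the user-space TCP stack here, e.g.:
            // process_packet(rte_pktmbuf_mtod(pkts[i], const uint8_t*),
            //                rte_pktmbuf_pkt_len(pkts[i]));
            rte_pktmbuf_free(pkts[i]);
        }
    }
}
```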
Polling + RTC Thread Model
Polar-TCP runs multiple polling threads, each executing a set of registered poller functions. Every TCP connection is processed by exactly one thread, avoiding cross-core contention and locks.
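A minimal sketch of what a run-to-completion worker could look like; Conn and the helper functions are hypothetical stand-ins, since Polar-TCP's internals are not published here:

```cpp
#include <vector>

// Run-to-completion (RTC) worker sketch. Conn and the helpers below are
// hypothetical stand-ins that illustrate the threading model only.
struct Conn { /* per-connection TCP state, owned by exactly one thread */ };

void poll_nic_rx_queue() { /* drain this thread's dedicated NIC RX queue */ }
void run_tcp_and_app_to_completion(Conn*) { /* parse -> TCP state machine -> app callback */ }
void flush_nic_tx_queue() { /* push queued TX descriptors to the NIC */ }

void worker_loop(std::vector<Conn*>& owned) {
    for (;;) {                                 // busy-poll: no blocking syscalls
        poll_nic_rx_queue();
        for (Conn* c : owned)
            run_tcp_and_app_to_completion(c);  // no locks: state is thread-private
        flush_nic_tx_queue();
    }
}
```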
Lock‑Free Processing
Per-thread port allocation and flow affinity ensure that packets of the same 5-tuple are always handled by the same thread (sketched after this list).
Shared‑nothing design gives each thread exclusive resources, eliminating global locks.
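The effect of flow affinity can be sketched as a 5-tuple hash that picks the owning worker; this is illustrative only (real deployments typically obtain the same mapping in hardware via NIC RSS plus per-thread port allocation):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Sketch of flow affinity: hash the 5-tuple to pick the owning worker thread,
// so every packet of a flow lands on the same shared-nothing worker.
struct FiveTuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

std::size_t owning_worker(const FiveTuple& t, std::size_t num_workers) {
    std::size_t h = std::hash<uint32_t>{}(t.src_ip);
    h = h * 31 + std::hash<uint32_t>{}(t.dst_ip);
    h = h * 31 + std::hash<uint16_t>{}(t.src_port);
    h = h * 31 + std::hash<uint16_t>{}(t.dst_port);
    h = h * 31 + std::hash<uint8_t>{}(t.proto);
    return h % num_workers;  // stable mapping: same flow -> same worker, no locks
}
```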
End‑to‑End Zero‑Copy
On the send path, applications register memory with Polar‑TCP, creating a DMA mapping; only pointers and lengths are passed to the NIC, avoiding data copies. On the receive path, packets are wrapped in IOBuf objects and delivered directly to BRPC without extra copies.
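A hedged sketch of the send-path flow; polar_register_memory and polar_send_zc are hypothetical names, since the article only says that applications register DMA-mapped memory and then pass pointers and lengths:

```cpp
#include <cstdint>
#include <vector>

// Zero-copy send-path sketch. polar_register_memory / polar_send_zc are
// hypothetical illustrations of the flow described above: register memory
// once (creating a DMA mapping), then hand the NIC only descriptors.
struct TxDescriptor {
    uint64_t dma_addr;  // address the NIC can DMA from
    uint32_t length;    // payload length; the payload itself is never copied
};

std::vector<TxDescriptor> tx_ring;  // stand-in for a NIC TX descriptor ring

// Hypothetical registration: a real stack pins pages and programs an
// IOMMU/DMA mapping; here we simply treat the virtual address as the IOVA.
uint64_t polar_register_memory(const void* buf) {
    return reinterpret_cast<uint64_t>(buf);
}

// Hypothetical zero-copy send: enqueue (address, length) only.
void polar_send_zc(uint64_t dma_addr, uint32_t len) {
    tx_ring.push_back({dma_addr, len});  // NIC later DMAs straight from app memory
}
```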
Event Notification Mechanism
Polar‑TCP generates events when a connection’s send buffer becomes writable or when data arrives, allowing the polling threads to process only active connections.
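A sketch of how such an event loop might look; PolarEvent and polar_poll_events are hypothetical names, as the article describes only the behavior:

```cpp
#include <cstddef>
#include <cstdint>

// Event-notification sketch. PolarEvent and polar_poll_events() are hypothetical;
// the article says only that events fire when a send buffer becomes writable or
// data arrives, so pollers touch active connections only.
enum class EventType : uint8_t { kReadable, kWritable };
struct PolarEvent { int conn_fd; EventType type; };

// Hypothetical non-blocking poll; a trivial stub so the sketch is self-contained.
std::size_t polar_poll_events(PolarEvent* /*out*/, std::size_t /*max*/) { return 0; }

void event_loop() {
    PolarEvent events[64];
    for (;;) {
        std::size_t n = polar_poll_events(events, 64);
        for (std::size_t i = 0; i < n; ++i) {
            if (events[i].type == EventType::kReadable) {
                // data arrived: drain this connection's receive buffer
            } else {
                // send buffer writable again: resume pending writes
            }
        }
    }
}
```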
Compatibility Support
Supports single-side deployment: a Polar-TCP endpoint interoperates fully with kernel TCP on the peer, enabling seamless migration.
Provides socket-like APIs (polar_socket, polar_send, polar_recv) so existing applications migrate with minimal code changes; see the sketch after this list.
Retains core FreeBSD TCP logic while delegating non‑TCP traffic to the kernel.
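A migration sketch using the API names above; the BSD-socket-style signatures are an assumption, since the article lists only the function names:

```cpp
#include <cstddef>
#include <sys/types.h>  // ssize_t

// The article names these functions but not their signatures; the
// BSD-socket-style prototypes below are an assumption for illustration.
extern "C" {
int     polar_socket(int domain, int type, int protocol);                 // assumed
ssize_t polar_send(int fd, const void* buf, std::size_t len, int flags);  // assumed
ssize_t polar_recv(int fd, void* buf, std::size_t len, int flags);        // assumed
}

// Migrating a kernel-TCP code path is then largely mechanical:
//   int fd = socket(AF_INET, SOCK_STREAM, 0);  ->  int fd = polar_socket(AF_INET, SOCK_STREAM, 0);
//   send(fd, buf, n, 0);                       ->  polar_send(fd, buf, n, 0);
//   recv(fd, buf, n, 0);                       ->  polar_recv(fd, buf, n, 0);
```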
BRPC – High‑Performance Application Delivery
BRPC is re‑engineered to align with Polar‑TCP’s polling model, using IOBuf chains for zero‑copy data flow, optimized serialization (FlatBuffers), and streamlined health‑check logic. It also supports a Polar‑RDMA path for controlled networks.
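To illustrate the IOBuf zero-copy idea, here is a tiny example against Apache bRPC's public butil::IOBuf; Polar's exact integration is not shown in the article:

```cpp
#include <butil/iobuf.h>  // from Apache bRPC

// Minimal IOBuf sketch: data lives in reference-counted blocks, and cutting
// transfers block references rather than copying bytes.
int main() {
    butil::IOBuf buf;
    buf.append("hello, polar");  // stored in an internal refcounted block

    butil::IOBuf head;
    buf.cutn(&head, 5);          // "cut" moves/shares block refs: no memcpy
    // head == "hello", buf == ", polar"; both still share the original block.
    return 0;
}
```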
Performance Results
Benchmarks on an AMD EPYC 7W83 host with a Mellanox ConnectX-6 Dx NIC show:
4 KB random write IOPS: Polar‑TCP (p2p) achieves 4.9× the kernel‑TCP baseline.
4 KB random read IOPS: Polar‑TCP (p2p) achieves 3.4× the kernel‑TCP baseline.
At the deeper queue depths used to reach peak throughput, average latency for 4 KB operations can exceed the kernel-TCP baseline, but the throughput gains are significant.
Real‑world deployments demonstrate:
CDS (Cloud Disk Service): >2× IOPS improvement, reaching millions of IOPS on 8 cores.
CFS (Cloud File Storage) with Polar‑RDMA: 16% QPS increase, >50% latency reduction.
Distributed KV/compute platform (XBOX): >60% QPS boost on a single core.
GaiaDB (cloud database): >2.5× QPS and >50% latency reduction.
Future Outlook
Polar exemplifies a workload‑centric approach: instead of a one‑size‑fits‑all kernel stack, it provides a specialized, high‑performance path for latency‑sensitive services while preserving compatibility for broader workloads. Ongoing work will extend Polar to new DPUs, cloud‑native databases, and large‑model AI training, making high‑performance networking a ubiquitous foundation for intelligent applications.