
How Polar‑TCP Breaks Kernel Network Bottlenecks for Cloud‑Native High‑Performance Services

This article explains how traditional kernel network stacks struggle with high‑concurrency, low‑latency cloud data‑center workloads and introduces Baidu Intelligent Cloud’s Polar solution—Polar‑TCP and Polar‑RDMA—which combine user‑space DPDK drivers, a lightweight TCP stack, and an industrial RPC framework to achieve near‑RDMA performance while preserving compatibility with existing TCP ecosystems.

Baidu Geek Talk

Industry Background and Challenges

The traditional kernel network protocol stack was designed for generality, compatibility, and isolation, exposing APIs like BSD sockets to user space. While this design ensures broad compatibility, it introduces frequent system calls, context switches, multiple data copies, lock contention, and cache invalidations, which become severe performance bottlenecks in modern cloud data‑center scenarios that demand high concurrency, high bandwidth, and low latency.

In cloud‑native environments, these bottlenecks prevent the kernel stack from being optimized for specific workloads such as distributed storage, databases, and AI services.

Solution keywords: data‑plane kernel bypass (user‑space stack + DPDK driver) + ecosystem compatibility (non‑TCP traffic still goes through the kernel) → performance breakthrough with ecosystem compatibility.

Polar’s Birth: A Scalable Path Between Generality and Extreme Performance

Design Idea

Polar provides two core components:

Polar‑TCP: builds a full user‑space link (DPDK driver + user‑space protocol stack + Baidu Remote Procedure Call – BRPC) to achieve near‑RDMA performance while remaining compatible with the existing TCP ecosystem.

Polar‑RDMA: offers a hardware‑accelerated path for controlled network environments that demand extreme performance.

Polar‑TCP is the primary focus, delivering kernel bypass, zero‑copy, and high‑throughput, low‑latency networking for cloud services.

Underlying Architecture

Polar‑TCP consists of three layers:

DPDK: direct NIC access, hardware offloads (TSO, GRO, checksum), and high‑speed packet I/O.

Polar‑TCP stack: a lightweight, user‑space TCP implementation based on FreeBSD, using a polling‑based Run‑To‑Completion (RTC) model, lock‑free processing, and end‑to‑end zero‑copy.

BRPC: an industrial RPC framework that integrates with the user‑space stack, providing zero‑copy I/O buffers (IOBuf) and optimized serialization.

DPDK – Direct Hardware Access

DPDK eliminates kernel interrupts and copies, allowing packets to flow directly between the NIC and user space, reducing latency and CPU load.

Polling + RTC Thread Model

Polar‑TCP runs multiple polling threads, each executing a fixed set of poller functions in a loop. Every TCP connection is bound to exactly one thread and processed there to completion, avoiding cross‑core contention and locks.

Lock‑Free Processing

Per‑thread port allocation and flow‑affinity ensure that packets of the same 5‑tuple are always handled by the same thread.

Shared‑nothing design gives each thread exclusive resources, eliminating global locks.

End‑to‑End Zero‑Copy

On the send path, applications register memory with Polar‑TCP, creating a DMA mapping; only pointers and lengths are passed to the NIC, avoiding data copies. On the receive path, packets are wrapped in IOBuf objects and delivered directly to BRPC without extra copies.

Event Notification Mechanism

Polar‑TCP generates events when a connection’s send buffer becomes writable or when data arrives, allowing the polling threads to process only active connections.

Compatibility Support

Single‑sided deployment: a Polar‑TCP endpoint interoperates fully with kernel‑TCP peers, so one side can migrate without touching the other.

Provides socket‑like APIs (polar_socket, polar_send, polar_recv) for minimal code changes.

Retains core FreeBSD TCP logic while delegating non‑TCP traffic to the kernel.

BRPC – High‑Performance Application Delivery

BRPC is re‑engineered to align with Polar‑TCP’s polling model, using IOBuf chains for zero‑copy data flow, optimized serialization (FlatBuffers), and streamlined health‑check logic. It also supports a Polar‑RDMA path for controlled networks.

Performance Results

Benchmarks on AMD EPYC 7W83 with Mellanox ConnectX‑6 Dx show:

4 KB random write IOPS: Polar‑TCP (p2p) achieves 4.9× the kernel‑TCP baseline.

4 KB random read IOPS: Polar‑TCP (p2p) achieves 3.4× the kernel‑TCP baseline.

Average latency for 4 KB operations is higher than the kernel‑TCP baseline in these benchmarks because the measured path traverses a deeper stack, but the throughput gains are significant.

Real‑world deployments demonstrate:

CDS (Cloud Disk Service): >2× IOPS improvement, reaching millions of IOPS on 8 cores.

CFS (Cloud File Storage) with Polar‑RDMA: 16% QPS increase, >50% latency reduction.

Distributed KV/compute platform (XBOX): >60% QPS boost on a single core.

GaiaDB (cloud database): >2.5× QPS and >50% latency reduction.

Future Outlook

Polar exemplifies a workload‑centric approach: instead of a one‑size‑fits‑all kernel stack, it provides a specialized, high‑performance path for latency‑sensitive services while preserving compatibility for broader workloads. Ongoing work will extend Polar to new DPUs, cloud‑native databases, and large‑model AI training, making high‑performance networking a ubiquitous foundation for intelligent applications.

Polar architecture diagram