How Baidu’s UNP Platform Supercharges Load‑Balancing to 1 Tbps
This article explains the limitations of traditional X86‑DPDK load‑balancing gateways and how Baidu’s third‑generation Universal Networking Platform (UNP) combines programmable ASICs, CPUs, and FPGA acceleration to deliver multi‑terabit throughput, ultra‑low latency, and dramatically lower cost and power consumption.
Background
Load‑balancing gateways are critical cloud networking infrastructure, providing high‑performance forwarding for a wide range of application services. Most cloud gateways are built on X86 CPUs with DPDK on general‑purpose servers. Baidu Intelligent Cloud’s BGW has evolved from 10 Gbps of single‑machine throughput in 2012 to 200 Gbps today, becoming one of the most widely used gateways.
With growing business demands, several challenges have emerged:
Single‑core compute limits: To avoid packet reordering, a flow must be processed on the same CPU core, but single‑core performance has plateaued at 10‑20 Gbps under ideal conditions. Multiple high‑volume flows sharing a core cause contention, reducing overall throughput and potentially causing packet loss.
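The single‑core ceiling follows directly from how NICs distribute packets: an RSS‑style hash of the flow's 5‑tuple pins every packet of a flow to one fixed core, so one elephant flow can never use more than one core's capacity. A minimal sketch of that mapping (the CRC32 hash and core count here are illustrative assumptions, not the actual NIC algorithm):

```python
import zlib

NUM_CORES = 8  # assumed number of worker cores

def rss_core(src_ip: str, dst_ip: str, src_port: int,
             dst_port: int, proto: int) -> int:
    """Map a flow's 5-tuple to a fixed CPU core, RSS-style.

    Every packet of the same flow hashes to the same core, which
    preserves packet order -- but also means a single elephant flow
    is capped at one core's 10-20 Gbps throughput ceiling.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % NUM_CORES

# All packets of this flow land on the same core, no matter the rate:
flow = ("10.0.0.1", "10.0.0.2", 40000, 80, 6)
assert rss_core(*flow) == rss_core(*flow)
```

Two heavy flows that happen to hash to the same core then contend for it, which is the throughput‑collapse scenario described above.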
Unstable latency: Software processing incurs higher latency than hardware forwarding. A packet traverses NIC → PCIe → DPDK driver → gateway software → DPDK driver → NIC. Measured average latency is 30‑50 µs under normal load, with tails exceeding 100 µs under high load, and occasional ms‑level spikes.
High TCO for large‑bandwidth scenarios: Adding more CPU cores does not linearly increase throughput due to I/O bottlenecks and cache limitations. Even with 64‑core AMD Milan servers, scaling beyond 32 cores yields little gain, and achieving 10 Tbps would require 50‑100 servers.
Consequently, a pure X86 software gateway cannot meet the increasing demand for higher throughput, lower latency, and reduced packet loss.
Solution
Baidu Intelligent Cloud introduced the third‑generation programmable gateway platform – UNP (Universal Networking Platform). UNP fuses X86 CPUs, programmable switch ASICs, and FPGA accelerator cards into an extensible heterogeneous gateway. Compared with the traditional X86 software gateway, UNP offers:
ASIC‑level bandwidth (terabit class) for fast‑path forwarding.
Hybrid operation: hardware switch + X86 CPU supports both hardware and software gateway functions, providing flexibility and hyper‑convergence.
Expandable slots for additional hardware acceleration.
In January 2023, Baidu launched the programmable load‑balancing UNP‑BGW 1.0, which addresses large‑bandwidth, large‑flow, and low‑latency requirements.
Architecture
The UNP‑BGW 1.0 consists of two parts: the X86 gateway and the programmable switch ASIC.
The X86 side continues to use DPDK for the control plane, routing, session management, and non‑offloaded packet forwarding, effectively acting as a dual‑NUMA X86‑BGW.
Two NICs appear as standard network cards in user space and connect directly to the programmable ASIC. Virtual NICs (Vnic0‑VnicN), created by the ASIC driver, handle routing‑protocol packet I/O and packet capture.
Packet forwarding follows two paths:
Fast‑Path: Sessions that hit the ASIC are forwarded in hardware, delivering terabit‑class throughput and microsecond‑level latency.
Slow‑Path: Missed sessions are sent to the CPU for processing; based on configuration, sessions may later be offloaded to the fast‑path.
When a new flow arrives, the ASIC checks for an existing session. If none exists, the packet follows the Slow‑Path to the CPU for session creation. Periodically, BGW evaluates session traffic; once bandwidth or packet‑per‑second thresholds are met, the session is offloaded to the ASIC for hardware forwarding. Idle sessions are aged out by the CPU to free hardware resources.
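The periodic offload‑and‑age pass described above can be sketched as follows. This is a simplified model, not Baidu's implementation: the threshold values, entry layout, and timeout are all hypothetical placeholders.

```python
import time
from dataclasses import dataclass, field

# Hypothetical policy knobs -- real values are configuration-dependent.
OFFLOAD_BPS = 1_000_000_000   # offload sessions sustaining >= 1 Gbps
OFFLOAD_PPS = 100_000         # ... or >= 100 kpps
IDLE_TIMEOUT_S = 30           # age out sessions idle this long

@dataclass
class Session:
    bytes_per_s: float = 0.0
    pkts_per_s: float = 0.0
    last_seen: float = field(default_factory=time.monotonic)
    offloaded: bool = False   # True once installed in the ASIC fast-path

def evaluate(sessions: dict) -> None:
    """One CPU-side evaluation pass over the session table.

    Idle sessions are aged out to free hardware resources; hot
    sessions that cross a bandwidth or packet-rate threshold are
    marked for offload into the ASIC's fast-path table.
    """
    now = time.monotonic()
    for key in list(sessions):
        s = sessions[key]
        if now - s.last_seen > IDLE_TIMEOUT_S:
            del sessions[key]          # reclaim the session entry
        elif not s.offloaded and (s.bytes_per_s * 8 >= OFFLOAD_BPS
                                  or s.pkts_per_s >= OFFLOAD_PPS):
            s.offloaded = True         # install entry in hardware table
```

The key design point is that offload is driven by observed traffic rather than done eagerly at session creation: the limited ASIC table is reserved for flows that actually justify hardware forwarding, while short‑lived or low‑rate sessions stay on the CPU slow‑path.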
Product Benefits
Capacity: Single‑machine bandwidth increased >5×, from 200 Gbps to >1 Tbps.
Latency: Average forwarding latency reduced >20×; under high load, latency drops from ~100 µs to <4 µs with no jitter.
Packet loss: Reduced from 10⁻⁵ to 10⁻⁹, greatly improving reliability.
Cost: Higher per‑machine throughput lowers the number of required servers.
Power: Fewer machines for terabit throughput cut overall energy consumption by >50%, contributing to carbon reduction.
Typical Case
A storage service customer frequently generated a single “elephant flow” (~15 Gbps). Using an X86‑BGW cluster, the gateway’s CPU utilization reached 90%, throttling other traffic.
After switching to UNP‑BGW, the same flow achieved 16 Gbps while CPU usage dropped below 1%.
Deployment & Future
UNP‑BGW 1.0 is already used to accelerate Baidu Object Storage (BOS) services. The ASIC’s session table is limited to a few hundred megabytes; adding FPGA accelerator cards can expand table capacity. Baidu is preparing UNP‑BGW 2.0 with higher offload capability to support millions of sessions and multi‑terabit bandwidth.
Baidu Intelligent Cloud Tech Hub