How Baidu’s UNP Programmable Gateway Boosts Load‑Balancing to Tbps Speeds

The article analyzes the limitations of traditional X86‑based software load‑balancing gateways and presents Baidu Cloud’s third‑generation UNP programmable platform, detailing its heterogeneous architecture, fast‑path/slow‑path processing, performance gains, a real‑world case study, and future roadmap.

Baidu Geek Talk

May 22, 2023

Background

Load‑balancing gateways are a critical infrastructure component in cloud networks, providing high‑performance packet forwarding for various services. Historically, most cloud gateways have been built on X86 CPUs combined with DPDK on general‑purpose servers. Baidu Cloud’s BGW (BaiduGateWay) has evolved from a single‑machine 10 Gbps solution in 2012 to a 200 Gbps single‑machine design, becoming one of the most widely used gateways in cloud environments.

Challenges of the Existing X86 Software Gateway

Single‑core processing limits: To avoid packet reordering, a flow must be scheduled to the same CPU core, but single‑core performance has plateaued, capping per‑flow throughput at roughly 10‑20 Gbps even on the latest CPUs. When multiple high‑volume flows share a core, contention reduces overall throughput and can cause probabilistic packet loss.
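The flow-to-core pinning described above can be sketched as a hash over the packet 5-tuple. This is only an illustration: real NICs typically use receive-side scaling with a hardware hash, and the CRC32 hash, the `core_for_flow` name, and the sample addresses here are assumptions, not details from BGW.

```python
import zlib
from collections import Counter

def core_for_flow(src_ip, dst_ip, src_port, dst_port, proto, num_cores):
    """Map a 5-tuple to a core index (hypothetical helper).

    The same flow always hashes to the same core, which preserves packet
    order within the flow. CRC32 is a stand-in for a NIC's hardware hash.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_cores

# Eight made-up flows spread over four cores.
flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 443, "tcp") for i in range(8)]
placement = Counter(core_for_flow(*f, num_cores=4) for f in flows)

# Cores that received more than one flow: any two heavy flows landing
# together must share that single core's capacity.
shared_cores = {core: n for core, n in placement.items() if n > 1}
print(placement, shared_cores)
```

Because the mapping depends only on the hash, nothing prevents two high-volume flows from landing on the same core, which is the contention case the paragraph above describes.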

Latency instability: Software processing adds significant latency compared to hardware forwarding. The packet path includes NIC reception, PCIe transfer to the CPU, DPDK driver processing, application logic, and return via PCIe. Measured average latency is 30‑50 µs under normal load, with tails exceeding 100 µs under heavy load, and occasional millisecond‑scale spikes.

High total cost of ownership (TCO) for large‑bandwidth scenarios: Adding CPU cores does not increase throughput linearly, because gateway forwarding is I/O‑bound and constrained by the cache architecture (e.g., L3 cache). A 64‑core AMD Milan server shows diminishing returns beyond 32 cores, and scaling to 10 Tbps would require 50‑100 servers.
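The server-count estimate can be reproduced with simple arithmetic. In this sketch the 200 Gbps figure is the single-machine bandwidth quoted in the article; the 100 Gbps figure is an assumption used here to illustrate how a lower effective per-server throughput leads to the upper end of the range.

```python
import math

def servers_needed(target_gbps, per_server_gbps):
    """Number of servers required to reach an aggregate target bandwidth."""
    return math.ceil(target_gbps / per_server_gbps)

TARGET_GBPS = 10_000  # 10 Tbps

best_case = servers_needed(TARGET_GBPS, 200)   # nominal 200 Gbps per server
worst_case = servers_needed(TARGET_GBPS, 100)  # assumed 100 Gbps effective

print(best_case, worst_case)  # 50 100
```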

Solution: UNP (Universal Networking Platform)

To meet growing demands, Baidu Cloud introduced the third‑generation programmable gateway platform UNP, which integrates X86 CPUs, programmable ASIC switches, and FPGA acceleration cards into a scalable heterogeneous gateway.

Programmable ASIC provides terabit‑level bandwidth.

Hybrid hardware‑software design supports both hardware‑based and traditional software gateways, offering flexibility and hyper‑convergence.

Expandable slots allow additional hardware acceleration.

In January 2023, Baidu released UNP‑BGW 1.0, a programmable load‑balancing gateway that addresses bandwidth, latency, and packet‑loss challenges.

Architecture Overview

The UNP‑BGW consists of two main parts: the X86 gateway and the programmable ASIC switch. The X86 side continues to use DPDK for control plane, routing, session management, and non‑offloaded traffic, effectively acting as a dual‑NUMA X86‑BGW.

Two NICs appear as standard network interfaces in user space and connect directly to the programmable switch. Virtual network devices (Vnic0‑VnicN) generated by the ASIC driver handle both routing packet I/O and packet capture for diagnostics.

Fast‑Path / Slow‑Path Processing

Fast‑Path: Sessions that hit the ASIC are forwarded in hardware, delivering terabit‑scale throughput and microsecond‑level latency.

Slow‑Path: Missed sessions are sent to the CPU, where policy decisions determine whether to create a new session and offload it to the fast‑path.

When a new flow arrives, the ASIC checks for an existing session; if none exists, the packet follows the slow‑path to the CPU for session creation. Periodically, the BGW evaluates session statistics; flows exceeding bandwidth or packet‑per‑second thresholds are classified as “elephant flows” and offloaded to the ASIC.
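A minimal sketch of this lookup-and-offload loop follows. It models only a bandwidth threshold (the article also mentions packet-per-second thresholds), and the `Gateway` class, its method names, and the 1 Gbps threshold are hypothetical choices for illustration rather than BGW internals.

```python
from dataclasses import dataclass, field

ELEPHANT_BPS = 1_000_000_000  # assumed threshold of 1 Gbps (hypothetical)

@dataclass
class Gateway:
    hw_sessions: set = field(default_factory=set)  # sessions offloaded to the ASIC
    sw_bytes: dict = field(default_factory=dict)   # CPU sessions: flow -> bytes this interval

    def handle_packet(self, flow, size_bytes):
        """Return which path a packet takes."""
        if flow in self.hw_sessions:
            return "fast-path"  # session hit in hardware
        # Miss: the CPU creates or updates the software session.
        self.sw_bytes[flow] = self.sw_bytes.get(flow, 0) + size_bytes
        return "slow-path"

    def evaluate(self, interval_s):
        """Periodic check: offload flows whose rate meets the threshold."""
        offloaded = [
            flow for flow, nbytes in self.sw_bytes.items()
            if nbytes * 8 / interval_s >= ELEPHANT_BPS
        ]
        for flow in offloaded:
            self.hw_sessions.add(flow)
            del self.sw_bytes[flow]
        for flow in self.sw_bytes:  # reset counters for the next interval
            self.sw_bytes[flow] = 0
        return offloaded

gw = Gateway()
gw.handle_packet("small-flow", 1_500)
gw.handle_packet("big-flow", 200_000_000)  # 1.6 Gbit within one interval
print(gw.evaluate(interval_s=1.0))          # ['big-flow']
print(gw.handle_packet("big-flow", 1_500))  # fast-path
```

Small flows stay on the CPU in this model, so only the flows that would otherwise saturate a core consume hardware session entries.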

Sessions are aged out by the CPU when flows terminate or become idle, freeing hardware resources.
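Idle aging can be sketched as a periodic sweep over last-seen timestamps. The `age_out` function, the 30-second timeout, and the sample timestamps are assumptions for illustration; the article does not state how BGW tracks idleness.

```python
def age_out(last_seen, now, idle_timeout_s):
    """Remove sessions idle longer than idle_timeout_s and return their keys.

    last_seen maps a flow key to the timestamp (in seconds) of its last packet.
    """
    expired = [flow for flow, ts in last_seen.items() if now - ts > idle_timeout_s]
    for flow in expired:
        del last_seen[flow]  # frees the (hardware) session entry in this model
    return expired

sessions = {"flow-a": 100.0, "flow-b": 155.0}
removed = age_out(sessions, now=160.0, idle_timeout_s=30.0)
print(removed, sessions)  # ['flow-a'] {'flow-b': 155.0}
```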

Performance Highlights of UNP‑BGW 1.0

Capacity: Single‑machine bandwidth increased >5×, from 200 Gbps to >1 Tbps.

Latency: Average forwarding latency reduced >20×; under high load, tail latency drops from about 100 µs to <4 µs with no jitter.

Packet loss: Loss rate reduced from 10⁻⁴ to 10⁻⁸, dramatically improving reliability.

Cost: Higher per‑machine throughput lowers the number of required servers, cutting deployment cost.

Power: Fewer servers for the same throughput cut overall energy consumption by >50 %, contributing to carbon reduction.

Typical Use Case

A storage‑intensive customer with a 15 Gbps “elephant flow” experienced 90 % CPU utilization on an X86‑BGW cluster, impacting other services. After migrating to UNP‑BGW, the same flow achieved 16 Gbps while CPU usage dropped below 1 %.

UNP‑BGW 1.0 is already used to accelerate Baidu Object Storage (BOS) services.

Future Directions

Current ASIC session tables are limited to a few hundred megabytes; adding FPGA acceleration cards can expand table capacity. Baidu Cloud is preparing UNP‑BGW 2.0 with higher offload capabilities to support millions of concurrent sessions and multi‑terabit bandwidth.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.