Understanding RDMA High‑Performance Networks: Principles, Benefits, and Applications in Machine Learning
The article explains the background, architecture, and performance advantages of RDMA high‑performance networking, compares it with traditional TCP/IP, describes its deployment at Baidu for machine‑learning workloads, and outlines future use cases such as storage acceleration, GPU communication, and core services.
With the explosive growth of data, many fields demand higher computational hardware performance. While CPUs and GPUs have advanced rapidly, traditional Ethernet has lagged and become a performance bottleneck, prompting a push for higher‑bandwidth, lower‑latency networks.
For engineers on the front line, common frustrations include long data‑transfer times, servers waiting on network I/O, and the perception that the network is holding back overall performance.
This article introduces RDMA (Remote Direct Memory Access) high‑performance networking, analyzes its performance benefits, showcases successful machine‑learning applications, and discusses future prospects.
What is RDMA high‑performance networking?
Traditional networking concepts like the OSI model, HTTP, TCP, and IP are familiar, but RDMA takes a different approach. First, DMA (Direct Memory Access) lets a peripheral transfer data directly to or from host memory without the CPU copying each byte, freeing CPU resources. RDMA extends this idea across the network: one server's NIC can read or write memory on another server directly, achieving high bandwidth, low latency, and low CPU usage without involving the remote application in the data‑transfer process.
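The one‑sided semantics described above can be sketched as a toy model. This is an illustration only, not a real verbs API: the names (ToyNIC, MemoryRegion, rkey handling) are invented for the example. The key point it captures is that the target registers a memory region in advance, and the initiator's reads and writes are then serviced by the target's NIC without invoking the target's application code.

```python
class MemoryRegion:
    """A registered (pinned) buffer, addressable by (rkey, offset)."""
    def __init__(self, size):
        self.buf = bytearray(size)


class ToyNIC:
    """Stands in for an RDMA-capable NIC on one host."""
    def __init__(self):
        self.regions = {}        # rkey -> MemoryRegion
        self._next_rkey = 1

    def register(self, size):
        """Register a buffer and hand out a remote key (rkey)."""
        rkey = self._next_rkey
        self._next_rkey += 1
        self.regions[rkey] = MemoryRegion(size)
        return rkey

    # One-sided operations: performed against the target NIC's
    # registered memory, with no callback into the target application.
    def rdma_write(self, rkey, offset, data):
        self.regions[rkey].buf[offset:offset + len(data)] = data

    def rdma_read(self, rkey, offset, length):
        return bytes(self.regions[rkey].buf[offset:offset + length])


# Host B registers memory once; host A then writes into it directly.
nic_b = ToyNIC()
rkey = nic_b.register(64)
nic_b.rdma_write(rkey, 0, b"hello")          # issued on behalf of host A
assert nic_b.rdma_read(rkey, 0, 5) == b"hello"
```

In a real deployment the rkey and buffer address are exchanged out of band during connection setup, and the operations are executed in NIC hardware rather than Python.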
Key performance advantages of RDMA include:
Zero Copy: Eliminates data copies between user space and kernel space, significantly reducing transmission latency.
Kernel Bypass and Protocol Offload: Removes kernel processing of packet headers, further lowering latency and CPU consumption.
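The two advantages above can be made concrete with a small latency budget. The numbers below are assumed round figures for illustration, not measurements; the point is which terms each optimization removes from the send path.

```python
# Illustrative small-message latency budget (assumed round numbers).
tcp_path = {
    "send()/recv() syscalls":      3.0,  # removed by kernel bypass
    "kernel TCP/IP processing":    8.0,  # removed by protocol offload
    "user<->kernel buffer copies": 2.0,  # removed by zero copy
    "NIC DMA + wire":              2.0,
}
rdma_path = {
    "NIC DMA + wire":              2.0,
    "doorbell + completion poll":  1.0,  # user-space work-queue handling
}

print(f"TCP  ~{sum(tcp_path.values()):.0f} us")   # ~15 us
print(f"RDMA ~{sum(rdma_path.values()):.0f} us")  # ~3 us
```

For small messages the fixed per‑message costs dominate, which is why eliminating syscalls, kernel protocol processing, and copies yields the large latency reductions discussed below.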
RDMA is not new; the earliest implementation, InfiniBand, has long been used in high‑performance computing. However, InfiniBand requires specialized, expensive hardware. Ethernet‑compatible RDMA protocols such as RoCE (RDMA over Converged Ethernet) and iWARP have emerged. After evaluating performance and availability, the author’s team adopted RoCE, which offers InfiniBand‑level performance while leveraging standard Ethernet and IP, simplifying deployment and reducing cost.
Why high‑performance networking matters
In the performance hierarchy of modern systems, the network sits between disk storage and main memory. As server memory grows and more data is served from memory rather than disk, the network increasingly becomes the bottleneck. Compared with 10 GbE, a 40 Gb RDMA network quadruples raw bandwidth and reduces small‑packet latency by one to two orders of magnitude.
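The bandwidth side of that comparison is simple line‑rate arithmetic (idealized: raw link speed only, ignoring protocol overhead). The latency gains, by contrast, come from kernel bypass and zero copy rather than from the faster link.

```python
def transfer_seconds(payload_bytes, link_gbps):
    """Idealized time to move a payload at raw line rate."""
    return payload_bytes * 8 / (link_gbps * 1e9)

one_gib = 1 << 30
t_10g = transfer_seconds(one_gib, 10)   # ~0.86 s
t_40g = transfer_seconds(one_gib, 40)   # ~0.21 s

# Raw line rate accounts for a 4x bandwidth gain; the 1-2 order-of-
# magnitude small-packet latency gain comes from the RDMA data path.
assert abs(t_10g / t_40g - 4.0) < 1e-9
```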
How to use a high‑performance network
Most existing applications require porting to take advantage of RDMA; they cannot simply replace hardware and run unchanged code. While MPI (Message Passing Interface) is widely used in HPC, its applicability to many business workloads is limited. To ease adoption, the team co‑developed a socket‑like programming interface that retains RDMA’s performance benefits while simplifying migration.
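The article does not publish the actual interface, but a socket‑like shim might look roughly like the following. Everything here (the `RdmaSocket` name, the loopback transport) is invented for illustration; a real implementation would map connect/send/recv onto queue pairs and verbs operations.

```python
class RdmaSocket:
    """Socket-like facade: the calls an app already makes
    (connect/send/recv), with an RDMA transport underneath."""

    def __init__(self):
        self._rx = bytearray()   # stands in for a receive/completion queue

    def connect(self, addr):
        # Real shim: resolve addr, create a queue pair, exchange
        # memory keys out of band. Here it just records the peer.
        self._peer = addr

    def send(self, data: bytes) -> int:
        # Real shim: post a work request, zero-copy from a registered
        # buffer. Loopback delivery here, purely for illustration.
        self._rx += data
        return len(data)

    def recv(self, n: int) -> bytes:
        out, self._rx = bytes(self._rx[:n]), self._rx[n:]
        return out


# Existing socket code ports by swapping the constructor:
s = RdmaSocket()
s.connect(("10.0.0.2", 4791))    # 4791 is the RoCEv2 UDP port
assert s.send(b"gradient chunk") == 14
assert s.recv(8) == b"gradient"
```

The design rationale is migration cost: code written against BSD sockets keeps its control flow and only changes how the connection object is constructed.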
RDMA deployment at Baidu
Since around 2014, Baidu has deployed both InfiniBand and RoCE clusters. In 2015, large‑scale RoCEv2 clusters were rolled out in the SZWG and YQ01 data centers, supporting deep‑learning, speech‑recognition, and natural‑language‑processing tasks. Today, roughly 600 servers form the largest RoCEv2 network in China.
RDMA’s high bandwidth, low latency, and low CPU usage have proven valuable for machine‑learning workloads. For example, OpenMPI over 40 Gb RDMA yields a 10‑fold speedup over 10 Gb TCP for speech‑recognition training and NLP translation. Paddle‑based image training using the custom socket library also sees significant gains, with OpenMPI benchmarks showing 1‑2 orders of magnitude improvement.
Future outlook for RDMA beyond machine learning
The team is exploring additional use cases:
Accelerating storage systems (e.g., iSCSI, Samba, NVMe, Hadoop) by leveraging RDMA’s high bandwidth and low latency while offloading CPU work.
Speeding up GPU heterogeneous‑compute communication through zero‑copy, reducing inter‑GPU transfer latency.
Boosting core services, as RDMA integrates seamlessly with existing Ethernet/IP infrastructure, enabling broader adoption across the company’s services.
GDR (GPU Direct RDMA) has been under development since 2014 and is now maturing. Recent tests show that OpenMPI + GDR dramatically reduces cross‑node GPU transfer latency and approaches the theoretical bandwidth limit, promising further acceleration for heterogeneous computing.
In the coming months, Baidu plans to gradually roll out RDMA acceleration to more business lines, with the systems team providing easier‑to‑use network interfaces for all services.
Recommended reading:
InfiniBand Architecture and Technical Practice Summary
RDMA Technical Principles and Comparison of Three Main Implementations
Detailed Explanation of RDMA (Remote Direct Memory Access) Architecture