
Comprehensive Survey of Large‑Scale RDMA Technologies and Practices

This article provides a detailed overview of large‑scale RDMA technology, covering basic concepts, major protocols, network‑level techniques such as congestion control, lossless‑to‑lossy evolution and multipath, virtualization, communication libraries for AI training and storage, performance tuning, monitoring, and real‑world deployment experiences.

Architects' Tech Alliance

Recent efforts aim to systematically organize large‑scale RDMA technologies used in industry or with strong potential, offering a reference for practitioners and a way to clarify the overall landscape.

1. RDMA Basic Concepts

This section introduces the three RDMA protocols (InfiniBand, RoCE v2, and iWARP) and compares RDMA with TCP in terms of protocol offload, zero copy, and OS bypass. It then describes the RDMA primitives (Send, Receive, RDMA Read, RDMA Write, Atomic), the transport types (RC, UC, UD), and the core concepts (SR, RR, CQ, MR, MW, PD). For further reading it points to the RDMA Aware Networks Programming User Manual and a simple RDMA tutorial.

2. Key Network Technologies for Large‑Scale RDMA

2.1 Congestion‑Control Algorithms

Traditional RDMA relies on Priority Flow Control (PFC) to avoid packet loss, but PFC’s coarse granularity can cause congestion spreading, deadlock, and unfairness. Per‑flow congestion‑control algorithms are essential to mitigate these issues.

DCQCN – rate‑based algorithm used by Mellanox, based on ECN marks (SIGCOMM 2015).

TIMELY and Swift – RTT‑based algorithms deployed in Google data centers: TIMELY is rate‑based (SIGCOMM 2015) and Swift is window‑based (SIGCOMM 2020).

HPCC – window‑based algorithm using INT, proposed by Alibaba (SIGCOMM 2019).

Other vendor algorithms – variants of DCTCP on Broadcom NICs.
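As an illustration of how a rate‑based, ECN‑driven algorithm reacts, the sender side of DCQCN can be sketched as below. This is a minimal sketch based on the update rules published in the DCQCN paper, not vendor firmware: the class name, method names, and the gain value g = 1/256 are assumptions chosen for the example, and the timers and additive/hyper increase phases are left out.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Toy DCQCN reaction point (sender). Names are illustrative, not a real API."""

    rate: float          # current sending rate Rc, e.g. in Gbit/s
    target: float        # target rate Rt that recovery climbs back toward
    alpha: float = 1.0   # running estimate of congestion severity, in [0, 1]
    g: float = 1 / 256   # gain of the alpha moving average (assumed value)

    def on_cnp(self) -> None:
        """A Congestion Notification Packet arrived: remember Rc, then cut it."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNP arrived during the alpha timer: decay the congestion estimate."""
        self.alpha *= 1 - self.g

    def on_recovery_step(self) -> None:
        """Fast recovery: move the current rate halfway back toward the target."""
        self.rate = (self.rate + self.target) / 2


sender = DcqcnSender(rate=100.0, target=100.0)
sender.on_cnp()            # first CNP with alpha = 1 halves the rate
sender.on_recovery_step()  # one recovery step closes half of the gap
```

Because alpha decays while no CNPs arrive, a later isolated CNP causes a much smaller cut than the first one, which is what makes the reaction per‑flow and proportional rather than coarse like PFC.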

2.2 Evolution from Lossless to Lossy Networks

RDMA originally ran over InfiniBand, which uses credit‑based flow control. In data‑center IP networks (RoCE v2), lossless operation has been required because older NICs handle packet loss poorly. PFC provides that lossless behavior, but its coarse granularity brings the congestion spreading, deadlock, and unfairness described above. Newer NICs (ConnectX‑6, ConnectX‑7) add efficient loss recovery and end‑to‑end flow control, making lossy networks viable (SIGCOMM 2018).
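To see why the loss‑recovery scheme matters so much, the sketch below compares how many packets are put on the wire under go‑back‑N and under selective retransmission. It is a simplified model under stated assumptions: the older scheme is taken to be go‑back‑N, the sender is assumed to have the rest of the message in flight when a loss is detected, and each listed packet is dropped once, the first time the receiver would otherwise have accepted it. The function names are hypothetical.

```python
def go_back_n_transmissions(total: int, lost: set[int]) -> int:
    """Packets sent when every loss forces a resend from the lost packet onward."""
    pending = sorted(p for p in lost if 0 <= p < total)
    sent = 0
    start = 0
    while True:
        # The sender pushes everything from `start` to the end of the message.
        sent += total - start
        if not pending:
            return sent
        # The receiver discarded everything after the next loss; restart there.
        start = pending.pop(0)


def selective_transmissions(total: int, lost: set[int]) -> int:
    """Packets sent when only the lost packets themselves are resent."""
    return total + sum(1 for p in lost if 0 <= p < total)


# One early loss in a 1000-packet message.
gbn = go_back_n_transmissions(1000, {10})
sel = selective_transmissions(1000, {10})
```

In this model a single early loss nearly doubles the traffic under go‑back‑N, while selective retransmission adds one packet, which is why a NIC with efficient loss recovery can tolerate a lossy fabric.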

2.3 RDMA Multipath Transmission

Fat‑Tree/Clos topologies provide multiple equal‑cost paths, but ECMP’s stateless hashing can cause path collisions. Solutions include:

Hardware‑assisted multipath RDMA (Microsoft research, NSDI 2018).

Software‑managed sub‑flows (flowlet/flowcell) with reordering handled in drivers (Amazon SRD).

Intelligent routing based on workload characteristics (e.g., AI training traffic).
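The path‑collision problem with ECMP can be illustrated with a small sketch. It assumes a SHA‑256 hash over the five‑tuple purely for illustration (real switches use their own hardware hash functions), and the addresses and source ports are made up; 4791 is the RoCE v2 UDP destination port. The helper names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def ecmp_path(five_tuple: tuple, num_paths: int) -> int:
    """Pick a path from the flow's five-tuple alone, with no knowledge of load."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_paths


def path_loads(flows: list[tuple], num_paths: int) -> Counter:
    """Count how many flows land on each path."""
    return Counter(ecmp_path(flow, num_paths) for flow in flows)


def even_spread_probability(n: int) -> float:
    """Chance that n flows hashed uniformly onto n paths all get distinct paths."""
    return math.factorial(n) / n ** n


# Nine long-lived flows between the same two hosts, differing only in source port.
flows = [("10.0.0.1", "10.0.1.1", 49152 + i, 4791, "UDP") for i in range(9)]
loads = path_loads(flows, num_paths=8)
print(sorted(loads.items()))
print(f"P(8 flows on 8 distinct paths) = {even_spread_probability(8):.4f}")
```

Even with as many paths as flows, a uniform hash spreads eight flows perfectly only about 0.24% of the time. RDMA workloads tend to consist of a few large flows, so a single collision can halve the throughput of the flows involved, which is what the multipath approaches above try to avoid.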

3. RDMA Virtualization

Key works for cloud deployment: Microsoft FreeFlow (NSDI 2019) and Huawei MasQ (SIGCOMM 2020).

4. RDMA Communication Libraries

4.1 AI Training

Collective communication libraries such as NVIDIA NCCL, Facebook Gloo, and vendor‑specific variants (Google, Microsoft, AWS OFI, Huawei HCCL, Alibaba ACCL) enable high‑performance training.
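To make the term "collective communication" concrete, here is a single‑process simulation of ring all‑reduce, one well‑known algorithm for the all‑reduce collective. It is a sketch for intuition only and does not claim to reflect how any of the libraries above are implemented; the function name is hypothetical, and each rank's data is assumed to be split into exactly as many chunks as there are ranks.

```python
def ring_all_reduce(data: list[list[float]]) -> list[list[float]]:
    """Sum the per-rank vectors so that every rank ends with the full sum.

    data[r][c] is chunk c held by rank r; there must be one chunk per rank.
    """
    n = len(data)
    if any(len(row) != n for row in data):
        raise ValueError("each rank needs exactly one chunk per rank")
    bufs = [list(row) for row in data]

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so sends happen "simultaneously".
        msgs = [((r - step) % n, bufs[r][(r - step) % n]) for r in range(n)]
        for r, (idx, value) in enumerate(msgs):
            bufs[(r + 1) % n][idx] += value

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, bufs[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (idx, value) in enumerate(msgs):
            bufs[(r + 1) % n][idx] = value

    return bufs


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)  # every rank holds [111, 222, 333]
```

Each rank only ever talks to its ring neighbor and sends 2(n-1) chunks in total, so the per‑rank traffic stays close to twice the vector size regardless of the number of ranks. That steady neighbor‑to‑neighbor pattern maps well onto RDMA connections.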

4.2 Storage

Middleware examples include Alibaba X‑RDMA (CLUSTER 2019) and NVMe over Fabrics (NVMe‑oF).

4.3 Practical Tips

This part collects practical techniques for key‑value services, remote memory, and in‑memory transaction processing, drawing on systems such as FaRM and Fast In‑memory Transaction Processing.

5. RDMA Performance Optimization, Monitoring, and Operations

5.1 Performance Optimization

RDMA breaks traditional layering, coupling application logic with NIC verbs, exposing PCIe, OS scheduling, and cache management bottlenecks. Tools like Collie (NSDI 2022) use simulated annealing to locate performance anomalies. Practical advice emphasizes NIC selection, cache management, and cross‑layer joint optimization.
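The search idea behind simulated annealing can be sketched generically as follows. This is not Collie's implementation: the workload parameters (message size and number of queue pairs), the scoring function, and all names are hypothetical stand‑ins. In a real setup the score would come from running the workload on hardware and measuring how anomalous the result is.

```python
import math
import random


def anneal(score, start, neighbor, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Maximize score(config) by simulated annealing; returns (best, best_score)."""
    rng = random.Random(seed)
    current = best = start
    current_score = best_score = score(start)
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_score = score(cand)
        delta = cand_score - current_score
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = cand, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
        temp *= cooling
    return best, best_score


def toy_anomaly_score(cfg):
    """Hypothetical score that peaks at 48 KB messages and 200 queue pairs."""
    msg_kb, num_qps = cfg
    return -(abs(msg_kb - 48) + abs(num_qps - 200))


def toy_neighbor(cfg, rng):
    """Nudge one workload parameter up or down, clamped to [1, 1024]."""
    values = list(cfg)
    idx = rng.randrange(len(values))
    values[idx] = min(max(values[idx] + rng.choice([-8, 8]), 1), 1024)
    return tuple(values)


best, best_score = anneal(toy_anomaly_score, (4, 8), toy_neighbor)
print(best, best_score)
```

Occasionally accepting a worse configuration lets the search escape local optima. That matters when the parameter space is large and each evaluation means running a real workload.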

5.2 Monitoring and Operations

The second‑level granularity typical of TCP monitoring is not enough for RDMA. Reliable operation requires millisecond‑level metrics such as PFC pause frames, ECN marks, CNPs (Congestion Notification Packets), and NACKs.
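The difference that sampling granularity makes can be shown with a small sketch. The trace is synthetic: a 5 ms burst of PFC pause frames inside an otherwise quiet second. The numbers and the function name are invented for illustration.

```python
def windowed_peak(events_per_ms: list[int], window_ms: int) -> float:
    """Highest average events-per-ms rate seen when sampling every window_ms."""
    if not events_per_ms or window_ms <= 0:
        raise ValueError("need a non-empty trace and a positive window")
    peaks = []
    for i in range(0, len(events_per_ms), window_ms):
        chunk = events_per_ms[i:i + window_ms]
        peaks.append(sum(chunk) / len(chunk))
    return max(peaks)


# Synthetic one-second trace: 200 pause frames per ms for 5 ms, zero otherwise.
trace = [0] * 1000
for ms in range(400, 405):
    trace[ms] = 200

print(windowed_peak(trace, window_ms=1000))  # 1.0 per ms when sampled each second
print(windowed_peak(trace, window_ms=1))     # 200.0 per ms when sampled each ms
```

Sampled once per second, the burst is averaged down to a rate 200 times lower than its true peak and is easy to dismiss as background noise. Sampled per millisecond, it stands out clearly.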

6. Industry RDMA Approaches

Amazon SRD/EFA (IEEE Micro 2020) and other large‑scale deployments.

7. Deployment Experiences

7.1 Microsoft

Microsoft ran one of the early large‑scale RoCE v2 deployments and reported the problems it encountered, including transport livelock, PFC deadlock, PFC pause frame storms, and slow‑receiver symptoms (SIGCOMM 2016).
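PFC deadlock is commonly explained as a cyclic buffer dependency: each switch in a loop is paused by the next one, so none can drain. Under that assumption, checking a snapshot of "who is waiting on whom" for a cycle is a simple way to reason about it. The sketch below uses hypothetical switch names and a plain depth‑first search; it is not a description of Microsoft's tooling.

```python
from typing import Optional


def find_pause_cycle(waits_on: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one cycle in the pause-dependency graph, or None if there is none.

    waits_on[a] lists the switches whose pause frames are holding back switch a.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in waits_on.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge: nxt is already on the current path, so we have a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in waits_on:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Three switches pausing each other in a loop: none of them can make progress.
print(find_pause_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
```

A tree‑like Clos fabric with strict up‑down routing should not produce such a loop in normal operation, which is why deadlocks tend to show up only when something unexpected sends traffic along a path the design did not anticipate.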

7.2 Alibaba

RDMA use in the Pangu storage cluster (NSDI 2021).

8. Summary and Outlook

RDMA has shown huge potential in AI training and high‑performance storage. Future challenges include pod‑to‑pod and full‑network RDMA scaling, seamless migration of existing applications, and low‑overhead diagnostics and monitoring.
