Comprehensive Survey of Large‑Scale RDMA Technologies and Practices
This article provides a detailed overview of large‑scale RDMA technology, covering basic concepts, major protocols, network‑level techniques such as congestion control, lossless‑to‑lossy evolution and multipath, virtualization, communication libraries for AI training and storage, performance tuning, monitoring, and real‑world deployment experiences.
The goal is to systematically organize the large‑scale RDMA technologies that are already used in industry or show strong potential, giving practitioners a reference and a clearer picture of the overall landscape.
1. RDMA Basic Concepts
This section introduces the three RDMA protocols (InfiniBand, RoCE/RoCE v2, iWARP) and compares RDMA with TCP in terms of protocol offload, zero copy, and OS bypass. It then describes the RDMA primitives (Send, Receive, RDMA Read, RDMA Write, Atomic), the transport types (RC, UC, UD), and the core concepts (SR, RR, CQ, MR, MW, PD), with references to the RDMA Aware Networks Programming User Manual and a simple RDMA tutorial.
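To make the difference between two‑sided and one‑sided primitives concrete, here is a toy in‑memory model. It is only a sketch: `ToyQueuePair`, `send`, and `rdma_write` are hypothetical names invented for illustration and are not part of any verbs library. The model assumes that a Send consumes a receive request posted in advance and produces a completion on both sides, while an RDMA Write places data in remote memory and completes only on the initiator.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyQueuePair:
    """Toy stand-in for one side of a connection; not a real verbs object."""
    memory: bytearray                                 # plays the role of a registered MR
    recv_queue: deque = field(default_factory=deque)  # posted receive requests (RR)
    cq: list = field(default_factory=list)            # completion queue (CQ) entries

    def post_recv(self, offset: int, length: int) -> None:
        # The receiver must post a buffer before a two-sided Send arrives.
        self.recv_queue.append((offset, length))


def send(src: ToyQueuePair, dst: ToyQueuePair, payload: bytes) -> None:
    """Two-sided Send: consumes a posted RR and completes on both sides."""
    if not dst.recv_queue:
        raise RuntimeError("receiver not ready: no receive request posted")
    offset, length = dst.recv_queue.popleft()
    if len(payload) > length:
        raise ValueError("payload larger than posted receive buffer")
    dst.memory[offset:offset + len(payload)] = payload
    src.cq.append("SEND")
    dst.cq.append("RECV")


def rdma_write(src: ToyQueuePair, dst: ToyQueuePair,
               remote_offset: int, payload: bytes) -> None:
    """One-sided RDMA Write: lands in remote memory, no remote completion."""
    dst.memory[remote_offset:remote_offset + len(payload)] = payload
    src.cq.append("RDMA_WRITE")


a = ToyQueuePair(memory=bytearray(16))
b = ToyQueuePair(memory=bytearray(16))
b.post_recv(offset=0, length=8)
send(a, b, b"hello")
rdma_write(a, b, remote_offset=8, payload=b"world")
print(bytes(b.memory), a.cq, b.cq)
```

In this model the remote side learns about the Send through its CQ but sees nothing for the RDMA Write, which is the property that lets one‑sided operations bypass the remote CPU.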
2. Key Network Technologies for Large‑Scale RDMA
2.1 Congestion‑Control Algorithms
Traditional RDMA deployments rely on Priority Flow Control (PFC) to avoid packet loss. PFC pauses traffic per port and priority class rather than per flow, and this coarse granularity can cause congestion spreading, deadlock, and unfairness. Per‑flow congestion‑control algorithms are essential to mitigate these issues.
DCQCN – rate‑based algorithm used by Mellanox, based on ECN marks (SIGCOMM 2015).
TIMELY and Swift – RTT‑based algorithms deployed in Google data centers; TIMELY adjusts the sending rate, while Swift adjusts a congestion window (SIGCOMM 2015, 2020).
HPCC – window‑based algorithm using INT, proposed by Alibaba (SIGCOMM 2019).
Other vendor algorithms – for example, DCTCP variants on Broadcom NICs.
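As a rough illustration of how a rate‑based, ECN‑driven algorithm such as DCQCN reacts at the sender, the sketch below follows the update rules described in the DCQCN paper in simplified form: a multiplicative cut when a CNP arrives, then recovery toward the previous rate. The class name `DcqcnSender`, the parameter values, and the omission of timers, byte counters, and the hyper‑increase stage are all simplifying assumptions; this is not a NIC implementation.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Simplified DCQCN sender state; parameter values are illustrative."""
    line_rate_gbps: float = 100.0
    g: float = 1.0 / 256        # gain for the alpha moving average
    rate_ai_gbps: float = 0.5   # additive-increase step (assumed value)
    current_rate: float = 100.0
    target_rate: float = 100.0
    alpha: float = 1.0          # estimate of congestion severity

    def on_cnp(self) -> None:
        # Congestion notification received: remember the old rate, then cut.
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        # No CNP during the alpha timer: decay the congestion estimate.
        self.alpha *= 1 - self.g

    def fast_recovery(self) -> None:
        # Move halfway back toward the rate used before the last cut.
        self.current_rate = (self.target_rate + self.current_rate) / 2

    def additive_increase(self) -> None:
        # Probe for more bandwidth, never targeting more than line rate.
        self.target_rate = min(self.target_rate + self.rate_ai_gbps,
                               self.line_rate_gbps)
        self.current_rate = (self.target_rate + self.current_rate) / 2


sender = DcqcnSender()
sender.on_cnp()          # rate is cut
sender.fast_recovery()   # rate climbs halfway back toward the target
print(sender.current_rate, sender.target_rate)
```

The same skeleton would not fit window‑based schemes such as HPCC, which bound the bytes in flight instead of adjusting a pacing rate.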
2.2 Evolution from Lossless to Lossy Networks
RDMA originally ran over InfiniBand, which uses credit‑based flow control. In data‑center IP networks (RoCE v2), lossless operation has been required because older NICs handle packet loss poorly. PFC provides lossless behavior but brings the coarse‑grained problems described in 2.1 (congestion spreading, deadlock, unfairness). Newer NICs (CX‑6, CX‑7) add efficient loss recovery and end‑to‑end flow control, making lossy networks viable (SIGCOMM 2018).
2.3 RDMA Multipath Transmission
Fat‑Tree/Clos topologies provide multiple equal‑cost paths, but ECMP’s stateless hashing can cause path collisions. Solutions include:
Hardware‑assisted multipath RDMA (Microsoft research, NSDI 2018).
Software‑managed sub‑flows (flowlet/flowcell) with reordering handled in drivers (Amazon SRD).
Intelligent routing based on workload characteristics (e.g., AI training traffic).
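The path‑collision problem can be sketched with a few lines of code. The example assumes a switch that hashes the flow 5‑tuple to pick one of several equal‑cost paths; CRC32 stands in for the vendor‑specific hash, and the addresses and source ports are made up (4791 is the RoCE v2 UDP destination port). `spray_load` shows the idealized per‑packet alternative that multipath schemes approximate.

```python
import zlib
from collections import Counter


def ecmp_path(flow: tuple, num_paths: int) -> int:
    """Stateless ECMP: the same 5-tuple always maps to the same path."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_paths  # CRC32 is a stand-in hash


def path_load(flows: list, num_paths: int) -> Counter:
    """Number of flows that land on each path."""
    return Counter(ecmp_path(f, num_paths) for f in flows)


def spray_load(num_packets: int, num_paths: int) -> Counter:
    """Idealized per-packet spraying: round-robin across all paths."""
    return Counter(i % num_paths for i in range(num_packets))


# Nine hypothetical RoCE v2 flows over eight equal-cost paths.
flows = [("10.0.0.1", "10.0.1.1", 49152 + i, 4791, "UDP") for i in range(9)]
load = path_load(flows, num_paths=8)
print("flows per path:", dict(load))
print("packets per path when spraying:", dict(spray_load(9000, 8)))
```

With more flows than paths at least one path must carry two flows, and in practice collisions appear well before that point. Because training traffic tends to consist of a small number of large flows, a single collision can halve the throughput of the flows involved, which is why spraying and sub‑flow schemes accept the cost of reordering.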
3. RDMA Virtualization
Key works for cloud deployment: Microsoft FreeFlow (NSDI 2019) and Huawei MasQ (SIGCOMM 2020).
4. RDMA Communication Libraries
4.1 AI Training
Collective communication libraries such as NVIDIA NCCL, Facebook Gloo, and vendor‑specific variants (Google, Microsoft, AWS OFI, Huawei HCCL, Alibaba ACCL) enable high‑performance training.
4.2 Storage
Middleware examples include Alibaba X‑RDMA (CLUSTER 2019) and NVMe over Fabrics (NVMe‑oF).
4.3 Practical Tips
This part covers practical techniques for building key‑value services, remote memory, and in‑memory transaction processing on RDMA (e.g., FaRM, Fast In‑memory Transaction Processing).
5. RDMA Performance Optimization, Monitoring, and Operations
5.1 Performance Optimization
RDMA breaks traditional layering, coupling application logic with NIC verbs, exposing PCIe, OS scheduling, and cache management bottlenecks. Tools like Collie (NSDI 2022) use simulated annealing to locate performance anomalies. Practical advice emphasizes NIC selection, cache management, and cross‑layer joint optimization.
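The idea of searching a workload space with simulated annealing can be sketched as follows. This is not Collie's implementation: `toy_throughput` is a made‑up performance model with an artificial dip, and the two parameters (message size and number of queue pairs), their ranges, and the cooling schedule are assumptions chosen only to show the search loop. The search minimizes throughput because the goal is to find anomalous configurations, not good ones.

```python
import math
import random


def toy_throughput(msg_size_kb: int, num_qps: int) -> float:
    """Hypothetical model: 100 Gbps everywhere except a dip near (48, 24)."""
    dip = 60.0 * math.exp(-((msg_size_kb - 48) ** 2) / 200.0
                          - ((num_qps - 24) ** 2) / 50.0)
    return 100.0 - dip


def anneal(seed: int = 0, steps: int = 2000,
           t0: float = 10.0, cooling: float = 0.995):
    """Search for the configuration with the LOWEST modeled throughput."""
    rng = random.Random(seed)
    current = (rng.randint(1, 128), rng.randint(1, 64))
    current_score = toy_throughput(*current)
    initial_score = current_score
    best, best_score = current, current_score
    temp = t0
    for _ in range(steps):
        # Propose a neighbor by nudging both parameters, clamped to range.
        cand = (min(128, max(1, current[0] + rng.randint(-4, 4))),
                min(64, max(1, current[1] + rng.randint(-2, 2))))
        cand_score = toy_throughput(*cand)
        delta = cand_score - current_score
        # Always accept a worse-performing workload; accept a better one
        # with a probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_score = cand, cand_score
        if current_score < best_score:
            best, best_score = current, current_score
        temp *= cooling
    return initial_score, best, best_score


initial_score, best, best_score = anneal()
print(f"start {initial_score:.1f} Gbps, worst found {best_score:.1f} Gbps at {best}")
```

On real hardware each evaluation is an actual traffic run rather than a function call, so the number of steps the search can afford is far smaller than in this sketch.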
5.2 Monitoring and Operations
TCP monitoring typically works at second‑level granularity. RDMA additionally requires millisecond‑level metrics (PFC, ECN, CNP, NACK) for reliable operation.
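A small sketch shows why the sampling interval matters. It assumes a cumulative PFC pause‑frame counter that can be polled with timestamps; the trace, the threshold, and the function names are hypothetical, and real counter names and polling mechanisms differ by NIC and switch vendor.

```python
def counter_rates(samples):
    """Turn (timestamp_ms, cumulative_count) samples into per-interval rates.

    Returns a list of (interval_end_ms, events_per_ms).
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates


def bursts(samples, threshold_per_ms):
    """Interval end times whose event rate exceeds the threshold."""
    return [t for t, rate in counter_rates(samples) if rate > threshold_per_ms]


def cumulative_pause_frames(t_ms: int) -> int:
    """Hypothetical trace: 100 pause frames per ms between 400 ms and 405 ms."""
    return 100 * max(0, min(t_ms, 405) - 400)


fine = [(t, cumulative_pause_frames(t)) for t in range(0, 1001)]  # 1 ms polling
coarse = [fine[0], fine[-1]]                                      # 1 s polling

print("1 ms polling flags:", bursts(fine, threshold_per_ms=10))
print("1 s polling flags:", bursts(coarse, threshold_per_ms=10))
print("1 s average rate:", counter_rates(coarse))
```

The 5 ms pause storm is obvious at 1 ms resolution but averages out to 0.5 frames per millisecond over a full second, so a second‑level monitor with the same threshold reports nothing.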
6. Industry RDMA Approaches
Amazon SRD/EFA (IEEE Micro 2020) and other large‑scale deployments.
7. Deployment Experiences
7.1 Microsoft
Early large‑scale RoCE v2 deployment, discussing transport livelock, PFC deadlock, pause storms, and slow‑receiver symptoms (SIGCOMM 2016).
7.2 Alibaba
RDMA use in the Pangu storage cluster (NSDI 2021).
8. Summary and Outlook
RDMA has shown huge potential in AI training and high‑performance storage. Future challenges include pod‑to‑pod and full‑network RDMA scaling, seamless migration of existing applications, and low‑overhead diagnostics and monitoring.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
