Why RDMA Makes NVMe‑over‑Fabric Faster: A Deep Dive into Fabrics, FC, InfiniBand, RoCE and TCP
The article examines how NVMe‑over‑Fabric extends NVMe beyond PCIe using various fabrics—FC, InfiniBand, RoCE v2, iWARP and TCP—highlighting RDMA’s zero‑copy, kernel‑bypass and CPU‑free advantages, and comparing protocol differences, performance trade‑offs, and the evolution toward NVMe/TCP.
NVMe‑over‑Fabric Overview
NVMe‑over‑Fabric (NVMe‑oF) abstracts the NVMe command and data transport layer to enable reliable NVMe communication over network fabrics, challenging the dominance of SCSI in SAN environments. The standard supports multiple fabric transports, primarily Fibre Channel (FC), InfiniBand, RoCE v2, iWARP and TCP.
Why RDMA‑Based Fabrics Are Preferred
InfiniBand, RoCE v2 (routable RoCE) and iWARP are considered ideal fabrics because they support Remote Direct Memory Access (RDMA). RDMA allows a server’s NIC to read or write memory on a remote server directly, bypassing the CPU and OS kernel, resulting in high bandwidth, low latency and low resource utilization.
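InfiniBand (IB): A purpose‑built interconnect that natively supports RDMA but requires its own adapters (HCAs) and InfiniBand switches.
RDMA over Converged Ethernet (RoCE): Enables RDMA over Ethernet by carrying InfiniBand transport packets inside Ethernet frames (RoCE v1) or inside UDP/IP (RoCE v2, which is therefore routable); the NIC must support RoCE, and deployments typically also rely on a lossless or congestion‑managed Ethernet configuration.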
iWARP: Implements RDMA over TCP, allowing RDMA on standard Ethernet hardware; however, software‑based iWARP stacks lose many performance benefits of hardware RDMA.
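The difference between the two RoCE generations can be made concrete by listing the headers each one places around the payload. The Python sketch below is illustrative only: the names ROCE_V1, ROCE_V2, overhead and is_ip_routable are made up for this example, and the sizes assume untagged Ethernet, IPv4 without options, and no extended InfiniBand headers (such as the RETH used by RDMA writes).

```python
# Header stacks in wire order as (name, size in bytes).
# Assumptions: untagged Ethernet, IPv4 without options, no extended IB headers.
ROCE_V1 = [("Ethernet (EtherType 0x8915)", 14), ("IB GRH", 40), ("IB BTH", 12)]
ROCE_V2 = [("Ethernet", 14), ("IPv4", 20), ("UDP (dst port 4791)", 8), ("IB BTH", 12)]
TRAILER = [("IB ICRC", 4), ("Ethernet FCS", 4)]


def overhead(stack):
    """Total header plus trailer bytes added around the payload."""
    return sum(size for _, size in stack + TRAILER)


def is_ip_routable(stack):
    """A packet can cross IP routers only if it carries an IP header."""
    return any(name.startswith(("IPv4", "IPv6")) for name, _ in stack)


for label, stack in (("RoCE v1", ROCE_V1), ("RoCE v2", ROCE_V2)):
    print(label, overhead(stack), "bytes of overhead, IP-routable:", is_ip_routable(stack))
```

Under these assumptions RoCE v2 is the only variant with an IP header, which is why it can be forwarded across layer‑3 boundaries while RoCE v1 stays within a single Ethernet broadcast domain.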
Key RDMA advantages include:
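Zero‑Copy: Data moves directly between application buffers and the NIC without intermediate copies through protocol‑stack buffers, shortening the data path.
Kernel‑Bypass: Applications post work to the device interface directly from user space, avoiding system‑call and context‑switch overhead on the data path.
No CPU involvement: The NIC handles the transfer itself, freeing CPU cycles for the application.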
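The one‑sided nature of an RDMA write can be modeled in a few lines. The sketch below is a toy model, not a real verbs API: MemoryRegion, ToyNic and rdma_write are hypothetical names, and the code illustrates only the semantics (a registered buffer, a remote key, and a write that lands directly in that buffer), not the performance of real hardware.

```python
class MemoryRegion:
    """Toy stand-in for a registered RDMA memory region (hypothetical)."""

    def __init__(self, size, rkey):
        self.buf = bytearray(size)  # application buffer, "pinned" at registration
        self.rkey = rkey            # key a remote peer must present to access it


class ToyNic:
    """Toy NIC that places incoming writes straight into registered buffers."""

    def __init__(self):
        self.regions = {}

    def register(self, region):
        self.regions[region.rkey] = region

    def rdma_write(self, rkey, offset, payload):
        region = self.regions[rkey]     # unknown rkey raises KeyError
        view = memoryview(region.buf)   # a view of the buffer, not a copy
        end = offset + len(payload)
        if offset < 0 or end > len(view):
            raise ValueError("write falls outside the registered region")
        # Data lands in the application buffer; no handler on the target runs.
        view[offset:end] = payload


nic = ToyNic()
region = MemoryRegion(16, rkey=0x1234)
nic.register(region)
nic.rdma_write(0x1234, 4, b"NVMe")
print(bytes(region.buf[4:8]))
```

The point of the model is that the target application never calls a receive function: once the region is registered, the initiator names the destination by key and offset, and the NIC performs the placement.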
NVMe‑oF vs. Traditional NVMe
Traditional NVMe uses PCIe to map submission and completion queues into memory shared between host and device. NVMe‑oF replaces this shared‑memory model with a message‑based one that carries the same requests and responses over a network, with the design goal of adding no more than about 10 µs of latency between host and target when connected via an appropriate fabric.
Technical differences include:
Extended naming mechanisms (e.g., SUBNQN).
New capsule‑based message formats (Command Capsule, Response Capsule).
Support for In‑Capsule Data via Scatter‑Gather Lists (SGLs).
Discovery and Connect mechanisms for locating NVMe subsystems.
Queue creation moves to the connection layer: queues are set up with the Connect command, and the legacy create/delete queue commands are not used.
The PCIe interrupt mechanism has no counterpart in NVMe‑oF.
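Completion Queue (CQ) flow control is not provided; the host must ensure a CQ slot is available before submitting a command, which bounds the number of outstanding capsules per queue.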
NVMe‑oF supports only SGLs, whereas NVMe over PCIe supports both SGL and PRP.
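The capsule idea from the list above can be sketched as a small data structure: a command capsule is a 64‑byte submission queue entry (SQE) optionally followed by in‑capsule data. The Python below is a minimal sketch under stated assumptions: make_sqe and CommandCapsule are hypothetical names, only the opcode, flags and command identifier fields of the SQE are filled in, and the rest is zero padding rather than a valid command.

```python
import struct
from dataclasses import dataclass

SQE_SIZE = 64  # size of an NVMe submission queue entry in bytes


def make_sqe(opcode, cid, flags=0x40):
    """Build a mostly empty 64-byte SQE.

    Byte 0 is the opcode, byte 1 holds flags, bytes 2-3 the command identifier.
    The default flags value sets PSDT to 01b (SGLs), since NVMe-oF does not use PRPs.
    """
    head = struct.pack("<BBH", opcode, flags, cid)
    return head + bytes(SQE_SIZE - len(head))


@dataclass
class CommandCapsule:
    sqe: bytes
    data: bytes = b""  # optional in-capsule data

    def __post_init__(self):
        if len(self.sqe) != SQE_SIZE:
            raise ValueError("SQE must be exactly 64 bytes")

    def to_bytes(self):
        return self.sqe + self.data


# A write command (opcode 0x01) carrying 512 bytes of in-capsule data.
capsule = CommandCapsule(make_sqe(opcode=0x01, cid=7), data=b"\xab" * 512)
wire = capsule.to_bytes()
print(len(wire))
```

Carrying small writes inside the capsule saves the extra round trip that would otherwise be needed to fetch the data from the host, at the cost of larger receive buffers on the target.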
Fibre Channel (FC) as an NVMe‑oF Fabric
FC‑NVMe maps NVMe commands onto Fibre Channel Protocol (FCP) exchanges and inherits FC’s credit‑based flow control, reliability, and zero‑copy DMA support. It is suited to large flash‑based block‑storage deployments and can coexist with traditional SCSI traffic on the same FC infrastructure.
Key components that must support FC‑NVMe include storage operating systems and network adapters; vendors such as Broadcom and Cavium provide compliant HBAs, and Brocade’s Gen 6 FC switches already support NVMe‑oF.
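The credit‑based flow control that FC‑NVMe inherits can be sketched as a simple counter: the sender may transmit only while it holds credits, and each credit is returned when the receiver frees a buffer (signaled in Fibre Channel by R_RDY). The class and method names below are hypothetical, and the model ignores frame contents, timeouts and credit recovery.

```python
class CreditedLink:
    """Toy model of buffer-to-buffer credit flow control (hypothetical names)."""

    def __init__(self, credits):
        self.credits = credits  # receive buffers advertised by the peer
        self.in_flight = 0      # frames sent but not yet acknowledged

    def try_send(self):
        """Send one frame if a credit is available; never overrun the receiver."""
        if self.credits == 0:
            return False        # sender must wait, so no frame is dropped
        self.credits -= 1
        self.in_flight += 1
        return True

    def receiver_ready(self):
        """Peer freed a buffer (R_RDY in Fibre Channel): return one credit."""
        if self.in_flight == 0:
            raise RuntimeError("credit returned with no frame in flight")
        self.in_flight -= 1
        self.credits += 1


link = CreditedLink(credits=2)
results = [link.try_send() for _ in range(3)]
print(results)  # third send is refused until a credit comes back
link.receiver_ready()
print(link.try_send())
```

Because the sender stops before the receiver runs out of buffers, congestion shows up as waiting rather than as dropped frames, which is the property that makes the fabric behave losslessly for storage traffic.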
NVMe over TCP (NVMe/TCP)
NVMe/TCP emerged to address scenarios where RDMA hardware is unavailable. Its adoption is driven by four main factors:
NVMe virtualization: Virtual NVMe targets do not require the full performance of physical devices, making TCP acceptable.
Backward compatibility: NVMe‑oF aims to replace iSCSI; TCP allows legacy Ethernet equipment to participate without RDMA‑capable NICs.
TCP offloading: Smart NICs or FPGA‑based offload engines can mitigate TCP’s inherent latency.
Software RoCE simplification: TCP eliminates the need for kernel modules that emulate RDMA over UDP, simplifying test deployments.
While TCP adds latency compared to native RDMA, these mitigations make it a viable fabric for many data‑center workloads.
In summary, RDMA‑based fabrics (InfiniBand, RoCE, iWARP) provide the lowest latency and CPU overhead for NVMe‑oF, but FC remains a strong contender due to its mature credit‑based flow control. NVMe/TCP offers a flexible, hardware‑agnostic alternative, albeit with higher latency that can be offset by offloading technologies.
Architects' Tech Alliance