Detailed Overview of NVMe Architecture and NVMe over Fabrics
This article provides a technical overview of NVMe architecture and the NVMe over Fabrics (NVMe‑oF) extensions—InfiniBand, RoCE, iWARP, Fibre Channel, and TCP—explaining the advantages of RDMA transports, the differences among the protocols, and practical considerations for data‑center storage deployments.
The NVMe transport is an abstraction layer that reliably delivers NVMe commands and data. To extend NVMe beyond the local PCIe bus for data‑center storage, NVMe over Fabrics (NVMe‑oF) maps NVMe onto several fabric transports, primarily Fibre Channel (FC), InfiniBand, RoCE v2, iWARP, and TCP.
The fabric options that support RDMA—InfiniBand, RoCE v2 (routable RoCE), and iWARP—are often considered ideal because they let one host read and write another host's memory with minimal CPU involvement, offering high bandwidth, low latency, and low resource utilization.
InfiniBand (IB): A network architecture designed around RDMA from the outset; it requires InfiniBand‑capable adapters and switches.
RDMA over Converged Ethernet (RoCE): Carries RDMA over Ethernet by encapsulating InfiniBand transport headers in Ethernet frames (RoCE v1) or in UDP/IP packets (RoCE v2, which is routable); only the NIC needs RoCE support.
iWARP: Implements RDMA over TCP, enabling RDMA on standard, routed Ethernet networks, though TCP processing typically adds latency compared with InfiniBand or RoCE.
RDMA’s key advantages are zero‑copy (data moves without being copied between protocol‑stack buffers), kernel bypass (applications post work directly to the NIC from user space), and CPU offload (the NIC handles the data transfer, leaving the CPU free for other work).
NVMe‑oF differs from local PCIe NVMe in that it uses a message‑based model: commands and completions are exchanged as capsules over the network, extending the NVMe command mechanism to remote targets while aiming to add no more than about 10 µs of fabric latency.
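To make the capsule idea concrete, the sketch below packs and unpacks a deliberately simplified "capsule"—a small fixed header optionally followed by in‑capsule data. The field layout here is invented for illustration and is not the wire format defined by the NVMe‑oF specification.

```python
import struct

# Hypothetical, simplified capsule layout for illustration only --
# NOT the actual NVMe-oF wire format, where a command capsule carries
# a 64-byte submission-queue entry and may carry data inline.
HEADER = struct.Struct("<BBHI")  # opcode, flags, command id, data length

def build_capsule(opcode: int, cid: int, data: bytes = b"") -> bytes:
    """Serialize a command header plus optional in-capsule data."""
    return HEADER.pack(opcode, 0, cid, len(data)) + data

def parse_capsule(buf: bytes):
    """Split a capsule back into its header fields and payload."""
    opcode, flags, cid, dlen = HEADER.unpack_from(buf)
    data = buf[HEADER.size:HEADER.size + dlen]
    return opcode, cid, data

cap = build_capsule(opcode=0x01, cid=7, data=b"hello")
print(parse_capsule(cap))  # (1, 7, b'hello')
```

The point of the exercise is that once commands travel as self‑contained messages rather than PCIe register writes, any reliable transport—RDMA or TCP—can carry them.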
Specific protocol extensions in NVMe‑oF include new naming mechanisms (e.g., the subsystem NQN, or SUBNQN), capsule‑based messaging, expanded Scatter‑Gather List support, discovery and connection procedures, and the removal of PCIe‑specific doorbell and interrupt handling.
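The naming mechanism is worth a quick look: hosts and subsystems are identified by NVMe Qualified Names (NQNs) such as `nqn.2014-08.org.nvmexpress:uuid:<uuid>` for UUID-based names, and the well-known discovery subsystem uses `nqn.2014-08.org.nvmexpress.discovery`. The sketch below checks the general NQN shape with a loose regular expression; the pattern is an approximation for illustration, not a full spec-level validator.

```python
import re
import uuid

# Loose pattern approximating the general NQN shape, e.g.
#   nqn.<yyyy-mm>.<reverse-domain>[:identifier]
# Illustrative only; not a complete NVMe-oF spec validator.
NQN_RE = re.compile(r"^nqn\.\d{4}-\d{2}\.[a-z0-9.-]+(:.+)?$")

def make_uuid_nqn() -> str:
    """Build a UUID-based NQN for a host without its own naming authority."""
    return f"nqn.2014-08.org.nvmexpress:uuid:{uuid.uuid4()}"

# Well-known NQN of the discovery subsystem.
DISCOVERY_NQN = "nqn.2014-08.org.nvmexpress.discovery"

for nqn in (make_uuid_nqn(), DISCOVERY_NQN,
            "nqn.2016-06.io.example:storage:array1"):  # example domain, hypothetical
    print(nqn, bool(NQN_RE.match(nqn)))
```

A host presents its NQN when connecting so the target can apply per-host access control, much as iSCSI uses IQNs.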
Fibre Channel (FC‑NVMe) adapts the NVMe command set to the FC transport, leveraging FC’s credit‑based flow control and zero‑copy DMA capabilities. Major vendors such as Broadcom and Cavium provide FC‑NVMe HBAs, and newer FC switches (e.g., Gen 6) support NVMe‑oF.
While RDMA is often highlighted as the ideal transport, the NVMe‑over‑TCP option demonstrates that a standard TCP fabric can also satisfy NVMe‑oF requirements, especially when combined with TCP offloading, smart NICs, or FPGA acceleration.
Reasons for the emergence of NVMe‑over‑TCP include NVMe virtualization (allowing virtual targets), compatibility with existing Ethernet infrastructure, maturing TCP offload technologies, and the availability of software RDMA (e.g., Soft‑RoCE) only as a testing stopgap rather than a production path.
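To show why plain TCP can carry the message model, here is a deliberately simplified sketch of a length‑prefixed request/response exchange over a socket pair standing in for an initiator/target connection. Real NVMe/TCP defines its own PDU types (connection‑initialization, command, and data PDUs), which this sketch does not reproduce; the framing scheme below is an assumption for illustration.

```python
import socket
import struct

def send_msg(sock, payload: bytes) -> None:
    """Send one length-prefixed message (4-byte big-endian length + payload)."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes, looping over partial TCP reads."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def recv_msg(sock) -> bytes:
    """Receive exactly one length-prefixed message."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# A local socketpair stands in for an initiator<->target TCP connection.
initiator, target = socket.socketpair()
send_msg(initiator, b"READ lba=0 len=4096")  # hypothetical request text
print(recv_msg(target))
send_msg(target, b"COMPLETION status=0")     # hypothetical response text
print(recv_msg(initiator))
```

Because TCP already provides ordered, reliable delivery, the transport work reduces to framing messages on a byte stream; the cost relative to RDMA is extra copies and CPU cycles, which is exactly what TCP offload engines and smart NICs aim to recover.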
In summary, the article outlines the technical landscape of NVMe over Fabrics, comparing RDMA‑based fabrics (InfiniBand, RoCE, iWARP) and Fibre Channel with TCP‑based solutions, and discusses the trade‑offs that influence storage architects when selecting a fabric for high‑performance NVMe deployments.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.