Eight-Step NVMe Read I/O Process over PCIe Explained
This article details the eight-step NVMe read I/O workflow over PCIe, covering the protocol layers, submission and completion queues, doorbell registers, data transfer mechanisms like PRP, and a trace‑based walkthrough of each host‑SSD interaction.
The NVMe transport is an abstract protocol layer designed to provide reliable NVMe command and data transfer. This article explains how a host reads data from an NVMe SSD and outlines the eight steps involved in a read I/O operation.
The process spans multiple layers, starting from the PCIe transaction layer, where both NVMe commands and data are encapsulated into Transaction Layer Packets (TLPs). NVMe SSDs in U.2, M.2, and other form factors connect to the host over PCIe.
The PCI Express layered model diagram illustrates each protocol layer. The article uses a trace record, combined with NVMe and TLP information, to dissect the read I/O.
NVMe’s command handling flow (Revision 1.2a) is presented, noting that different NVMe versions do not fundamentally change the read operation.
Key concepts include the host and SSD as the two actors, the Submission Queue (SQ) and Completion Queue (CQ) residing in host memory, and the distinction between Admin and NVM commands.
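Each SQ has a Tail Doorbell and each CQ has a Head Doorbell. These are controller registers in the SSD’s PCIe BAR space, which the host maps into its own address space. Writing the SQ Tail Doorbell tells the SSD that new commands are waiting; writing the CQ Head Doorbell tells it that the host has consumed completion entries.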
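The doorbell placement can be sketched in a few lines. This is a minimal illustration based on the register layout in the NVMe specification, where doorbells start at offset 0x1000 of the controller register space and are spaced by a stride derived from CAP.DSTRD. The function names are hypothetical, and the default stride of 0 is an assumption; the article does not state the stride of the traced drive.

```python
DOORBELL_BASE = 0x1000  # first doorbell register offset within the controller registers


def sq_tail_doorbell_offset(qid: int, dstrd: int = 0) -> int:
    """Offset of the Tail Doorbell for Submission Queue `qid`."""
    return DOORBELL_BASE + (2 * qid) * (4 << dstrd)


def cq_head_doorbell_offset(qid: int, dstrd: int = 0) -> int:
    """Offset of the Head Doorbell for Completion Queue `qid`."""
    return DOORBELL_BASE + (2 * qid + 1) * (4 << dstrd)


if __name__ == "__main__":
    # Queue 0 is the Admin queue pair; queues 1 and up carry I/O commands.
    for qid in (0, 1, 2):
        print(qid, hex(sq_tail_doorbell_offset(qid)), hex(cq_head_doorbell_offset(qid)))
```

The host adds these offsets to the address at which the controller registers are mapped to obtain the address it actually writes.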
Data transfer uses PRP (Physical Region Page) entries or SGLs (Scatter Gather Lists). In the example, a 4 KiB read fits within a single memory page, so PRP1 alone is used.
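As a rough sketch of why PRP1 alone is enough here, the helper below classifies a transfer by how many memory pages it touches. The function name is hypothetical and the 4 KiB page size is an assumption (the real value depends on the controller's configured memory page size).

```python
PAGE_SIZE = 4096  # assumed memory page size in bytes


def prp_layout(addr: int, length: int, page_size: int = PAGE_SIZE) -> str:
    """Describe which PRP fields a transfer of `length` bytes at `addr` needs."""
    offset = addr % page_size
    # Number of memory pages the buffer touches, counting a partial first page.
    pages = (offset + length + page_size - 1) // page_size
    if pages <= 1:
        return "PRP1 only"
    if pages == 2:
        return "PRP1 plus PRP2 as the second page address"
    return "PRP1 plus PRP2 pointing to a PRP list"


if __name__ == "__main__":
    # The article's read: 4 KiB at a page-aligned host address.
    print(prp_layout(0xFEB84000, 4096))
```

A buffer of the same size that starts partway into a page would touch two pages and therefore need PRP2 as well.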
The trace shows a 3.2 TB PBlaze5 NVMe SSD reading 1024 dwords (4 KiB) from LBA 0x8, delivering the data to host address 0xFEB84000, and completing with a status code (SC) of Successful Completion.
Step 1: Host prepares the command and places it in the Submission Queue.
Step 2: Host updates the SQ Tail Doorbell (mapped at 0xC6421260) to inform the SSD of the new command.
Step 3: SSD reads the command from the Submission Queue via a Memory Read TLP, receiving 64 bytes (16 dwords) of command data.
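The 64 bytes fetched in this step can be illustrated with a small encoder and decoder for a Read command entry. This is a sketch based on the common command layout in the NVMe specification (opcode and command identifier in dword 0, PRP1 in dwords 6 and 7, starting LBA in dwords 10 and 11, NLB in the low half of dword 12). The starting LBA, NLB, and PRP1 come from the article's trace; the command identifier and namespace ID are hypothetical placeholders, as are the function names.

```python
import struct

NVM_OPCODE_READ = 0x02
# 16 dwords: DW0, DW1, DW2-3, DW4-5 (MPTR), DW6-7 (PRP1), DW8-9 (PRP2),
# DW10-11 (SLBA), DW12, DW13, DW14, DW15. "<" means little-endian, no padding.
SQE_FORMAT = "<IIQQQQQIIII"


def build_read_sqe(cid: int, nsid: int, prp1: int, slba: int, nlb: int, prp2: int = 0) -> bytes:
    """Pack a 64-byte NVM Read submission queue entry."""
    dw0 = NVM_OPCODE_READ | (cid << 16)  # opcode in bits 7:0, CID in bits 31:16
    dw12 = nlb & 0xFFFF                  # zero-based number of logical blocks
    return struct.pack(SQE_FORMAT, dw0, nsid, 0, 0, prp1, prp2, slba, dw12, 0, 0, 0)


def decode_read_sqe(sqe: bytes) -> dict:
    """Unpack the fields this sketch cares about from a 64-byte entry."""
    dw0, nsid, _, _, prp1, prp2, slba, dw12, _, _, _ = struct.unpack(SQE_FORMAT, sqe)
    return {
        "opcode": dw0 & 0xFF,
        "cid": dw0 >> 16,
        "nsid": nsid,
        "prp1": prp1,
        "prp2": prp2,
        "slba": slba,
        "nlb": dw12 & 0xFFFF,
    }


if __name__ == "__main__":
    # cid=1 and nsid=1 are placeholders; the other values are from the trace.
    sqe = build_read_sqe(cid=1, nsid=1, prp1=0xFEB84000, slba=0x8, nlb=0x7)
    print(len(sqe), decode_read_sqe(sqe))
```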
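Step 4: SSD processes the command and uses DMA to write the requested data (starting LBA 0x8, NLB 0x7) to the host’s PRP1 address (0xFEB84000). NLB is a zero-based count, so 0x7 means eight logical blocks, which implies 512-byte blocks for the 4 KiB transfer.
The 4 KiB payload is split into sixteen 256‑byte Memory Write TLPs because of the Max_Payload_Size limit. PRP1 points to the first host memory page; if the transfer spanned more pages, PRP2 or a PRP list would be used as well.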
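The arithmetic behind the sixteen packets can be checked with a short sketch. It assumes a page-aligned buffer and a simple front-to-back split at the 256-byte payload limit; a real device may order or size its writes differently. The function name is hypothetical.

```python
def split_into_tlps(addr: int, length: int, max_payload: int = 256) -> list:
    """Return (address, size) pairs for the write TLPs covering the buffer."""
    tlps = []
    end = addr + length
    while addr < end:
        chunk = min(max_payload, end - addr)  # never exceed the payload limit
        tlps.append((addr, chunk))
        addr += chunk
    return tlps


if __name__ == "__main__":
    nlb = 0x7         # zero-based block count from the command
    lba_size = 512    # block size implied by 8 blocks making up 4 KiB
    length = (nlb + 1) * lba_size
    tlps = split_into_tlps(0xFEB84000, length)
    print(length, len(tlps), hex(tlps[0][0]), hex(tlps[-1][0]))
```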
Step 5: SSD posts a 16‑byte completion entry to the Completion Queue.
The completion entry shows a Successful Completion status code.
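The 16-byte entry can be decoded along the following lines. This sketch follows the completion queue entry layout in the NVMe specification: dword 2 carries the SQ head pointer and SQ identifier, and dword 3 carries the command identifier, the phase tag, and the status field. The function name and the sample values are hypothetical; only the all-zero status (Successful Completion) mirrors the trace.

```python
import struct


def parse_cqe(cqe: bytes) -> dict:
    """Decode a 16-byte completion queue entry into its main fields."""
    dw0, dw1, dw2, dw3 = struct.unpack("<IIII", cqe)
    status = dw3 >> 17  # 15-bit status field in bits 31:17 of dword 3
    return {
        "sq_head": dw2 & 0xFFFF,     # how far the SSD has consumed the SQ
        "sq_id": dw2 >> 16,          # which SQ the command came from
        "cid": dw3 & 0xFFFF,         # matches the CID in the submitted command
        "phase": (dw3 >> 16) & 1,    # phase tag
        "sc": status & 0xFF,         # status code; 0 with sct 0 means success
        "sct": (status >> 8) & 0x7,  # status code type
        "more": (status >> 13) & 1,
        "dnr": (status >> 14) & 1,   # do not retry
    }


if __name__ == "__main__":
    # Placeholder entry: SQ 1, SQ head 1, CID 1, phase 1, status 0.
    sample = struct.pack("<IIII", 0, 0, (1 << 16) | 1, (1 << 16) | 1)
    print(parse_cqe(sample))
```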
Step 6: SSD generates an MSI‑X interrupt to notify the host.
Step 7: Host processes the completion entry internally.
Step 8: Host updates the Completion Queue Head Doorbell, completing the read I/O cycle.
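Steps 7 and 8 can be modeled with a toy completion ring. This is a simplified sketch, not driver code: the class, its fields, and the plain integer standing in for the Head Doorbell register are all hypothetical. It shows the usual phase-tag technique, in which the host treats an entry as new when its phase bit matches the value expected for the current pass through the ring.

```python
class CompletionQueue:
    """Toy host-side model of a completion queue ring."""

    def __init__(self, depth: int):
        self.depth = depth
        self.slots = [{"phase": 0, "cid": None, "sc": None} for _ in range(depth)]
        self.head = 0            # next slot the host will examine
        self.expected_phase = 1  # phase value that marks a fresh entry
        self.head_doorbell = 0   # stand-in for the CQ Head Doorbell register

    def reap(self) -> list:
        """Consume all fresh entries (Step 7), then ring the doorbell (Step 8)."""
        done = []
        while self.slots[self.head]["phase"] == self.expected_phase:
            done.append(dict(self.slots[self.head]))
            self.head += 1
            if self.head == self.depth:
                # Wrapped around: entries from the next pass carry the other phase.
                self.head = 0
                self.expected_phase ^= 1
        if done:
            self.head_doorbell = self.head
        return done


if __name__ == "__main__":
    cq = CompletionQueue(depth=4)
    # Simulate the SSD posting one successful completion (Step 5).
    cq.slots[0] = {"phase": 1, "cid": 1, "sc": 0}
    print(cq.reap(), cq.head_doorbell)
```

Writing the new head value to the doorbell is what lets the SSD reuse the consumed slots.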
The article concludes that while NVMe defines many other commands, the principles demonstrated here apply broadly, enabling readers to understand host‑SSD interactions in NVMe protocols.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
