
Comprehensive Overview of NVMe over PCIe: Architecture, Registers, Commands, and Data Structures

This article provides an in‑depth technical overview of the NVMe (Non‑Volatile Memory Express) protocol over PCIe, covering its logical device interface, namespace concepts, queue mechanisms, register layouts, command formats, controller initialization, interrupt handling, and data protection features.

Architects' Tech Alliance

1. Overview

NVMe (Non‑Volatile Memory Express) is a logical device interface specification similar to AHCI, designed for accessing non‑volatile storage media attached via the PCIe bus. It defines the protocol, instruction set, and register configuration for high‑performance SSDs.

1.1 Terminology

Namespace : A collection of logical blocks (LB) whose attributes (size, capacity, LBA formats) are reported in the Identify Namespace data structure.

Fused Operations : A pair of adjacent commands in the same submission queue that the controller executes as one atomic unit, for example Compare followed by Write.

Write Atomicity : The controller reports the atomic write unit sizes it guarantees; the host can use the Write Atomicity feature to relax that guarantee to a smaller atomic unit when doing so improves performance.

Arbitration Mechanism : Determines which submission queue (SQ) is serviced next. Three methods are defined: round-robin (RR), weighted round-robin with an urgent priority class, and vendor-specific implementations.
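As a rough illustration of the round-robin case, the sketch below visits each non-empty submission queue in turn and takes up to `burst` commands per visit. The function name, the `burst` parameter, and the dictionary layout are hypothetical and chosen for this example only; a real controller implements arbitration in hardware or firmware.

```python
from collections import deque

def round_robin(queues, burst=1):
    """Yield (queue_id, command) pairs in round-robin order.

    queues: dict mapping a queue ID to a list of pending commands
            (a hypothetical stand-in for real submission queues).
    burst:  maximum number of commands taken from one queue per visit.
    """
    pending = {qid: deque(cmds) for qid, cmds in queues.items()}
    order = deque(sorted(pending))           # visit queues in QID order
    while order:
        qid = order.popleft()
        q = pending[qid]
        for _ in range(min(burst, len(q))):  # take at most `burst` commands
            yield qid, q.popleft()
        if q:                                # still has work: go to the back
            order.append(qid)
```

With `burst=1` every queue gets one command per round; a larger burst lets a busy queue drain faster at the cost of fairness.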

2. NVMe SSD Architecture

An NVMe SSD consists of three parts: the host driver (integrated in Linux/Windows), the PCIe‑connected NVMe controller, and the storage medium (FTL + NAND flash).

The controller operates as a DMA engine combined with multiple queues to exploit flash parallelism.

3. PCIe Register Configuration

As a PCIe function, the NVMe controller exposes 4 KB of PCIe configuration space, divided into a PCI header, PCI capabilities, and PCI Express extended capabilities.

PCI Header (type 0 for endpoint devices) defines basic device information.

PCI Capabilities include power management, interrupt management (MSI/MSI‑X), and PCIe capabilities.

PCI Express Extended Capabilities provide advanced features such as Advanced Error Reporting (AER).

4. NVMe Register Definitions

Key registers include:

CAP – controller capabilities (page size, I/O command set, arbitration, queue attributes)

VS – version number

INTMS / INTMC – interrupt mask set and interrupt mask clear

CC – controller configuration (queue element size, shutdown notification, arbitration, page size, enable)

CSTS – controller status (shutdown, fatal error, ready)

AQA – admin queue attributes (SQ and CQ sizes)

ASQ / ACQ – admin SQ and CQ base addresses

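Registers starting at offset 0x1000 are the doorbells: each queue pair has a submission queue tail doorbell and a completion queue head doorbell, spaced according to the doorbell stride reported in CAP.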
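To make the register layout more concrete, here is a minimal sketch that decodes a few fields of the 64-bit CAP register and computes doorbell offsets. The bit positions and the doorbell formula follow the NVMe base specification as I understand it; the function names and the returned dictionary are assumptions made for this example.

```python
def decode_cap(cap: int) -> dict:
    """Extract selected fields from a 64-bit CAP register value."""
    return {
        "MQES": cap & 0xFFFF,            # max queue entries supported (0-based)
        "CQR": (cap >> 16) & 0x1,        # contiguous queues required
        "AMS": (cap >> 17) & 0x3,        # arbitration mechanisms supported
        "TO": (cap >> 24) & 0xFF,        # ready timeout, in 500 ms units
        "DSTRD": (cap >> 32) & 0xF,      # doorbell stride exponent
        "MPSMIN": (cap >> 48) & 0xF,     # min page size = 2 ** (12 + MPSMIN)
        "MPSMAX": (cap >> 52) & 0xF,     # max page size = 2 ** (12 + MPSMAX)
    }

def doorbell_offsets(qid: int, dstrd: int) -> tuple:
    """Return (SQ tail doorbell, CQ head doorbell) offsets for queue `qid`."""
    stride = 4 << dstrd                  # bytes between consecutive doorbells
    sq_tail = 0x1000 + (2 * qid) * stride
    cq_head = 0x1000 + (2 * qid + 1) * stride
    return sq_tail, cq_head
```

For the admin queue pair (QID 0) with a stride exponent of 0, this places the SQ tail doorbell at 0x1000 and the CQ head doorbell at 0x1004.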

5. Memory Data Structures

Submission Queues (SQ) and Completion Queues (CQ) are circular buffers that work in pairs. A queue is empty when head = tail and full when head = tail + 1 (modulo the queue size), so a full queue always leaves one slot unused.

Queue size is set by the host: a queue holds at least 2 entries and at most 64 K entries for I/O queues or 4 K entries for admin queues. Queue IDs (QID) are 16-bit values allocated by the host, with QID 0 reserved for the admin queue pair.
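The empty and full conditions can be sketched with a small ring-index class. This is only a model of the head/tail arithmetic, not a driver data structure; the class and method names are hypothetical.

```python
class RingQueue:
    """Head/tail bookkeeping for a circular queue with one unused slot."""

    def __init__(self, size: int):
        if not 2 <= size <= 65536:
            raise ValueError("queue size must be between 2 and 64 K entries")
        self.size = size
        self.head = 0   # next slot the consumer will read
        self.tail = 0   # next slot the producer will write

    def is_empty(self) -> bool:
        return self.head == self.tail

    def is_full(self) -> bool:
        # Full when advancing the tail would make it collide with the head.
        return (self.tail + 1) % self.size == self.head

    def used(self) -> int:
        return (self.tail - self.head) % self.size

    def produce(self) -> int:
        """Claim the tail slot and advance the tail; returns the slot index."""
        if self.is_full():
            raise OverflowError("queue is full")
        slot = self.tail
        self.tail = (self.tail + 1) % self.size
        return slot

    def consume(self) -> int:
        """Release the head slot and advance the head; returns the slot index."""
        if self.is_empty():
            raise IndexError("queue is empty")
        slot = self.head
        self.head = (self.head + 1) % self.size
        return slot
```

A queue created with 4 slots can therefore hold at most 3 outstanding entries.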

5.1 PRP (Physical Region Page)

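PRP entries are 64-bit physical addresses pointing to memory pages. Two addressing modes exist: a direct PRP pointer to a data page, and a pointer to a PRP List, which is itself a page of PRP entries used when a transfer spans more pages than the command's two PRP fields can describe. In NVMe over PCIe, PRPs are used for admin commands; I/O commands use PRPs, or SGLs where the controller supports them.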
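The choice between a direct pointer and a PRP List can be sketched as follows. This is a simplified model that assumes the buffer is physically contiguous and that any PRP List fits in a single page; the function name and return convention are hypothetical. In the list case the second field is returned as `None` because the caller would have to allocate a page for the list and place its address there.

```python
def build_prps(addr: int, length: int, page_size: int = 4096):
    """Return (prp1, prp2, prp_list) for a physically contiguous buffer.

    prp1     : address of the first byte (may carry an offset into its page)
    prp2     : 0 if unused, the second page address if exactly two pages are
               touched, or None if a PRP List is needed
    prp_list : page-aligned addresses that would be written into the PRP List
    """
    if length <= 0:
        raise ValueError("length must be positive")
    first_len = page_size - (addr % page_size)   # bytes left in the first page
    pages = [addr]
    next_page = addr + first_len                 # page aligned from here on
    remaining = length - first_len
    while remaining > 0:
        pages.append(next_page)
        next_page += page_size
        remaining -= page_size
    if len(pages) == 1:
        return addr, 0, []
    if len(pages) == 2:
        return addr, pages[1], []
    return addr, None, pages[1:]
```

Only the first entry may carry a non-zero offset within its page; every later entry is page aligned.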

5.2 SGL (Scatter‑Gather List)

SGL consists of segments and descriptors, allowing non‑contiguous memory regions to be described. Six descriptor types are defined, enabling flexible data placement.
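For illustration, an SGL descriptor is 16 bytes: an 8-byte address, a 4-byte length, three reserved bytes, and a final identifier byte whose upper four bits give the descriptor type. The sketch below packs Data Block and Bit Bucket descriptors under that layout; the helper names and constants are assumptions for this example, and the subtype is left at 0.

```python
import struct

SGL_DATA_BLOCK = 0x0    # describes a data buffer
SGL_BIT_BUCKET = 0x1    # discards data on reads
SGL_SEGMENT = 0x2       # points to the next SGL segment
SGL_LAST_SEGMENT = 0x3  # points to the final SGL segment

def sgl_descriptor(desc_type: int, address: int, length: int) -> bytes:
    """Pack one 16-byte SGL descriptor (little-endian, subtype 0)."""
    identifier = (desc_type & 0xF) << 4          # type in the upper nibble
    return struct.pack("<QI3xB", address, length, identifier)

def sgl_for_regions(regions) -> bytes:
    """Concatenate Data Block descriptors for (address, length) pairs."""
    return b"".join(
        sgl_descriptor(SGL_DATA_BLOCK, addr, size) for addr, size in regions
    )
```

Because each descriptor carries its own byte length, the regions do not need to be page sized or page aligned.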

5.3 PRP vs. SGL

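Both describe host memory buffers. PRP entries map whole memory pages (only the first entry may start at an offset within its page), while SGL descriptors can describe regions of arbitrary size and alignment, offering greater flexibility.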

6. NVMe Commands

Commands are 64 bytes with a common format. They are divided into Admin commands (controller management) and NVM (I/O) commands.
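As a sketch of the common 64-byte format, the example below packs a Read command: Command Dword 0 holds the opcode and command identifier, followed by the namespace ID, a metadata pointer, two PRP fields, and six command-specific dwords. The function name and parameters are hypothetical; the field positions reflect my reading of the NVMe specification, and only PRP-based transfers are modeled.

```python
import struct

OPC_READ = 0x02  # NVM command set opcode for Read

def build_read_command(cid: int, nsid: int, slba: int, nlb: int,
                       prp1: int, prp2: int = 0) -> bytes:
    """Pack a 64-byte submission queue entry for an NVM Read.

    nlb is the number of logical blocks to read (1-based here); the
    command itself stores it as a 0-based value in CDW12 bits 15:0.
    """
    if not 1 <= nlb <= 65536:
        raise ValueError("nlb must be between 1 and 65536")
    cdw0 = OPC_READ | ((cid & 0xFFFF) << 16)   # opcode 7:0, CID 31:16
    cdw10 = slba & 0xFFFFFFFF                  # starting LBA, low 32 bits
    cdw11 = (slba >> 32) & 0xFFFFFFFF          # starting LBA, high 32 bits
    cdw12 = (nlb - 1) & 0xFFFF                 # 0-based block count
    return struct.pack(
        "<IIQQQQ6I",
        cdw0, nsid,
        0,            # dwords 2-3: reserved
        0,            # dwords 4-5: metadata pointer (unused here)
        prp1, prp2,   # dwords 6-9: data pointer
        cdw10, cdw11, cdw12, 0, 0, 0,
    )
```

The host would copy the resulting bytes into the next free submission queue slot before ringing the SQ tail doorbell.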

6.1 Admin Commands

Examples include Create/Delete I/O SQ/CQ, Identify, Get/Set Features, Firmware Image Download and Firmware Activate (named Firmware Commit in later specification revisions), Asynchronous Event Request, and Abort.

6.2 NVM Commands

Examples include Flush, Write, Read, Write Uncorrectable, Compare, and Dataset Management.

7. Controller Operation

The controller processes commands by fetching them from the SQs via DMA, executing them, writing completion entries to the CQ, and notifying the host with an interrupt (typically MSI-X). The host writes the SQ tail doorbell to tell the controller that new commands are available, and writes the CQ head doorbell to acknowledge the completions it has consumed.
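On the host side, consuming completions can be sketched as below. Each completion entry is 16 bytes and carries a phase bit that the controller inverts on every pass through the queue, so the host can tell new entries from stale ones without a separate counter. The function names are hypothetical, and `make_cqe` exists only to build sample entries for this example.

```python
import struct

def make_cqe(sq_head, sq_id, cid, status, phase) -> bytes:
    """Build a sample 16-byte completion entry (illustration only)."""
    return struct.pack("<IIHHHH", 0, 0, sq_head, sq_id, cid,
                       (status << 1) | phase)

def parse_cqe(cqe: bytes) -> dict:
    """Split a completion entry into its main fields."""
    dw0, _rsvd, sq_head, sq_id, cid, sp = struct.unpack("<IIHHHH", cqe)
    return {"dw0": dw0, "sq_head": sq_head, "sq_id": sq_id,
            "cid": cid, "phase": sp & 1, "status": sp >> 1}

def reap(cq, head: int, phase: int):
    """Collect new completions starting at `head`.

    cq is a list of 16-byte entries. Returns (entries, new_head, new_phase);
    the host would then write new_head to the CQ head doorbell.
    """
    done = []
    while True:
        entry = parse_cqe(cq[head])
        if entry["phase"] != phase:    # stale entry: nothing new here
            break
        done.append(entry)
        head += 1
        if head == len(cq):            # wrapped: expect the opposite phase
            head = 0
            phase ^= 1
    return done, head, phase
```

A freshly zeroed queue has phase 0 everywhere, so the host starts out expecting phase 1.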

7.1 Reset and Shutdown

A controller reset deletes all I/O queues, aborts outstanding commands, and returns the controller to an idle state. Afterwards the host must reconfigure the controller registers, set CC.EN to enable the controller, wait for CSTS.RDY to be set, and recreate the I/O queues.
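The re-enable sequence can be sketched against a toy register model. `FakeController` is a hypothetical stand-in that simply mirrors CC.EN into CSTS.RDY; real hardware is accessed through memory-mapped I/O and may take up to the timeout advertised in CAP to become ready. The register offsets and field positions are taken from the NVMe specification as I understand it.

```python
CC, CSTS, AQA, ASQ, ACQ = 0x14, 0x1C, 0x24, 0x28, 0x30  # register offsets

class FakeController:
    """Toy register file: CSTS.RDY follows CC.EN immediately."""
    def __init__(self):
        self.regs = {CC: 0, CSTS: 0, AQA: 0, ASQ: 0, ACQ: 0}
    def read(self, off):
        return self.regs[off]
    def write(self, off, val):
        self.regs[off] = val
        if off == CC:
            self.regs[CSTS] = (self.regs[CSTS] & ~1) | (val & 1)

def wait_ready(ctrl, want, max_polls=1000):
    """Poll CSTS.RDY until it equals `want` (0 or 1)."""
    for _ in range(max_polls):
        if (ctrl.read(CSTS) & 1) == want:
            return
    raise TimeoutError("CSTS.RDY did not reach the expected value")

def enable_controller(ctrl, asq_addr, acq_addr, depth=32, mps=0):
    """Disable, program the admin queues, then enable the controller."""
    ctrl.write(CC, ctrl.read(CC) & ~1)                 # CC.EN = 0
    wait_ready(ctrl, 0)
    ctrl.write(AQA, ((depth - 1) << 16) | (depth - 1)) # ACQS / ASQS, 0-based
    ctrl.write(ASQ, asq_addr)
    ctrl.write(ACQ, acq_addr)
    cc = (4 << 20) | (6 << 16) | (mps << 7) | 1        # IOCQES, IOSQES, MPS, EN
    ctrl.write(CC, cc)
    wait_ready(ctrl, 1)
```

The entry-size fields are exponents: 6 means 64-byte submission entries and 4 means 16-byte completion entries. I/O queues would then be recreated with Create I/O CQ and Create I/O SQ admin commands.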

7.2 Interrupts

Four interrupt mechanisms are defined: pin-based, single-message MSI, multi-message MSI, and MSI-X (recommended). MSI-X allows each CQ to be assigned its own interrupt vector.

8. Features

Firmware update involves downloading the image, activating it, resetting the controller, and re‑initializing queues.

Metadata can be attached to data blocks for end‑to‑end protection, providing CRC, application tags, and reference tags.

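End-to-end data protection uses this metadata to detect, rather than correct, data corruption anywhere on the path between the host and the storage medium, for example errors introduced over PCIe or within the controller.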
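A minimal sketch of the 8-byte protection information tuple is shown below: a 16-bit guard (a CRC over the block data), a 16-bit application tag, and a 32-bit reference tag. It assumes the T10 DIF CRC-16 polynomial 0x8BB7 with a zero initial value and big-endian field order; the function names are hypothetical, and real controllers offer several protection types with different checking rules.

```python
import struct

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def make_pi(block: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Build the 8-byte protection information for one logical block."""
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, ref_tag)

def check_pi(block: bytes, pi: bytes, expected_ref_tag: int) -> bool:
    """Verify the guard and reference tag; the application tag is not checked."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(block) and ref_tag == expected_ref_tag
```

A failed check tells the host or controller that the block or its tags were altered in transit; recovering the data requires a retry or a redundant copy.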
