NVMe over PCIe: Specification Overview, Architecture, Register Configuration, Commands, and Data Structures
This article provides a comprehensive technical overview of the NVMe (Non‑Volatile Memory Express) specification, covering its logical device interface, namespace concepts, queue structures, PCIe register layout, command formats, controller initialization, interrupt handling, and data protection mechanisms.
1. Overview
NVMe (Non‑Volatile Memory Express) is a logical device interface specification similar in role to AHCI, defining a register‑level host interface for accessing non‑volatile storage media attached via PCI Express (PCIe).
1.1 Terminology
1.1.1 Namespace
A Namespace is a collection of a certain number of logical blocks (LBs); its attributes (size, capacity, LBA format) are defined in the Identify Namespace data structure.
1.1.2 Fused Operations
Fused Operations aggregate two adjacent commands in a submission queue, ensuring atomicity of the combined read/write actions (see the Compare and Write example).
1.1.3 Command Execution Order
Except for fused operations, each command in a submission queue (SQ) is independent and may be executed by the controller in any order; the host must therefore enforce any read‑after‑write (RAW) dependencies itself.
1.1.4 Write Atomicity
The controller guarantees write atomicity up to a reported unit size; via the Write Atomicity feature the host can indicate that it requires a smaller atomic unit, allowing the controller to optimize for performance.
1.1.5 Metadata
Metadata provides additional verification information for data blocks and is optional.
1.1.6 Arbitration Mechanism
Arbitration selects the next SQ to execute commands. Three arbitration methods are defined:
RR (Round‑Robin) – equal priority for all queues.
Weighted RR – queues are assigned priority classes (urgent plus weighted high/medium/low).
Custom implementation.
1.1.7 Logical Block (LB)
NVMe defines the smallest read/write unit, the logical block (e.g., 512 B, 4 KB). An LBA identifies a block address, and an LBA range represents a logically contiguous set of logical blocks.
1.1.8 Queue Pair
A Queue Pair consists of a Submission Queue (SQ) and a Completion Queue (CQ); the host submits commands via SQ, and the NVMe controller reports completions via CQ.
1.1.9 NVM Subsystem
The NVM subsystem includes the controller, NVM storage media, and the interface between them.
1.2 NVMe SSD
1.2.1 Basic Architecture
An NVMe SSD consists of three parts: the host‑side driver (integrated in Linux, Windows, etc.), the PCIe + NVMe controller, and the storage media (FTL + NAND flash).
1.2.2 NVMe Controller
The NVMe controller implements DMA and multiple queues, using DMA for data movement (commands and user data) and multi‑queue architecture to exploit flash parallelism.
2. PCIe Register Configuration
2.1 Basic PCIe Bus Structure
PCIe has three layers: physical, data link, and transaction (similar to OSI layering). NVMe operates above the transaction layer, abstracting the underlying PCIe.
2.2 Register Layout
The specification defines three main parts: PCI header, PCI Capabilities, and PCI Express Extended Capabilities. The PCIe configuration space occupies 4 KB per function.
2.2.1 PCI Header
Two header types exist: type 0 for devices (NVMe controllers are EP devices) and type 1 for bridges. The header occupies the first 64 bytes of configuration space.
2.2.2 PCI Capabilities
Includes power management, interrupt management (MSI, MSI‑X), and PCIe capabilities.
2.2.3 PCI Express Extended Capabilities
Provides advanced features such as error recovery.
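The type‑0 configuration header described in 2.2.1 can be sketched as a map of byte offsets. This is a minimal sketch for orientation; the offsets come from the PCI specification, and the dict and constant names are illustrative:

```python
# Byte offsets of common fields in the PCI type-0 configuration header.
PCI_TYPE0_HEADER = {
    "vendor_id":   0x00,  # 16-bit vendor identifier
    "device_id":   0x02,  # 16-bit device identifier
    "command":     0x04,  # bus-master / memory-space enable bits
    "status":      0x06,
    "class_code":  0x09,  # NVMe reports base class 01h (mass storage)
    "header_type": 0x0E,  # 0 = endpoint device, 1 = bridge
    "bar0":        0x10,  # low dword of the 64-bit memory BAR holding NVMe registers
    "bar1":        0x14,  # high dword of that BAR
    "cap_pointer": 0x34,  # start of the PCI Capabilities linked list
}

HEADER_SIZE = 0x40          # the type-0 header occupies the first 64 bytes
CONFIG_SPACE_SIZE = 0x1000  # the full PCIe config space is 4 KB per function
```

The `cap_pointer` field is what links the header to the Capabilities list (2.2.2), which in turn chains to the Extended Capabilities region.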
3. NVMe Register Definition
3.1 Register Definitions
NVMe registers are divided into controller‑wide attributes and per‑queue head/tail doorbell registers.
CAP – controller capabilities (page size, I/O command set, arbitration, queue continuity, etc.).
VS – version number of the NVMe specification implemented.
INTMS – interrupt mask set (ignored when MSI‑X is used).
INTMC – interrupt mask clear (ignored when MSI‑X is used).
CC – controller configuration (queue element size, shutdown notification, arbitration, page size, enable flag).
CSTS – controller status (shutdown state, fatal error, ready).
AQA – admin queue attributes (SQ and CQ sizes).
ASQ – admin SQ base address.
ACQ – admin CQ base address.
Registers from offset 0x1000 upward are the per‑queue submission tail and completion head doorbells.
3.2 Register Interpretation
CAP shows the full set of capabilities; CC selects a subset for operation.
Setting CC.EN to 1 enables command processing; clearing it resets the controller.
CC.EN and CSTS.RDY are tightly coupled; CSTS.RDY follows CC.EN, becoming 1 after the controller is enabled and 0 after it is disabled.
Admin queues are created directly by the host via AQA, ASQ, and ACQ registers.
I/O queues are created by issuing admin commands (e.g., Create I/O CQ).
Doorbell registers are 16‑bit, limiting queue depth to 64 K.
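The register offsets above, and the doorbell layout from offset 0x1000, can be sketched as follows. The offsets and the doorbell formula (using the CAP.DSTRD stride exponent) are from the NVMe specification; the dict and function names are illustrative:

```python
# BAR0 offsets of the controller registers described in section 3.1.
NVME_REGS = {
    "CAP":   0x00,  # controller capabilities (64-bit)
    "VS":    0x08,  # specification version
    "INTMS": 0x0C,  # interrupt mask set
    "INTMC": 0x10,  # interrupt mask clear
    "CC":    0x14,  # controller configuration
    "CSTS":  0x1C,  # controller status
    "AQA":   0x24,  # admin queue attributes (SQ/CQ sizes)
    "ASQ":   0x28,  # admin SQ base address (64-bit)
    "ACQ":   0x30,  # admin CQ base address (64-bit)
}

def sq_tail_doorbell(qid: int, dstrd: int = 0) -> int:
    """BAR0 offset of the SQ tail doorbell for queue `qid`.
    `dstrd` is CAP.DSTRD, the doorbell stride exponent."""
    return 0x1000 + (2 * qid) * (4 << dstrd)

def cq_head_doorbell(qid: int, dstrd: int = 0) -> int:
    """BAR0 offset of the CQ head doorbell for queue `qid`."""
    return 0x1000 + (2 * qid + 1) * (4 << dstrd)
```

With the default stride (DSTRD = 0), the admin SQ tail doorbell sits at 0x1000, the admin CQ head doorbell at 0x1004, and each further queue pair adds 8 bytes.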
4. Memory Data Structures
4.1 SQ and CQ Details
4.1.1 Empty Queue
(Illustration omitted)
4.1.2 Full Queue
A queue is considered full when the head pointer is one position ahead of the tail, leaving one element unused.
4.1.3 Queue Properties
The queue‑size field is 16 bits wide; the minimum depth is 2 entries. I/O queues can hold up to 64 K entries, admin queues up to 4 K. The QID is a 16‑bit identifier allocated by the host.
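The empty/full conditions above reduce to simple modular arithmetic on the head and tail indices. A minimal sketch (function names are illustrative):

```python
def queue_is_empty(head: int, tail: int) -> bool:
    # SQ/CQ are circular buffers: equal pointers mean no pending entries.
    return head == tail

def queue_is_full(head: int, tail: int, size: int) -> bool:
    # Full when advancing the tail would collide with the head,
    # so one slot always stays unused.
    return (tail + 1) % size == head

def entries_pending(head: int, tail: int, size: int) -> int:
    # Number of submitted-but-not-yet-consumed entries.
    return (tail - head) % size
```

Because one slot is sacrificed to distinguish full from empty, a queue of size N holds at most N − 1 commands.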
4.2 Arbitration Mechanism
4.2.1 RR
Round‑Robin arbitration treats admin and I/O SQs with equal priority; the controller may select multiple commands per arbitration burst.
4.2.2 Weighted RR
Three strict priority classes: Admin (highest), Urgent, then the weighted group, whose High/Medium/Low queues share bandwidth according to configured weights. Higher classes are serviced first, and arbitration is non‑preemptive.
4.2.3 Vendor‑Specific
Custom arbitration defined by the vendor.
4.3 Data Addressing (PRP and SGL)
4.3.1 PRP
A PRP (Physical Region Page) entry is a 64‑bit physical address. Host memory is divided into pages whose size is configured in the CC register; PRP1 points at the first data page (it may carry an offset), and PRP2 either points at the second page or, when the transfer spans more than two pages, at a PRP List.
A PRP List consists of 8‑byte entries, one per additional page of the data buffer.
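The rule above (PRP1 and PRP2 cover at most two pages directly; beyond that PRP2 must point at a PRP List) can be sketched as a small calculation. The 4 KB page size and the function names are assumptions for illustration:

```python
PAGE_SIZE = 4096  # assumed CC.MPS page size for this sketch

def prp_entries_needed(buf_addr: int, length: int, page_size: int = PAGE_SIZE) -> int:
    """Number of PRP entries needed to describe a host buffer.
    Only the first entry may carry a page offset; every later entry is
    page-aligned, so the count is the number of pages the buffer touches."""
    offset_in_first_page = buf_addr % page_size
    return (offset_in_first_page + length + page_size - 1) // page_size

def needs_prp_list(buf_addr: int, length: int, page_size: int = PAGE_SIZE) -> bool:
    # Up to two pages fit in PRP1/PRP2 directly; more requires a PRP List.
    return prp_entries_needed(buf_addr, length, page_size) > 2
```

For example, an aligned 8 KB transfer needs two entries (PRP1 and PRP2 suffice), while an aligned 16 KB transfer needs four, so PRP2 must point at a list.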
4.3.2 SGL
An SGL (Scatter‑Gather List) consists of segments, each made up of descriptors. Six descriptor types are defined: Data Block, Bit Bucket, Segment, Last Segment, Keyed Data Block, and Transport SGL Data Block.
4.3.3 PRP vs. SGL
Both describe data buffers in host memory; PRP entries must be page‑granular (only the first may carry an offset), while SGL descriptors can cover buffers of arbitrary length and alignment, offering greater flexibility.
5. NVMe Commands
5.0 Command Execution Process
The host writes commands into an SQ, updates the tail doorbell, the controller DMA‑fetches the command, executes it, writes a completion entry into the CQ, signals an interrupt (typically MSI‑X), and the host processes the completion and updates the head doorbell.
5.1 Command Classification
Commands are divided into Admin commands (manage the controller) and NVM (I/O) commands (transfer data).
5.2 Generic Command Format
All commands are 64 bytes with a common layout; fields vary by command type.
Dword 0 – opcode, fused‑operation flag, PSDT (PRP/SGL selection), and Command Identifier (CID)
Dword 1 – Namespace ID (NSID)
Dwords 2–3 – Reserved
Dwords 4–5 – Metadata Pointer (MPTR)
Dwords 6–9 – Data Pointer (DPTR: PRP1/PRP2 or an SGL descriptor)
Dwords 10–15 – Command‑specific fields
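The 64‑byte layout above can be expressed as a packed structure. A minimal sketch using `ctypes`; the class and field names are illustrative, the dword positions follow the table:

```python
import ctypes

class NvmeCommand(ctypes.Structure):
    """The common 64-byte NVMe command layout, one field per region above."""
    _fields_ = [
        ("cdw0",  ctypes.c_uint32),      # opcode, fused flag, PSDT, CID
        ("nsid",  ctypes.c_uint32),      # namespace ID
        ("cdw2",  ctypes.c_uint32),      # reserved
        ("cdw3",  ctypes.c_uint32),      # reserved
        ("mptr",  ctypes.c_uint64),      # metadata pointer (dwords 4-5)
        ("prp1",  ctypes.c_uint64),      # data pointer, first entry (dwords 6-7)
        ("prp2",  ctypes.c_uint64),      # second entry or PRP list (dwords 8-9)
        ("cdw10", ctypes.c_uint32 * 6),  # command-specific dwords 10-15
    ]

assert ctypes.sizeof(NvmeCommand) == 64  # matches the SQ entry size
```

The same structure serves both admin and I/O commands; only the interpretation of the command‑specific dwords changes with the opcode.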
5.3 Admin Commands
00h – Delete I/O SQ: free the queue's resources
01h – Create I/O SQ: create an SQ with host‑provided base address, priority, and size
02h – Get Log Page: retrieve the selected log page into a host buffer
04h – Delete I/O CQ: free the queue's resources
05h – Create I/O CQ: create a CQ with host‑provided base address, interrupt vector, and size
06h – Identify: return a 4 KB controller or namespace data structure
08h – Abort: attempt to abort a previously issued command
09h – Set Features: configure a feature identified by FID
0Ah – Get Features: query a feature identified by FID
0Ch – Asynchronous Event Request: let the controller report error or health events
10h – Firmware Activate: validate and activate a firmware image in a slot
11h – Firmware Image Download: transfer a firmware image to the controller
5.4 NVM (I/O) Commands
00h – Flush: commit data and metadata to NVM
01h – Write: write data and metadata to NVM
02h – Read: read data and metadata from NVM
04h – Write Uncorrectable: mark a range of logical blocks as uncorrectable
05h – Compare: compare a host buffer with data read from NVM
09h – Dataset Management: tag a range of LBAs with usage hints (e.g., frequently read or written)
6. Controller Structure and Operation
The controller function is logically divided into I/O, Admin, and Discovery roles; the Admin portion (a single instance) manages the controller as a whole and the other roles.
6.1 Command Execution Flow
Host writes one or more commands into a pre‑allocated SQ.
Host updates the SQ tail doorbell.
NVMe controller fetches commands via DMA (checking head/tail doorbells).
Controller executes the command.
Controller writes a completion entry into the CQ via DMA.
Controller signals the host via an interrupt (MSI‑X recommended).
Host processes the completion and updates the CQ head doorbell.
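The steps above can be modeled from the host's point of view as a toy in‑memory queue pair. This is purely illustrative (no real hardware or MMIO; the class and attribute names are assumptions), but it shows which side owns which pointer:

```python
class ToyQueuePair:
    """Host-side model of one SQ/CQ pair; 'doorbells' are plain attributes."""
    def __init__(self, size: int):
        self.size = size
        self.sq = [None] * size
        self.sq_tail = 0            # host-owned: next free SQ slot
        self.sq_tail_doorbell = 0   # last value written to the tail doorbell
        self.cq_head_doorbell = 0   # last value written to the head doorbell

    def submit(self, command: dict):
        # Steps 1-2: write the command at the tail, then ring the doorbell
        # so the controller knows new entries are available to fetch.
        self.sq[self.sq_tail] = command
        self.sq_tail = (self.sq_tail + 1) % self.size
        self.sq_tail_doorbell = self.sq_tail

    def complete(self, n: int):
        # Step 7: after consuming n completion entries, release the CQ
        # slots back to the controller by advancing the head doorbell.
        self.cq_head_doorbell = (self.cq_head_doorbell + n) % self.size

qp = ToyQueuePair(size=8)
qp.submit({"opcode": 0x02})  # e.g. a Read command
qp.submit({"opcode": 0x01})  # e.g. a Write command
qp.complete(2)               # host consumed two completion entries
```

Steps 3 through 6 (fetch, execute, completion write, interrupt) happen on the controller side and are not modeled here.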
6.2 Reset
6.2.1 Controller‑Level Reset
All I/O SQs and CQs are deleted.
Pending commands are aborted.
Controller enters idle; CSTS.RDY is cleared.
AQA, ASQ, ACQ remain unchanged.
After the reset, the host sets CC.EN = 1 again (the admin queue registers are preserved), waits for CSTS.RDY = 1, recreates the I/O queues via admin commands, and resumes normal I/O.
6.2.2 Queue‑Level Reset
Deleting a specific queue requires the host to ensure the queue is idle (all commands completed) before removal.
6.3 Interrupts
NVMe supports four interrupt modes: pin‑based, single‑message MSI, multi‑message MSI, and MSI‑X. MSI‑X is recommended because it allows each CQ to have its own interrupt vector (up to 2048 vectors).
6.4 Controller Initialization
Configure PCI and PCIe registers.
Wait for CSTS.RDY to clear to 0 (controller idle, CC.EN = 0).
Program AQA, ASQ, ACQ.
Program CC.
Set CC.EN = 1.
Wait for CSTS.RDY = 1.
Issue Identify to discover controller and namespace structures.
Use Set Features (Number of Queues) to negotiate the I/O queue count, and configure interrupts.
Allocate and create I/O SQs and CQs.
Optionally submit an Asynchronous Event Request for health monitoring.
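The register handshake in steps 2 through 6 can be sketched against a mock register file. This is a toy model, not a driver: the dict keys, the mock reaction function, and the addresses are illustrative, while the CC encoding (IOSQES = 6 and IOCQES = 4, i.e. 64‑byte SQ and 16‑byte CQ entries) and the 0‑based AQA depths follow the spec:

```python
# Mock register file keyed by name; a real driver performs MMIO accesses.
regs = {"CSTS_RDY": 0, "CC_EN": 0, "AQA": 0, "ASQ": 0, "ACQ": 0, "CC": 0}

def mock_controller_reacts():
    # Hardware side of the handshake: CSTS.RDY follows CC.EN.
    regs["CSTS_RDY"] = regs["CC_EN"]

def init_controller(asq_addr: int, acq_addr: int, depth: int):
    assert regs["CSTS_RDY"] == 0                      # step 2: controller idle
    regs["AQA"] = ((depth - 1) << 16) | (depth - 1)   # step 3: ACQS/ASQS, 0-based
    regs["ASQ"] = asq_addr                            # step 3: admin SQ base
    regs["ACQ"] = acq_addr                            # step 3: admin CQ base
    regs["CC"] = (6 << 16) | (4 << 20)                # step 4: IOSQES=6, IOCQES=4
    regs["CC_EN"] = 1                                 # step 5: enable
    mock_controller_reacts()
    assert regs["CSTS_RDY"] == 1                      # step 6: controller ready
    # Steps 7-10 (Identify, Set Features, I/O queue creation) then proceed
    # as admin commands through the queues just configured.

init_controller(asq_addr=0x100000, acq_addr=0x101000, depth=32)
```

Note that the admin queue registers must be programmed before CC.EN is set; the controller latches them when it transitions to ready.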
6.5 Shutdown
Normal shutdown: stop submitting new I/O, delete all I/O SQs (causing pending commands to be aborted), delete all I/O CQs, set CC.SHN = 01b, and wait for CSTS.SHST = 10b. Abrupt shutdown: set CC.SHN = 10b and wait for CSTS.SHST = 10b.
7. NVMe Features
7.1 Firmware Update Process
Download firmware image using the Firmware Image Download command.
Issue Firmware Activate command (or activate a previously downloaded image).
Controller reset.
Host re‑initializes the controller and recreates I/O queues.
7.2 Metadata Transfer
Metadata can be attached to a logical block as part of the data payload or transferred as a separate logical block. It is typically used for end‑to‑end data protection.
7.3 End‑to‑End Data Protection
Data protection appends a protection information field to each logical block: a 16‑bit Guard (a CRC of the data), a 16‑bit Application Tag, and a 32‑bit Reference Tag linking the data to its LBA. Three protection types (Type 1, 2, 3) differ in how the Reference Tag is checked, and the PRACT bit controls whether the controller inserts and strips the protection information or passes it through.
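The Guard field uses the T10 DIF CRC‑16 (polynomial 0x8BB7, zero initial value, no bit reflection). A minimal bitwise sketch (the function name and the 512‑byte example block are illustrative):

```python
def crc16_t10dif(data: bytes) -> int:
    """CRC-16 with polynomial 0x8BB7, init 0, no reflection: the T10 DIF
    CRC used for the 16-bit Guard field."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

block = b"example payload" * 32  # stand-in for a logical block's data
guard = crc16_t10dif(block)
# With init 0 and no final XOR, appending the big-endian guard drives the
# running CRC to zero, which is how data + guard can be verified together.
assert crc16_t10dif(block + guard.to_bytes(2, "big")) == 0
```

Real controllers use table‑driven or hardware CRC engines; the bitwise loop above just makes the polynomial arithmetic explicit.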
Original source: https://zhuanlan.zhihu.com/p/347599423 (author: Fappy)