
NVMe over PCIe: Specification Overview, Architecture, Register Configuration, Commands, and Data Structures

This article provides a comprehensive technical overview of the NVMe (Non‑Volatile Memory Express) specification, covering its logical device interface, namespace concepts, queue structures, PCIe register layout, command formats, controller initialization, interrupt handling, and data protection mechanisms.

Architects' Tech Alliance

1. Overview

NVMe (Non‑Volatile Memory Express) is a logical device interface specification similar to AHCI, defining a bus‑level protocol for accessing non‑volatile storage media attached via PCI Express (PCIe).

1.1 Terminology

1.1.1 Namespace

A Namespace is a collection of a certain number of logical blocks (LBs) whose attributes are defined in the Identify Namespace data structure.

1.1.2 Fused Operations

Fused Operations aggregate two adjacent commands in a submission queue, ensuring atomicity of the combined read/write actions (see the Compare and Write example).

1.1.3 Command Execution Order

Except for fused operations, each command in a submission queue (SQ) is independent; the controller may execute commands in any order, so the host must not assume read‑after‑write (RAW) ordering between outstanding commands.

1.1.4 Write Atomicity

The controller must support write‑unit atomicity; hosts can configure the Write Atomicity feature to reduce the atomic unit size for better performance.

1.1.5 Metadata

Metadata provides additional verification information for data blocks and is optional.

1.1.6 Arbitration Mechanism

Arbitration selects the next SQ to execute commands. Three arbitration methods are defined:

RR (Round‑Robin) – equal priority for all queues.

Weighted RR with urgent priority class – queues are grouped into priority classes (Admin, Urgent, and weighted High/Medium/Low).

Custom implementation.

1.1.7 Logical Block (LB)

NVMe defines the smallest read/write unit, the logical block (commonly 512 B or 4 KB). An LBA identifies a block address, and an LBA range represents a contiguous set of logical blocks.

1.1.8 Queue Pair

A Queue Pair consists of a Submission Queue (SQ) and a Completion Queue (CQ); the host submits commands via SQ, and the NVMe controller reports completions via CQ.

1.1.9 NVM Subsystem

The NVM subsystem includes the controller, NVM storage media, and the interface between them.

1.2 NVMe SSD

1.2.1 Basic Architecture

An NVMe SSD consists of three parts: the host‑side driver (integrated in Linux, Windows, etc.), the PCIe + NVMe controller, and the storage media (FTL + NAND flash).

1.2.2 NVMe Controller

The NVMe controller implements DMA and multiple queues, using DMA for data movement (commands and user data) and multi‑queue architecture to exploit flash parallelism.

2. PCIe Register Configuration

2.1 Basic PCIe Bus Structure

PCIe has three layers: physical, data link, and transaction (conceptually similar to the lower OSI layers). NVMe sits above the transaction layer, abstracting the underlying PCIe transport.

2.2 Register Layout

The specification defines three main parts: the PCI header, PCI Capabilities, and PCI Express Extended Capabilities. Together they occupy the 4 KB PCIe configuration space of the function.

2.2.1 PCI Header

Two header types exist: type 0 for endpoint devices (an NVMe controller is a PCIe endpoint) and type 1 for bridges. The header is 64 bytes.

2.2.2 PCI Capabilities

Includes power management, interrupt management (MSI, MSI‑X), and PCIe capabilities.

2.2.3 PCI Express Extended Capabilities

Provides advanced features such as error recovery.

3. NVMe Register Definition

3.1 Register Definitions

NVMe registers are divided into controller‑wide attributes and per‑queue head/tail doorbell registers.

CAP – controller capabilities (page size, I/O command set, arbitration, queue continuity, etc.).

VS – version number of the NVMe specification implemented.

INTMS – interrupt mask set (not used when MSI‑X is enabled).

INTMC – interrupt mask clear (not used when MSI‑X is enabled).

CC – controller configuration (queue element size, shutdown notification, arbitration, page size, enable flag).

CSTS – controller status (shutdown state, fatal error, ready).

AQA – admin queue attributes (SQ and CQ sizes).

ASQ – admin SQ base address.

ACQ – admin CQ base address.

Registers at offset 0x1000 and above are the per‑queue submission queue tail and completion queue head doorbells.

3.2 Register Interpretation

CAP shows the full set of capabilities; CC selects a subset for operation.

Setting CC.EN to 1 enables command processing; clearing it resets the controller.

CC.EN and CSTS.RDY are tightly coupled: after the host sets CC.EN to 1, the controller sets CSTS.RDY to 1 once it is ready to accept commands; after CC.EN is cleared, CSTS.RDY returns to 0.

Admin queues are created directly by the host via AQA, ASQ, and ACQ registers.

I/O queues are created by issuing admin commands (e.g., Create I/O CQ).

Doorbell registers are 16‑bit, limiting queue depth to 64 K.
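The doorbell offsets implied by this layout can be computed from CAP.DSTRD, the doorbell stride. A minimal sketch in C, following the stride formula in the base specification (function names are illustrative):

```c
#include <stdint.h>

/* Doorbells start at BAR0 offset 0x1000; CAP.DSTRD encodes the spacing
 * between consecutive doorbell registers as (4 << DSTRD) bytes. */
static inline uint32_t sq_tail_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    return 0x1000u + (2u * qid) * (4u << dstrd);
}

static inline uint32_t cq_head_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    return 0x1000u + (2u * qid + 1u) * (4u << dstrd);
}
```

With DSTRD = 0, queue 0's SQ tail doorbell sits at 0x1000 and its CQ head doorbell at 0x1004; a larger stride spreads doorbells across cache lines.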

4. Memory Data Structures

4.1 SQ and CQ Details

4.1.1 Empty Queue

(Illustration omitted)

4.1.2 Full Queue

A queue is considered full when incrementing the tail would make it equal the head; one slot therefore always remains unused.

4.1.3 Queue Properties

The queue size field is 16 bits wide; the minimum queue size is 2 entries. I/O queues can hold up to 64 K entries and admin queues up to 4 K. Each queue is identified by a 16‑bit QID allocated by the host.
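The head/tail arithmetic behind the full and empty conditions can be sketched as follows (function names are illustrative, not from any driver):

```c
#include <stdbool.h>
#include <stdint.h>

/* A queue with head == tail is empty. */
static inline bool queue_empty(uint16_t head, uint16_t tail, uint16_t size)
{
    (void)size;
    return head == tail;
}

/* A queue is full when advancing the tail (modulo queue size) would
 * make it equal the head, so one slot always stays unused. */
static inline bool queue_full(uint16_t head, uint16_t tail, uint16_t size)
{
    return (uint16_t)((tail + 1u) % size) == head;
}
```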

4.2 Arbitration Mechanism

4.2.1 RR

Round‑Robin arbitration treats admin and I/O SQs with equal priority; the controller may select multiple commands per arbitration burst.

4.2.2 Weighted RR

Queues are serviced in strict class order (Admin, then Urgent), and the remaining queues are arbitrated by weighted round‑robin across the High, Medium, and Low priority levels (non‑preemptive).

4.2.3 Vendor‑Specific

Custom arbitration defined by the vendor.

4.3 Data Addressing (PRP and SGL)

4.3.1 PRP

PRP (Physical Region Page) is a 64‑bit physical address. The host memory is divided into pages; the page size is configured in the CC register. PRP can point directly to a page or to a PRP List.

PRP List entries are 8 bytes each and point to additional pages when the data span exceeds one page.
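The page-splitting rule can be illustrated with a small helper that counts how many PRP entries a transfer needs, assuming a 4 KiB page size configured via CC.MPS (the helper name is hypothetical):

```c
#include <stdint.h>

#define NVME_PAGE_SHIFT 12u               /* assuming CC.MPS selects 4 KiB */
#define NVME_PAGE_SIZE  (1u << NVME_PAGE_SHIFT)

/* The first PRP entry may point anywhere inside a page; every later
 * entry must be page-aligned and covers one full page. */
static inline uint32_t prp_entries_needed(uint64_t addr, uint32_t len)
{
    uint32_t first = NVME_PAGE_SIZE - (uint32_t)(addr & (NVME_PAGE_SIZE - 1u));
    if (len <= first)
        return 1;
    return 1 + (len - first + NVME_PAGE_SIZE - 1u) / NVME_PAGE_SIZE;
}
```

When two entries suffice, they go straight into PRP1/PRP2 of the command; for more, PRP2 points at a PRP List in host memory.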

4.3.2 SGL

SGL (Scatter‑Gather List) consists of segments, each made up of descriptors. Six descriptor types are defined: Data Block, Bit Bucket, Segment, Last Segment, Keyed Data Block, and Transport Data Block.

4.3.3 PRP vs. SGL

Both describe memory regions; PRP maps to physical pages, while SGL can describe arbitrary contiguous or non‑contiguous memory regions, offering greater flexibility.

5. NVMe Commands

5.0 Command Execution Process

The host writes commands into an SQ, updates the tail doorbell, the controller DMA‑fetches the command, executes it, writes a completion entry into the CQ, signals an interrupt (typically MSI‑X), and the host processes the completion and updates the head doorbell.

5.1 Command Classification

Commands are divided into Admin commands (manage the controller) and NVM (I/O) commands (transfer data).

5.2 Generic Command Format

All commands are 64 bytes with a common layout; fields vary by command type.

Dword(s)   Contents
0          Opcode, fused operation, PRP/SGL selection, Command Identifier (CID)
1          Namespace ID (NSID)
2-3        Reserved
4-5        Metadata Pointer (MPTR)
6-9        Data Pointer (DPTR)
10-15      Command-specific fields
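The 64‑byte layout can be expressed as a C struct. This is an illustrative sketch of the generic format using PRP data pointers, not a definition copied from any particular driver:

```c
#include <stddef.h>
#include <stdint.h>

/* 64-byte submission queue entry, matching the generic layout above. */
struct nvme_sqe {
    uint8_t  opcode;     /* Dword 0 [7:0]   */
    uint8_t  flags;      /* Dword 0 [15:8]: fused op, PRP/SGL select */
    uint16_t cid;        /* Dword 0 [31:16]: command identifier */
    uint32_t nsid;       /* Dword 1: namespace ID */
    uint32_t rsvd[2];    /* Dwords 2-3: reserved */
    uint64_t mptr;       /* Dwords 4-5: metadata pointer */
    uint64_t prp1;       /* Dwords 6-7: data pointer, PRP entry 1 */
    uint64_t prp2;       /* Dwords 8-9: data pointer, PRP entry 2 */
    uint32_t cdw10[6];   /* Dwords 10-15: command-specific fields */
};
_Static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");
```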

5.3 Admin Commands

Opcode

Command

Purpose

00h

Delete I/O SQ

Free SQ resources

01h

Create I/O SQ

Allocate host‑provided SQ address, priority, size

02h

Get Log Page

Retrieve selected log page into buffer

04h

Delete I/O CQ

Free CQ resources

05h

Create I/O CQ

Allocate host‑provided CQ address, interrupt vector, size

06h

Identify

Return controller and namespace capability structures (2 KB)

08h

Abort

Attempt to abort a previously issued command

09h

Set Features

Configure a feature identified by FID

0Ah

Get Features

Query a feature identified by FID

0Ch

Async Event Request

Controller reports error or health events

10h

Firmware Activate

Validate and activate a firmware image in a slot

11h

Firmware Image Download

Transfer a firmware image to the controller

5.4 NVM (I/O) Commands

Opcode

Command

Purpose

00h

Flush

Commit data and metadata to NVM

01h

Write

Write data and metadata to NVM

02h

Read

Read data and metadata from NVM

04h

Write Uncorrectable

Mark a data block as uncorrectable

05h

Compare

Compare host buffer with data read from NVM

09h

Dataset Management

Tag a range of data with usage hints (e.g., frequent read/write)

6. Controller Structure and Operation

Controllers are logically divided into I/O, Administrative, and Discovery controllers; an Administrative controller manages the controller and subsystem rather than carrying user I/O.

6.1 Command Execution Flow

1. The host writes one or more commands into a pre‑allocated SQ.

2. The host updates the SQ tail doorbell.

3. The NVMe controller fetches the commands via DMA (comparing head and tail doorbells).

4. The controller executes the command.

5. The controller writes a completion entry into the CQ via DMA.

6. The controller signals the host via an interrupt (MSI‑X recommended).

7. The host processes the completion and updates the CQ head doorbell.
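The first two steps of this flow can be sketched in C. The queue structure and function names are hypothetical, not from a real driver:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical per-queue state for the submission path. */
struct nvme_queue {
    void              *sq_base;     /* virtual address of the SQ ring   */
    volatile uint32_t *sq_tail_db;  /* mapped SQ tail doorbell register */
    uint16_t           sq_tail;
    uint16_t           size;
};

/* Steps 1-2: copy the 64-byte command into the next SQ slot, advance
 * the tail, then write the new tail to the doorbell so the controller
 * fetches the command. On real hardware a write barrier belongs
 * between the memcpy and the doorbell write. */
static void nvme_submit(struct nvme_queue *q, const void *cmd64)
{
    memcpy((uint8_t *)q->sq_base + (uint32_t)q->sq_tail * 64u, cmd64, 64);
    q->sq_tail = (uint16_t)((q->sq_tail + 1u) % q->size);
    *q->sq_tail_db = q->sq_tail;
}
```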

6.2 Reset

6.2.1 Controller‑Level Reset

All I/O SQs and CQs are deleted.

Pending commands are aborted.

Controller enters idle; CSTS.RDY is cleared.

AQA, ASQ, ACQ remain unchanged.

After reset, the host re‑enables the controller (CC.EN = 1), waits for CSTS.RDY, configures admin queues, creates I/O queues, and resumes normal I/O.

6.2.2 Queue‑Level Reset

Deleting a specific queue requires the host to ensure the queue is idle (all commands completed) before removal.

6.3 Interrupts

NVMe supports four interrupt types: pin‑based, single MSI, MSI‑multiple, and MSI‑X. MSI‑X is recommended because it allows each CQ to have its own interrupt vector (up to 2 K vectors).
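The per‑CQ vector assignment happens in the Create I/O CQ admin command. As a sketch, its command dwords can be packed like this (helper names are illustrative; field positions follow the base specification):

```c
#include <stdint.h>

/* CDW10 of Create I/O CQ: queue ID in [15:0], queue size (0-based,
 * i.e. entries minus one) in [31:16]. */
static inline uint32_t create_iocq_cdw10(uint16_t qid, uint16_t qsize)
{
    return (uint32_t)(qsize - 1u) << 16 | qid;
}

/* CDW11: MSI-X interrupt vector (IV) in [31:16], interrupt-enable
 * (IEN) in bit 1, physically-contiguous (PC) in bit 0. */
static inline uint32_t create_iocq_cdw11(uint16_t iv, int ien, int pc)
{
    return (uint32_t)iv << 16 | (ien ? 2u : 0u) | (pc ? 1u : 0u);
}
```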

6.4 Controller Initialization

1. Configure PCI and PCIe registers.

2. Wait for CSTS.RDY to read 0 (the controller is idle after reset).

3. Program AQA, ASQ, and ACQ to set up the admin queues.

4. Program CC (page size, arbitration, queue entry sizes).

5. Set CC.EN = 1.

6. Wait for CSTS.RDY = 1.

7. Issue Identify to discover controller and namespace structures.

8. Use Get/Set Features to learn I/O queue limits and configure interrupts.

9. Allocate and create I/O CQs and SQs.

10. Optionally submit an Asynchronous Event Request for health monitoring.
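Step 4 above amounts to packing a few fields into one 32‑bit register. A hedged sketch that builds a CC value for 64‑byte SQ entries and 16‑byte CQ entries, using the CC bit positions from the base specification (the helper name is hypothetical):

```c
#include <stdint.h>

/* Build a CC register value: IOSQES [19:16] = log2(64) = 6,
 * IOCQES [23:20] = log2(16) = 4, MPS [10:7] selects a page of
 * 2^(12 + MPS) bytes, AMS left at 0 (round-robin), EN in bit 0. */
static inline uint32_t nvme_cc_value(uint8_t mps, int enable)
{
    uint32_t cc = 0;
    cc |= (uint32_t)6 << 16;          /* IOSQES: 64-byte SQ entries */
    cc |= (uint32_t)4 << 20;          /* IOCQES: 16-byte CQ entries */
    cc |= (uint32_t)(mps & 0xFu) << 7; /* MPS: memory page size     */
    cc |= enable ? 1u : 0u;           /* EN: enable the controller  */
    return cc;
}
```

With MPS = 0 (4 KiB pages) and enable set, this yields 0x00460001, which the host writes to CC before polling CSTS.RDY.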

6.5 Shutdown

Normal shutdown: stop submitting new I/O, delete all I/O SQs (causing pending commands to be aborted), delete all I/O CQs, set CC.SHN = 01b, and wait for CSTS.SHST = 10b. Abrupt shutdown: set CC.SHN = 10b and wait for CSTS.SHST = 10b.

7. NVMe Features

7.1 Firmware Update Process

1. Download the firmware image with the Firmware Image Download command.

2. Issue the Firmware Activate command (or activate a previously downloaded image in a slot).

3. Reset the controller.

4. The host re‑initializes the controller and recreates the I/O queues.

7.2 Metadata Transfer

Metadata can be attached to a logical block as part of the data payload or transferred as a separate logical block. It is typically used for end‑to‑end data protection.

7.3 End‑to‑End Data Protection

Data protection uses a 16‑bit Guard (a CRC of the logical block data), an Application Tag, and a Reference Tag that ties the user data to its LBA. Three protection types (Type 1, 2, and 3) are defined, and the PRACT bit controls whether the controller inserts and strips the protection information or passes it through to the host.
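The Guard field is a CRC‑16 over the logical block data using the T10 DIF polynomial 0x8BB7. A bitwise sketch (a real implementation would normally be table-driven):

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16 with polynomial 0x8BB7 (T10 DIF guard), initial value 0. */
static uint16_t nvme_guard_crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x8BB7u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}
```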

Original source: https://zhuanlan.zhihu.com/p/347599423 (author: Fappy)
