
Demystifying NVMe: From Protocol Basics to PCIe Register Configurations

This comprehensive guide explains the NVMe specification, covering terminology, SSD architecture, PCIe register layout, queue structures, arbitration mechanisms, data addressing methods, command formats, controller operation, reset procedures, interrupt handling, and advanced features such as firmware updates and end‑to‑end data protection.

Architects' Tech Alliance

Overview

NVMe (Non‑Volatile Memory Express) is a logical device interface specification similar to AHCI, designed for accessing non‑volatile storage attached via the PCIe bus. It defines the protocol, command set, and register layout used by hosts and controllers.

Terminology

Namespace : A quantity of non‑volatile memory formatted into logical blocks (LB); its attributes are reported in the Identify Namespace data structure.

Fused Operations : Two simple commands fused into one compound operation. They must sit next to each other in the same submission queue and are executed atomically; only the NVM command set supports fusing.

Command Execution Order : Apart from fused operations, each command in a submission queue (SQ) is independent; data‑dependency issues are the host’s responsibility.

Write Atomicity : Controllers must support atomic write units, but hosts can configure the Write Atomicity feature to reduce the atomic unit size for performance.

Metadata : Optional extra information that provides validation or other auxiliary data.

Arbitration Mechanism : Determines which SQ’s commands are selected next; three methods are defined: round‑robin (RR), weighted RR, and vendor‑specific.

Logical Block (LB) : The smallest read/write unit defined by NVMe (e.g., 512 B, 4 KB), addressed by LBA.

Queue Pair : Consists of a Submission Queue (SQ) and a Completion Queue (CQ); hosts submit commands via SQ, controllers return completions via CQ.

NVM Subsystem : Includes the controller, NVM storage media, and the interface between them.

NVMe SSD Architecture

An NVMe SSD comprises three parts: the host‑side driver (integrated in Linux, Windows, etc.), the PCIe + NVMe controller, and the storage media (FTL + NAND flash).

NVMe SSD block diagram

NVMe Controller

The controller operates as a DMA engine combined with multiple queues, allowing parallel flash access.

NVMe controller diagram

PCIe Register Configuration

NVMe over PCIe relies on PCIe for the physical layer. The controller's PCIe configuration space (4 KB total) holds three register groups: the PCI header, PCI capabilities, and PCI Express extended capabilities.

PCI Header

Two header types exist: type 0 for devices (NVMe controllers are endpoint devices) and type 1 for bridges. The NVMe controller uses a type 0 header, which occupies 64 bytes.

PCI header layout
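As a rough illustration of the type 0 header, the sketch below decodes a few standard fields from the first 64 bytes of configuration space and checks the class code that identifies an NVMe controller (base class 01h, subclass 08h, programming interface 02h). The function names and the hand-built header contents (vendor ID, device ID, BAR value) are made up for the example; a real driver would read these bytes from the device.

```python
import struct

def parse_type0_header(cfg: bytes) -> dict:
    """Decode selected fields of a 64-byte PCI type 0 configuration header."""
    if len(cfg) < 64:
        raise ValueError("need at least the 64-byte header")
    vendor_id, device_id = struct.unpack_from("<HH", cfg, 0x00)
    _revision, prog_if, subclass, base_class = struct.unpack_from("<BBBB", cfg, 0x08)
    header_type = cfg[0x0E]
    bar0, bar1 = struct.unpack_from("<II", cfg, 0x10)
    return {
        "vendor_id": vendor_id,
        "device_id": device_id,
        "class_code": (base_class, subclass, prog_if),
        "header_type": header_type & 0x7F,          # bit 7 is the multi-function flag
        # BAR0/BAR1 form one 64-bit memory BAR; the low 4 bits of BAR0 are flags.
        "bar0_address": ((bar1 << 32) | bar0) & ~0xF,
        "capabilities_pointer": cfg[0x34],          # start of the PCI capabilities list
    }

def is_nvme(header: dict) -> bool:
    # Mass storage (01h), non-volatile memory (08h), NVMe interface (02h).
    return header["class_code"] == (0x01, 0x08, 0x02)

# Hypothetical header contents, built by hand for illustration only.
cfg = bytearray(64)
struct.pack_into("<HH", cfg, 0x00, 0x1234, 0x5678)            # made-up IDs
struct.pack_into("<BBBB", cfg, 0x08, 0x00, 0x02, 0x08, 0x01)  # rev, prog-if, subclass, class
struct.pack_into("<II", cfg, 0x10, 0xFE000004, 0x00000000)    # 64-bit memory BAR
cfg[0x34] = 0x40

hdr = parse_type0_header(bytes(cfg))
print(hex(hdr["bar0_address"]), is_nvme(hdr))
```

The BAR0 address recovered here is where the NVMe controller registers described later are mapped.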

PCI Capabilities

Includes power management, interrupt management (MSI, MSI‑X), and PCIe capabilities.

PCI Express Extended Capabilities

Provides advanced features such as Advanced Error Reporting (AER).

NVMe Register Definitions

CAP : Controller capabilities (minimum and maximum memory page size, supported command sets, doorbell stride, ready timeout, supported arbitration mechanisms, whether queues must be physically contiguous, maximum queue entries).

VS : Version number of the NVMe implementation.

INTMS : Interrupt mask set (not used when MSI‑X is enabled).

INTMC : Interrupt mask clear (not used when MSI‑X is enabled).

CC : Controller configuration (I/O SQ/CQ element size, shutdown notification, arbitration, page size, enabled command set, enable flag).

CSTS : Controller status (shutdown state, fatal error, ready).

AQA : Admin queue attributes (SQ size, CQ size).

ASQ : Admin SQ base address.

ACQ : Admin CQ base address.

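Starting at offset 0x1000, each queue has a doorbell register: a tail doorbell for each SQ and a head doorbell for each CQ, spaced according to the doorbell stride reported in CAP.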
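The register list can be made concrete with a small sketch that records the register offsets within BAR0, decodes a few CAP fields, and computes doorbell addresses from the doorbell stride (CAP.DSTRD). The helper names and the sample CAP value are invented for illustration; the bit positions follow the NVMe base specification as I understand it.

```python
# Offsets of the controller registers inside the BAR0 memory region.
REG = {"CAP": 0x00, "VS": 0x08, "INTMS": 0x0C, "INTMC": 0x10,
       "CC": 0x14, "CSTS": 0x1C, "AQA": 0x24, "ASQ": 0x28, "ACQ": 0x30}
DOORBELL_BASE = 0x1000

def bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit field value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_cap(cap: int) -> dict:
    return {
        "max_queue_entries": bits(cap, 15, 0) + 1,        # MQES is 0-based
        "contiguous_queues_required": bool(bits(cap, 16, 16)),
        "timeout_ms": bits(cap, 31, 24) * 500,            # TO is in 500 ms units
        "doorbell_stride_bytes": 4 << bits(cap, 35, 32),  # DSTRD
        "min_page_size": 1 << (12 + bits(cap, 51, 48)),   # MPSMIN
        "max_page_size": 1 << (12 + bits(cap, 55, 52)),   # MPSMAX
    }

def sq_tail_doorbell(qid: int, dstrd: int) -> int:
    return DOORBELL_BASE + (2 * qid) * (4 << dstrd)

def cq_head_doorbell(qid: int, dstrd: int) -> int:
    return DOORBELL_BASE + (2 * qid + 1) * (4 << dstrd)

# Made-up CAP value: 1024 entries, contiguous queues, 10 s timeout, 4 KB to 64 KB pages.
sample_cap = 0x3FF | (1 << 16) | (0x14 << 24) | (0 << 32) | (0 << 48) | (4 << 52)
info = decode_cap(sample_cap)
print(info)
print(hex(sq_tail_doorbell(1, 0)), hex(cq_head_doorbell(1, 0)))
```

Queue 0 is the admin queue pair, so with a stride of 4 bytes its doorbells sit at 0x1000 and 0x1004.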

Memory Data Structures

Submission and Completion Queues

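Queue head and tail indices are 16 bits wide. The minimum queue size is 2 entries because a queue is full when the tail is one slot behind the head, so one slot always stays empty. I/O queues can hold up to 64 K entries and admin queues up to 4 K entries; the limit for a given controller is reported in CAP. When weighted round‑robin arbitration is supported, each I/O SQ is assigned one of four priority levels: urgent (U), high (H), medium (M), or low (L).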
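The empty and full conditions for these circular queues come down to simple modular arithmetic. The following minimal sketch (helper names are my own) shows why a queue of size 2 can hold only one outstanding entry.

```python
def is_empty(head: int, tail: int) -> bool:
    # Nothing to consume when the consumer has caught up with the producer.
    return head == tail

def is_full(head: int, tail: int, size: int) -> bool:
    # Full when advancing the tail would make it collide with the head.
    return (tail + 1) % size == head

def free_slots(head: int, tail: int, size: int) -> int:
    # One slot is always left unused so that full and empty stay distinguishable.
    return (head - tail - 1) % size

size = 2
head = tail = 0
print(is_empty(head, tail), free_slots(head, tail, size))  # empty, one usable slot
tail = (tail + 1) % size                                    # producer adds one entry
print(is_full(head, tail, size))
```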

Arbitration Mechanism

RR: Equal priority, round‑robin scheduling.

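Weighted RR with urgent priority class: The admin SQ has the highest strict priority, followed by urgent‑priority SQs; the high, medium, and low classes then share the remaining bandwidth according to host‑configured weights. Arbitration is non‑preemptive.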

Vendor‑specific: Custom arbitration.
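To make the weighted scheme more tangible, here is a deliberately simplified model. It treats each priority class as a single queue, serves the urgent class with strict priority, and hands out credits to the remaining classes in proportion to their weights. The class name, the credit scheme, and the weights are assumptions chosen for illustration; real controllers also round-robin among multiple SQs within a class and fetch commands in bursts.

```python
from collections import deque

class WeightedRoundRobin:
    def __init__(self, weights: dict):
        self.weights = dict(weights)               # e.g. {"high": 2, "medium": 1, "low": 1}
        self.queues = {"urgent": deque()}
        self.queues.update({cls: deque() for cls in self.weights})
        self.credits = dict(self.weights)

    def submit(self, priority: str, command) -> None:
        self.queues[priority].append(command)

    def next_command(self):
        # Urgent commands always win over the weighted classes.
        if self.queues["urgent"]:
            return self.queues["urgent"].popleft()
        for _ in range(2):
            for cls in self.weights:
                if self.queues[cls] and self.credits[cls] > 0:
                    self.credits[cls] -= 1
                    return self.queues[cls].popleft()
            # Every class with pending work has used its credits: start a new round.
            self.credits = dict(self.weights)
        return None                                # all queues are empty

arb = WeightedRoundRobin({"high": 2, "medium": 1, "low": 1})
for prio, cmd in [("high", "h1"), ("high", "h2"), ("high", "h3"),
                  ("medium", "m1"), ("low", "l1"), ("low", "l2"), ("urgent", "u1")]:
    arb.submit(prio, cmd)

order = []
while (cmd := arb.next_command()) is not None:
    order.append(cmd)
print(order)
```

With these weights the high class gets two commands per round for every one from medium and low, while the urgent command jumps ahead of everything already queued.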

Data Addressing (PRP and SGL)

NVMe uses two methods to describe host memory locations for data transfer:

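PRP (Physical Region Page) : Each PRP entry is a 64‑bit physical memory page address. A command carries two entries: PRP1 points to the first page of the buffer and may include an offset into that page; PRP2 points to the second page or, when the transfer spans more than two pages, to a PRP List. Entries in a PRP List must be page‑aligned.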

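SGL (Scatter‑Gather List) : Consists of SGL segments, each containing descriptors (data block, bit bucket, segment, last segment, keyed data block, transport data block). Because each descriptor carries its own address and length, an SGL can describe buffers of arbitrary size and alignment, which makes it more flexible than PRP's page granularity.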

In NVMe over PCIe, admin commands must use PRP; I/O commands may use PRP or SGL, selected by bits 15:14 of command dword 0 (DW0).
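The PRP rules are easier to follow with numbers. The sketch below derives PRP1, PRP2, and an optional PRP List for a transfer, under the simplifying assumption that the buffer is physically contiguous (in practice each page address comes from the operating system's DMA mapping). The function name and the `list_addr` parameter, which stands for wherever the host placed the PRP List, are hypothetical; chaining of PRP Lists across multiple pages is not handled.

```python
def build_prps(addr: int, length: int, page_size: int = 4096, list_addr: int = 0):
    """Return (prp1, prp2, prp_list) for a physically contiguous buffer."""
    if length <= 0:
        raise ValueError("length must be positive")
    offset = addr % page_size
    first_len = min(length, page_size - offset)    # bytes covered by PRP1
    remaining = length - first_len
    n_pages = -(-remaining // page_size)           # ceiling division
    next_page = addr - offset + page_size
    pages = [next_page + i * page_size for i in range(n_pages)]

    if n_pages == 0:
        return addr, 0, []                         # PRP1 alone covers the transfer
    if n_pages == 1:
        return addr, pages[0], []                  # PRP2 points at the second page
    if n_pages > page_size // 8:
        raise ValueError("PRP List chaining is not handled in this sketch")
    if list_addr == 0:
        raise ValueError("a PRP List address is required for more than two pages")
    return addr, list_addr, pages                  # PRP2 points at the PRP List

print(build_prps(0x10000200, 512))
print(build_prps(0x10000200, 4096))
print(build_prps(0x10000000, 16384, list_addr=0x20000000))
```

Only PRP1 may carry an offset into its page; every later entry is page-aligned, which is what keeps the format compact.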

NVMe Command Execution Flow

Host writes command(s) to a pre‑allocated SQ.

Host updates the SQ tail doorbell.

Controller fetches commands via DMA.

Controller executes the command.

Controller writes a completion entry to the CQ via DMA.

Controller signals the host via interrupt (MSI‑X recommended).

Host processes the completion and updates the CQ head doorbell.
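The flow above can be modelled in a few lines. This is a toy simulation, not driver code: the class and method names are invented, commands complete instantly, the interrupt is only a comment, and the controller does not check for a full CQ. It does show the two doorbells and the phase tag that lets the host tell new completion entries from stale ones after the CQ wraps.

```python
class QueuePair:
    def __init__(self, size: int):
        self.size = size
        self.sq = [None] * size
        self.cq = [None] * size
        self.sq_tail = 0            # host: next free SQ slot
        self.host_sq_head = 0       # host's view of the SQ head, learned from completions
        self.sq_head = 0            # controller: next SQ slot to fetch
        self.cq_tail = 0            # controller: next CQ slot to fill
        self.cq_head = 0            # host: next CQ slot to consume
        self.sq_tail_doorbell = 0
        self.cq_head_doorbell = 0
        self.ctrl_phase = 1         # phase tag the controller writes on this pass
        self.host_phase = 1         # phase tag the host expects on this pass

    def host_submit(self, cid: int, opcode: int) -> None:
        """Write the command into the SQ, then ring the SQ tail doorbell."""
        if (self.sq_tail + 1) % self.size == self.host_sq_head:
            raise RuntimeError("submission queue full")
        self.sq[self.sq_tail] = {"cid": cid, "opcode": opcode}
        self.sq_tail = (self.sq_tail + 1) % self.size
        self.sq_tail_doorbell = self.sq_tail

    def controller_process(self) -> None:
        """Fetch, execute, and post a completion entry for each new command."""
        while self.sq_head != self.sq_tail_doorbell:
            cmd = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.size
            self.cq[self.cq_tail] = {"cid": cmd["cid"], "status": 0,
                                     "sq_head": self.sq_head, "phase": self.ctrl_phase}
            self.cq_tail = (self.cq_tail + 1) % self.size
            if self.cq_tail == 0:
                self.ctrl_phase ^= 1    # the phase tag flips on every wrap
        # A real controller would raise an interrupt (e.g. MSI-X) here.

    def host_reap(self) -> list:
        """Consume new completions, then ring the CQ head doorbell."""
        done = []
        while True:
            entry = self.cq[self.cq_head]
            if entry is None or entry["phase"] != self.host_phase:
                break                   # stale entry: nothing new yet
            done.append(entry["cid"])
            self.host_sq_head = entry["sq_head"]
            self.cq_head = (self.cq_head + 1) % self.size
            if self.cq_head == 0:
                self.host_phase ^= 1
        self.cq_head_doorbell = self.cq_head
        return done

qp = QueuePair(size=4)
for cid in (1, 2, 3):
    qp.host_submit(cid, opcode=0x02)
qp.controller_process()
first = qp.host_reap()
for cid in (4, 5, 6):
    qp.host_submit(cid, opcode=0x02)
qp.controller_process()
second = qp.host_reap()                 # this batch crosses the CQ wrap point
print(first, second)
```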

Command Classification

Admin commands : Submitted to the Admin SQ; used for controller configuration, log retrieval, feature management, firmware updates, etc.

NVM (I/O) commands : Submitted to I/O SQs; used for data transfer and related operations (Read, Write, Flush, Compare, Write Uncorrectable, Dataset Management, etc.).

Admin Command Set (selected)

00h – Delete I/O SQ

01h – Create I/O SQ

02h – Get Log Page

04h – Delete I/O CQ

05h – Create I/O CQ

06h – Identify

08h – Abort

09h – Set Features

0Ah – Get Features

0Ch – Asynchronous Event Request

10h – Firmware Activate (named Firmware Commit since NVMe 1.2)

11h – Firmware Image Download

NVM Command Set (selected)

00h – Flush

01h – Write

02h – Read

04h – Write Uncorrectable

05h – Compare

09h – Dataset Management
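The opcodes listed above end up in the low byte of the first dword of a 64-byte submission queue entry. As a hedged sketch of that command format, the code below packs a Read command using PRP addressing; the enum and function names are my own, and the addresses and namespace ID in the usage lines are arbitrary.

```python
import struct
from enum import IntEnum

class AdminOpcode(IntEnum):
    DELETE_IO_SQ = 0x00
    CREATE_IO_SQ = 0x01
    GET_LOG_PAGE = 0x02
    DELETE_IO_CQ = 0x04
    CREATE_IO_CQ = 0x05
    IDENTIFY = 0x06
    SET_FEATURES = 0x09
    GET_FEATURES = 0x0A

class NvmOpcode(IntEnum):
    FLUSH = 0x00
    WRITE = 0x01
    READ = 0x02
    COMPARE = 0x05
    DATASET_MANAGEMENT = 0x09

def build_read(cid: int, nsid: int, slba: int, nblocks: int,
               prp1: int, prp2: int = 0) -> bytes:
    """Pack a 64-byte NVM Read submission queue entry (PRP addressing)."""
    if not 1 <= nblocks <= 0x10000:
        raise ValueError("nblocks must be between 1 and 65536")
    fuse, psdt = 0, 0                            # not fused; PRP selected in bits 15:14
    cdw0 = int(NvmOpcode.READ) | (fuse << 8) | (psdt << 14) | (cid << 16)
    return struct.pack(
        "<IIQQQQIIIIII",
        cdw0, nsid,
        0,                                       # DW2-3: reserved
        0,                                       # DW4-5: metadata pointer
        prp1, prp2,                              # DW6-9: data pointer
        slba & 0xFFFFFFFF, slba >> 32,           # DW10-11: starting LBA
        nblocks - 1,                             # DW12: number of blocks, 0-based
        0, 0, 0)                                 # DW13-15: unused here

sqe = build_read(cid=7, nsid=1, slba=0x1_0000_0010, nblocks=8, prp1=0x10000000)
print(len(sqe), sqe[:4].hex())
```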

Controller Structure and Operation

The specification defines three controller types: I/O, Administrative, and Discovery (Discovery controllers apply to NVMe over Fabrics).

Reset Procedures

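Controller‑level reset : Triggered, for example, by clearing CC.EN to 0. The controller deletes all I/O SQs and CQs, aborts outstanding commands, and clears CSTS.RDY; AQA, ASQ, and ACQ are preserved. To resume, the host sets CC.EN to 1, waits for CSTS.RDY to become 1, and re‑creates the I/O queues.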

Queue‑level reset : Delete and recreate a specific queue after ensuring it is idle.

Interrupts

Supported types: pin‑based, single‑message MSI, multi‑message MSI, and MSI‑X (recommended, up to 2 K vectors). MSI‑X allows each CQ to generate its own interrupt vector.

Initialization Sequence

Configure PCI and PCIe registers.

With CC.EN cleared, wait for CSTS.RDY to become 0, which indicates that any previous reset has completed.

Program AQA, ASQ, ACQ.

Program CC and set CC.EN.

Wait for CSTS.RDY.

Issue Identify to discover controller and namespace structures.

Use Set Features (Number of Queues) to negotiate how many I/O queues are available, and configure interrupts.

Create the I/O CQs first, then the I/O SQs that reference them.

Optionally submit Asynchronous Event Request for health monitoring.
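The register-level part of this sequence (up to the point where the controller reports ready) can be sketched against a stand-in for the BAR0 registers. `FakeController` is purely hypothetical and just makes CSTS.RDY follow CC.EN; a real driver would access memory-mapped registers and derive its timeout from CAP. The queue entry sizes written to CC here (64-byte SQ entries, 16-byte CQ entries) are the sizes used by the NVM command set.

```python
class FakeController:
    """Stand-in for the BAR0 registers; RDY simply mirrors EN."""
    def __init__(self):
        self.regs = {"CC": 0, "CSTS": 0, "AQA": 0, "ASQ": 0, "ACQ": 0}

    def read(self, name: str) -> int:
        return self.regs[name]

    def write(self, name: str, value: int) -> None:
        self.regs[name] = value
        if name == "CC":
            self.regs["CSTS"] = (self.regs["CSTS"] & ~1) | (value & 1)

def wait_ready(ctrl, expected: int, polls: int = 1000) -> None:
    for _ in range(polls):
        if (ctrl.read("CSTS") & 1) == expected:
            return
    raise TimeoutError("CSTS.RDY did not reach %d" % expected)

def enable_controller(ctrl, asq_addr: int, acq_addr: int,
                      sq_entries: int, cq_entries: int) -> None:
    ctrl.write("CC", ctrl.read("CC") & ~1)        # make sure CC.EN is 0
    wait_ready(ctrl, 0)                           # previous reset has finished
    # AQA holds 0-based sizes: ASQS in bits 11:0, ACQS in bits 27:16.
    ctrl.write("AQA", ((cq_entries - 1) << 16) | (sq_entries - 1))
    ctrl.write("ASQ", asq_addr)
    ctrl.write("ACQ", acq_addr)
    # IOSQES = 6 (2^6 = 64 B), IOCQES = 4 (2^4 = 16 B); MPS, CSS, AMS left at 0
    # (4 KB pages, NVM command set, round-robin arbitration).
    cc = (6 << 16) | (4 << 20)
    ctrl.write("CC", cc)
    ctrl.write("CC", cc | 1)                      # set CC.EN
    wait_ready(ctrl, 1)                           # controller is ready for admin commands

ctrl = FakeController()
enable_controller(ctrl, asq_addr=0x10000000, acq_addr=0x10001000,
                  sq_entries=32, cq_entries=32)
print(hex(ctrl.read("CC")), hex(ctrl.read("AQA")), ctrl.read("CSTS") & 1)
```

From here on, the remaining steps (Identify, queue negotiation, I/O queue creation) are ordinary admin commands submitted through the admin queue pair.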

NVMe Features

Firmware Update Process

Download firmware image using the Firmware Image Download command.

Activate the new firmware with Firmware Activate.

Controller reset.

Re‑initialize the controller (same steps as initialization).
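As a sketch of the first two steps, the code below splits a firmware image into Firmware Image Download commands and builds the follow-up activation command (opcode 10h). Commands are represented as plain dictionaries, and the function names, the 4 KB chunk size, and the slot and action values are assumptions for the example. The dword-based length and offset fields follow the command definitions as I understand them.

```python
def firmware_download_chunks(image: bytes, chunk_bytes: int = 4096):
    """Yield one Firmware Image Download command (opcode 11h) per chunk."""
    if len(image) % 4 or chunk_bytes % 4:
        raise ValueError("image and chunk sizes must be multiples of 4 bytes")
    for offset in range(0, len(image), chunk_bytes):
        chunk = image[offset:offset + chunk_bytes]
        yield {
            "opcode": 0x11,
            "cdw10": len(chunk) // 4 - 1,   # NUMD: number of dwords, 0-based
            "cdw11": offset // 4,           # OFST: offset into the image, in dwords
            "data": chunk,
        }

def firmware_activate(slot: int, action: int) -> dict:
    """Build the activation command: slot in bits 2:0, action in bits 5:3."""
    return {"opcode": 0x10, "cdw10": (slot & 0x7) | ((action & 0x7) << 3)}

image = bytes(10240)                         # placeholder image contents
cmds = list(firmware_download_chunks(image))
# Action 1: replace the image in the slot and activate it at the next reset.
activate = firmware_activate(slot=1, action=1)
print([(c["cdw10"], c["cdw11"]) for c in cmds], activate)
```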

Metadata Transfer

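Metadata can be transferred in one of two ways: appended to each LB so that data and metadata form one extended logical block, or in a separate buffer referenced by the command's metadata pointer. It is typically used for end‑to‑end data protection (e.g., CRC, application tag, reference tag).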

End‑to‑End Data Protection

Data travels between host memory and NAND flash in two hops: host memory ↔ controller over PCIe, and controller ↔ NAND flash. Errors can occur on either hop; the protection information carried in the metadata (guard, application tag, reference tag) provides integrity verification. Three protection types (Type 1, Type 2, Type 3) are defined, which differ mainly in how the reference tag is checked.

End‑to‑end protection diagram
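The protection information is an 8-byte field: a 16-bit CRC guard over the block data, a 16-bit application tag, and a 32-bit reference tag. The sketch below computes and checks it using the T10-DIF CRC-16 polynomial (0x8BB7); the function names are my own, and the reference tag is set to the low 32 bits of the LBA as in Type 1 protection.

```python
import struct

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def make_protection_info(block: bytes, app_tag: int, lba: int) -> bytes:
    # Guard, application tag, reference tag, stored big-endian.
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, lba & 0xFFFFFFFF)

def verify(block: bytes, pi: bytes, expected_lba: int) -> bool:
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(block) and ref_tag == (expected_lba & 0xFFFFFFFF)

block = bytes(range(256)) * 2                    # one 512-byte logical block
pi = make_protection_info(block, app_tag=0x0001, lba=0x1234)
corrupted = bytes([block[0] ^ 0x01]) + block[1:]  # flip one bit in transit
print(pi.hex(), verify(block, pi, 0x1234), verify(corrupted, pi, 0x1234))
```

A mismatched guard catches corrupted data, while a mismatched reference tag catches a block that was read from or written to the wrong LBA.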