Understanding NVMe over PCIe: Architecture, Commands, and Data Structures

This article provides a comprehensive overview of the NVMe protocol over PCIe, covering its logical device interface, key terminology, SSD architecture, PCIe register configuration, controller registers, queue structures, arbitration mechanisms, PRP and SGL addressing, command sets, controller initialization, reset procedures, shutdown processes, host command examples, and advanced features such as firmware updates and end‑to‑end data protection.

Open Source Linux

Overview

NVMe (Non‑Volatile Memory Express) is a logical device interface specification similar to AHCI, defining a bus‑level protocol for accessing non‑volatile storage media attached via PCIe.

1. Terminology

Namespace

A namespace is a collection of logical blocks (LB), identified by a namespace ID (NSID); its attributes are reported in the Identify Namespace data structure.

Fused Operations

Fused operations aggregate two consecutive commands in a queue, ensuring atomic execution; only NVM commands support this.

Command Execution Order

Except for fused operations, each command in a submission queue (SQ) is independent and data‑dependency issues (e.g., RAW) are handled by the host.

Write Atomicity

The controller must support write‑unit atomicity, which can be reduced via the host‑configurable Write Atomicity feature for performance.

Metadata

Optional extra information that provides verification capabilities.

Arbitration Mechanism

Three arbitration methods are defined: round‑robin (RR), weighted RR, and custom implementations.

Logical Block (LB)

The smallest read/write unit defined by NVMe (e.g., 2 KB, 4 KB), addressed by LBA.

Queue Pair

Consists of a Submission Queue (SQ) and a Completion Queue (CQ); the host submits commands via SQ and the controller returns completions via CQ.

1.2 NVMe SSD

Basic Architecture

An NVMe SSD comprises three parts: the host driver, the PCIe‑based NVMe controller, and the storage media (FTL + NAND flash).

NVMe Controller

The controller functions as a DMA engine combined with multiple queues to exploit flash parallelism.

2 PCIe Register Configuration

PCIe Bus Structure

PCIe consists of three layers (physical, data‑link, and transaction). NVMe sits above the transaction layer as the application layer, using PCIe as the underlying bus.

Register Configuration

The specification defines three main parts: PCI header, PCI capabilities, and PCI Express extended capabilities.

PCI Header

NVMe controllers are endpoint devices and use the 64‑byte type 0 configuration header.
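
As a rough illustration of what the type 0 header carries, the sketch below decodes a few of its fields from a raw 64‑byte buffer. The field offsets follow the standard PCI configuration header layout and 0x010802 is the class code NVMe controllers report; the function name and the sample vendor ID, device ID, and BAR values are made up for the example.

```python
import struct

def parse_type0_header(header: bytes) -> dict:
    """Decode a few fields of a 64-byte PCI type 0 configuration header."""
    if len(header) != 64:
        raise ValueError("a type 0 header is 64 bytes long")
    vendor_id, device_id = struct.unpack_from("<HH", header, 0x00)
    prog_if, subclass, base_class = struct.unpack_from("<BBB", header, 0x09)
    header_type = header[0x0E] & 0x7F  # bit 7 is the multi-function flag
    bar0, bar1 = struct.unpack_from("<II", header, 0x10)
    return {
        "vendor_id": vendor_id,
        "device_id": device_id,
        "class_code": (base_class << 16) | (subclass << 8) | prog_if,
        "header_type": header_type,
        # BAR0 and BAR1 together form a 64-bit memory BAR; the low 4 bits
        # of BAR0 are type flags, not address bits.
        "bar0_base": ((bar1 << 32) | bar0) & ~0xF,
    }

# Hypothetical header contents, built by hand for illustration only.
raw = bytearray(64)
struct.pack_into("<HH", raw, 0x00, 0x1234, 0x5678)     # made-up vendor/device IDs
struct.pack_into("<BBB", raw, 0x09, 0x02, 0x08, 0x01)  # class code 01h/08h/02h
struct.pack_into("<II", raw, 0x10, 0xFEB00004, 0x0)    # 64-bit memory BAR
info = parse_type0_header(bytes(raw))
is_nvme = info["class_code"] == 0x010802
```

The base address recovered from BAR0/BAR1 is where the NVMe controller registers described later are memory‑mapped.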

PCI Capabilities

Include power management, interrupt management (MSI/MSI‑X), and PCIe capabilities.

PCI Express Extended Capabilities

Cover advanced features such as Advanced Error Reporting (AER).

3 NVMe Register Definition

Registers are split into controller‑wide attributes and per‑queue head/tail doorbell registers.

CAP – controller capabilities (page size, I/O command set, arbitration, etc.)

VS – version number

INTMS – interrupt mask set (not used with MSI‑X)

INTMC – interrupt mask clear (not used with MSI‑X)

CC – controller configuration (queue element size, shutdown, arbitration, etc.)

CSTS – controller status (shutdown, fatal error, ready)

AQA – admin queue attributes (SQ and CQ size)

ASQ – admin SQ base address

ACQ – admin CQ base address

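Doorbell (DB) registers for every queue start at offset 0x1000: the admin queue pair comes first, followed by the SQ tail and CQ head doorbells of each I/O queue.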
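
To make the register map concrete, here is a minimal sketch of the byte offsets (relative to BAR0) and of how a doorbell address can be derived from the queue ID. It assumes the layout in the NVMe base specification, where the doorbell stride is 4 << CAP.DSTRD and CAP.DSTRD occupies CAP bits 35:32; the helper names are illustrative.

```python
# Byte offsets of the controller registers listed above, relative to BAR0.
REGISTER_OFFSETS = {
    "CAP": 0x00, "VS": 0x08, "INTMS": 0x0C, "INTMC": 0x10,
    "CC": 0x14, "CSTS": 0x1C, "AQA": 0x24, "ASQ": 0x28, "ACQ": 0x30,
}
DOORBELL_BASE = 0x1000

def doorbell_stride(cap: int) -> int:
    """Distance in bytes between consecutive doorbells: 4 << CAP.DSTRD."""
    dstrd = (cap >> 32) & 0xF
    return 4 << dstrd

def sq_tail_doorbell(qid: int, cap: int) -> int:
    """Offset of the submission queue tail doorbell for queue `qid`."""
    return DOORBELL_BASE + (2 * qid) * doorbell_stride(cap)

def cq_head_doorbell(qid: int, cap: int) -> int:
    """Offset of the completion queue head doorbell for queue `qid`."""
    return DOORBELL_BASE + (2 * qid + 1) * doorbell_stride(cap)

# Queue 0 is the admin queue pair; queue 1 is the first I/O queue pair.
cap_value = 0  # hypothetical CAP value with DSTRD = 0 (4-byte stride)
offsets = [hex(sq_tail_doorbell(q, cap_value)) for q in (0, 1)]
```

With a zero stride field the doorbells are packed back to back, so the admin SQ tail sits at 0x1000 and the first I/O SQ tail at 0x1008.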

Register Understanding

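CAP describes the controller’s capabilities; CC selects which capabilities are active. Fields such as MPS, AMS, and CSS may only be changed while the controller is disabled (CC.EN = 0), so changing them on a running controller requires a controller reset.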

Admin queues are created via AQA, ASQ, and ACQ registers; I/O queues are created with admin commands.

4 Memory Data Structures

4.1 SQ and CQ Definition

Submission Queues (SQ) hold commands; Completion Queues (CQ) hold completion entries. Queues can be empty or full; a full queue leaves one slot unused.

Queue size is a 16‑bit value: minimum 2 entries, maximum 64 K for I/O queues and 4 K for admin queues. Queues have a priority (U, H, M, L) that the host can set if supported.
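
The empty and full conditions can be sketched with a small circular‑buffer model. This is only an index model (no entries are stored), and the class and method names are hypothetical; the point is that head == tail means empty, while full is declared one slot early so the two states stay distinguishable.

```python
class Ring:
    """Index-only model of an NVMe queue with `size` slots."""

    def __init__(self, size: int):
        if not 2 <= size <= 65536:
            raise ValueError("queue size must be between 2 and 64 K entries")
        self.size = size
        self.head = 0  # consumer index
        self.tail = 0  # producer index

    def is_empty(self) -> bool:
        return self.head == self.tail

    def is_full(self) -> bool:
        # One slot is always left unused, so full means the tail is
        # exactly one position behind the head.
        return self.head == (self.tail + 1) % self.size

    def used(self) -> int:
        return (self.tail - self.head) % self.size

    def produce(self) -> None:
        if self.is_full():
            raise RuntimeError("queue full")
        self.tail = (self.tail + 1) % self.size

    def consume(self) -> None:
        if self.is_empty():
            raise RuntimeError("queue empty")
        self.head = (self.head + 1) % self.size

ring = Ring(4)
for _ in range(3):
    ring.produce()
# A 4-slot queue holds at most 3 entries.
```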

4.2 Arbitration

RR

Round‑robin arbitration treats admin and I/O queues with equal priority; the controller may select multiple commands per arbitration burst.
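
A toy simulation can show how the arbitration burst shapes the fetch order. The sketch below is not how any particular controller is implemented; it simply visits the queues in ID order and takes up to `burst` commands from each per visit. Queue 0 stands in for the admin queue, and the command labels are invented.

```python
from collections import deque

def round_robin(queues: dict, burst: int) -> list:
    """Return (queue_id, command) pairs in round-robin fetch order."""
    if burst < 1:
        raise ValueError("burst must be at least 1")
    pending = {qid: deque(cmds) for qid, cmds in queues.items()}
    order = []
    while any(pending.values()):
        for qid in sorted(pending):
            q = pending[qid]
            # Take at most `burst` commands from this queue, then move on.
            for _ in range(burst):
                if not q:
                    break
                order.append((qid, q.popleft()))
    return order

submitted = {0: ["a0"], 1: ["b0", "b1", "b2"], 2: ["c0", "c1"]}
fetch_order = round_robin(submitted, burst=2)
```

With a burst of 2, queue 1 gets two commands in the first pass and must wait a full round for its third.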

Weighted RR

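I/O queues carry one of four priority levels (urgent, high, medium, low). The admin queue is served first and urgent queues next; the high, medium, and low queues then share the remaining bandwidth in proportion to host‑configured weights. Arbitration is non‑preemptive.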

Vendor Specific

Custom arbitration implementations.

4.3 Data Addressing (PRP and SGL)

PRP

Physical Region Page (PRP) entries are 64‑bit physical addresses pointing to host memory pages. PRP can be used directly or via a PRP List when the data spans multiple pages.

Admin commands must use PRP; I/O commands may use PRP or SGL (Scatter‑Gather List).
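
The choice between direct PRP entries and a PRP List can be sketched as follows. For simplicity the example assumes the buffer is physically contiguous (real buffers usually are not, and each page would then get its own translated address) and that the PRP List fits in a single page; `list_addr` is a hypothetical physical address of the page holding that list.

```python
def prp_entries(addr: int, length: int, page_size: int = 4096) -> list:
    """Page addresses touched by a physically contiguous buffer."""
    if length <= 0:
        raise ValueError("length must be positive")
    entries = [addr]  # only the first entry may carry a page offset
    next_page = (addr // page_size + 1) * page_size
    end = addr + length
    while next_page < end:
        entries.append(next_page)  # later entries are page aligned
        next_page += page_size
    return entries

def build_prp(addr: int, length: int, list_addr: int, page_size: int = 4096):
    """Return (prp1, prp2, prp_list) for one command."""
    entries = prp_entries(addr, length, page_size)
    if len(entries) == 1:
        return entries[0], 0, []           # PRP2 unused
    if len(entries) == 2:
        return entries[0], entries[1], []  # PRP2 is the second page
    # More than two pages: PRP2 points at a list holding the rest.
    return entries[0], list_addr, entries[1:]

LIST_PAGE = 0x20000000  # hypothetical location of the PRP List page
small = build_prp(0x10000200, 1024, LIST_PAGE)
straddle = build_prp(0x10000F00, 4096, LIST_PAGE)
large = build_prp(0x10000000, 16384, LIST_PAGE)
```

A 4 KB transfer that starts mid‑page straddles two pages and still fits in PRP1 and PRP2, while the 16 KB transfer needs a three‑entry list.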

SGL

SGL consists of segments and descriptors; six descriptor types are defined (data block, bit bucket, segment, last segment, keyed data block, transport data block).

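Example: a 13 KB read in which 11 KB is delivered to three memory blocks (3 KB, 4 KB, 4 KB) and the remaining 2 KB is discarded through a bit bucket descriptor, for a total of four data‑describing SGL descriptors.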
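
The arithmetic of this example can be checked with a small model. It assumes, following the corresponding example in the NVMe specification, that the 2 KB not covered by the three memory blocks is dropped through a bit bucket descriptor; the descriptor class and the host addresses are hypothetical, and segment descriptors that chain the list together are left out.

```python
from dataclasses import dataclass

KB = 1024

@dataclass
class SglDescriptor:
    kind: str          # "data" (data block) or "bit_bucket"
    length: int        # bytes described by this descriptor
    address: int = 0   # host address; unused for a bit bucket

def sgl_totals(descriptors: list) -> tuple:
    """Return (bytes covered by the SGL, bytes actually written to host memory)."""
    covered = sum(d.length for d in descriptors)
    transferred = sum(d.length for d in descriptors if d.kind == "data")
    return covered, transferred

# Hypothetical host addresses for the three destination buffers.
read_sgl = [
    SglDescriptor("data", 3 * KB, 0x10000000),
    SglDescriptor("data", 4 * KB, 0x10004000),
    SglDescriptor("bit_bucket", 2 * KB),   # read from media, then discarded
    SglDescriptor("data", 4 * KB, 0x10008000),
]
covered, transferred = sgl_totals(read_sgl)
```

The descriptors cover the full 13 KB read while only 11 KB lands in host memory.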

PRP vs. SGL

Both describe host memory regions; PRP maps to whole pages, while SGL can describe arbitrary contiguous regions, offering greater flexibility.

5 NVMe Commands

5.0 Command Execution Flow

The host writes commands to an SQ and updates the SQ tail doorbell. The controller fetches each command by DMA, executes it, writes a completion entry to the CQ, and notifies the host with an interrupt (typically MSI‑X). The host then processes the completion and updates the CQ head doorbell.

5.1 Command Classification

Commands are divided into Admin commands (manage the controller) and NVM/I/O commands (perform data transfers).

5.2 Command Format

All commands are 64 bytes with a common layout; fields vary by command.
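
As an illustration of the common layout, the sketch below packs a Read command into a 64‑byte submission queue entry. It assumes the layout from the NVMe base specification (opcode in bits 7:0 and command identifier in bits 31:16 of Dword 0, namespace ID in Dword 1, metadata pointer in Dwords 4 and 5, PRP1 and PRP2 in Dwords 6 to 9, starting LBA in Dwords 10 and 11, zero‑based block count in Dword 12) and leaves all optional fields at zero. The function name and sample values are invented.

```python
import struct

# CDW0, NSID, reserved (CDW2-3), MPTR, PRP1, PRP2, CDW10..CDW15
SQE_FORMAT = "<IIQQQQ6I"
OPCODE_READ = 0x02

def build_read_command(cid: int, nsid: int, slba: int, nblocks: int,
                       prp1: int, prp2: int = 0) -> bytes:
    """Pack an NVM Read command into a 64-byte submission queue entry."""
    if not 0 <= cid <= 0xFFFF:
        raise ValueError("command identifier is 16 bits")
    if not 1 <= nblocks <= 65536:
        raise ValueError("block count must be between 1 and 65536")
    cdw0 = OPCODE_READ | (cid << 16)
    cdw10 = slba & 0xFFFFFFFF        # starting LBA, low 32 bits
    cdw11 = (slba >> 32) & 0xFFFFFFFF  # starting LBA, high 32 bits
    cdw12 = nblocks - 1              # number of logical blocks, zero based
    return struct.pack(SQE_FORMAT, cdw0, nsid, 0, 0, prp1, prp2,
                       cdw10, cdw11, cdw12, 0, 0, 0)

# Hypothetical values: read 8 blocks from LBA 0x1_0000_0010 of namespace 1.
sqe = build_read_command(cid=7, nsid=1, slba=0x100000010, nblocks=8,
                         prp1=0x10000000)
```

Other commands reuse the same 64‑byte frame and differ mainly in the opcode and in how Dwords 10 to 15 are interpreted.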

5.3 Admin Commands

Examples include Create/Delete I/O SQ/CQ, Identify, Get/Set Features, Firmware Image Download, Firmware Activate, Async Event Request, etc. Each command is identified by an 8‑bit opcode in Dword0.

5.4 NVM Commands

Examples include Flush, Write, Read, Write Uncorrectable, Compare, Dataset Management, etc., also identified by opcode.

6 Controller Structure and Operation

The specification defines three controller types: I/O, Administrative, and Discovery (the last is used by NVMe over Fabrics).

6.1 Command Execution Process

Host writes command(s) to SQ.

Host updates SQ tail doorbell.

Controller fetches command via DMA.

Controller executes command.

Controller writes completion to CQ.

Controller signals host via interrupt.

Host processes completion.

Host updates CQ head doorbell.
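
The eight steps above can be traced with a toy model of one queue pair. Nothing here touches hardware: the "DMA fetch" is a list read, every command completes successfully, and the interrupt is represented by the return value of `controller_run`. The phase tag lets the host tell fresh completion entries from stale ones after the CQ wraps. All class and method names are hypothetical.

```python
class ToyQueuePair:
    """Toy model of one SQ/CQ pair; no real device or DMA is involved."""

    def __init__(self, depth: int):
        self.depth = depth
        self.sq = [None] * depth
        self.cq = [None] * depth
        self.sq_tail = 0           # host producer index
        self.sq_tail_doorbell = 0  # value last written to the doorbell
        self.sq_head = 0           # controller consumer index
        self.host_sq_head = 0      # host's view, learned from completions
        self.cq_tail = 0           # controller producer index
        self.cq_head = 0           # host consumer index
        self.ctrl_phase = 1        # phase tag written on the current pass
        self.host_phase = 1        # phase tag the host expects

    def host_submit(self, cid: int, payload: str) -> None:
        nxt = (self.sq_tail + 1) % self.depth
        if nxt == self.host_sq_head:
            raise RuntimeError("SQ full")
        self.sq[self.sq_tail] = (cid, payload)  # step 1: write command
        self.sq_tail = nxt
        self.sq_tail_doorbell = self.sq_tail    # step 2: ring tail doorbell

    def controller_run(self) -> int:
        posted = 0
        while self.sq_head != self.sq_tail_doorbell:
            cid, _payload = self.sq[self.sq_head]  # step 3: fetch
            self.sq_head = (self.sq_head + 1) % self.depth
            # Step 4 (execution) is skipped; step 5 posts the completion.
            self.cq[self.cq_tail] = {"cid": cid, "status": 0,
                                     "sq_head": self.sq_head,
                                     "phase": self.ctrl_phase}
            self.cq_tail = (self.cq_tail + 1) % self.depth
            if self.cq_tail == 0:
                self.ctrl_phase ^= 1  # flip phase on wrap-around
            posted += 1
        return posted  # step 6: stands in for the interrupt

    def host_reap(self) -> list:
        done = []
        while True:
            entry = self.cq[self.cq_head]
            if entry is None or entry["phase"] != self.host_phase:
                break  # no new completion at the head
            done.append(entry["cid"])               # step 7: process
            self.host_sq_head = entry["sq_head"]
            self.cq_head = (self.cq_head + 1) % self.depth  # step 8
            if self.cq_head == 0:
                self.host_phase ^= 1
        return done

qp = ToyQueuePair(depth=4)
for cid in (1, 2, 3):
    qp.host_submit(cid, "read")
interrupts = qp.controller_run()
first_batch = qp.host_reap()
```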

6.2 Reset

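A controller reset can be triggered by a PCIe conventional reset (cold, warm, or hot), a function level reset, an NVM subsystem reset, or the host changing CC.EN from 1 to 0. The reset deletes all I/O queues, aborts outstanding commands, clears CSTS.RDY to 0, and leaves AQA, ASQ, and ACQ unchanged. Afterwards the host re‑enables the controller by setting CC.EN to 1, waits for CSTS.RDY to return to 1, and re‑creates the I/O queues.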

6.3 Interrupts

NVMe supports pin‑based, single‑message MSI, multi‑message MSI, and MSI‑X. MSI‑X is recommended because each CQ can have its own interrupt vector.

6.4 Initialization

Configure PCI/PCIe registers.

Wait for CSTS.RDY to read 0 (controller disabled).

Set AQA, ASQ, ACQ.

Configure CC.

Set CC.EN.

Wait for CSTS.RDY to read 1 (controller ready for commands).

Issue Identify to discover controller and namespace structures.

Use Set Features (Number of Queues) to negotiate how many I/O queues the controller will grant, and configure interrupts.

Allocate I/O SQs and CQs.

Optionally submit Async Event Request for health monitoring.
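
The register part of this sequence (disable, program the admin queue, enable, wait for ready) can be sketched against a fake register file. The bit positions assume the NVMe base specification: in AQA, the zero‑based SQ size is in bits 11:0 and the CQ size in bits 27:16; in CC, EN is bit 0, MPS is bits 10:7, AMS is bits 13:11, IOSQES is bits 19:16, and IOCQES is bits 23:20. `FakeController` is a stand‑in in which CSTS.RDY simply mirrors CC.EN, and the queue addresses are invented.

```python
def build_aqa(sq_entries: int, cq_entries: int) -> int:
    """AQA value for the given admin queue sizes (stored zero based)."""
    for n in (sq_entries, cq_entries):
        if not 2 <= n <= 4096:
            raise ValueError("admin queues hold 2 to 4096 entries")
    return ((cq_entries - 1) << 16) | (sq_entries - 1)

def build_cc(enable: bool, page_shift: int = 12, ams: int = 0,
             iosqes: int = 6, iocqes: int = 4) -> int:
    """CC value; entry sizes are log2 (64-byte SQ and 16-byte CQ entries)."""
    return (int(enable) | ((page_shift - 12) << 7) | (ams << 11)
            | (iosqes << 16) | (iocqes << 20))

class FakeController:
    """Stand-in register file: CSTS.RDY follows CC.EN immediately."""
    def __init__(self):
        self.regs = {"CC": 0, "CSTS": 0, "AQA": 0, "ASQ": 0, "ACQ": 0}
    def write(self, name: str, value: int) -> None:
        self.regs[name] = value
        if name == "CC":
            self.regs["CSTS"] = (self.regs["CSTS"] & ~1) | (value & 1)
    def read(self, name: str) -> int:
        return self.regs[name]

def wait_ready(ctrl, expected: int, polls: int = 1000) -> None:
    """Poll CSTS.RDY; a real driver would bound this with CAP.TO."""
    for _ in range(polls):
        if (ctrl.read("CSTS") & 1) == expected:
            return
    raise TimeoutError("CSTS.RDY did not reach the expected value")

def bring_up(ctrl, asq_addr: int, acq_addr: int, depth: int = 32) -> None:
    ctrl.write("CC", ctrl.read("CC") & ~1)   # make sure EN is 0
    wait_ready(ctrl, 0)
    ctrl.write("AQA", build_aqa(depth, depth))
    ctrl.write("ASQ", asq_addr)
    ctrl.write("ACQ", acq_addr)
    ctrl.write("CC", build_cc(enable=False))  # configure while disabled
    ctrl.write("CC", build_cc(enable=True))   # then set EN
    wait_ready(ctrl, 1)

ctrl = FakeController()
bring_up(ctrl, asq_addr=0x10000000, acq_addr=0x10001000)
```

After this point a driver would continue with the admin commands in the list: Identify, the queue‑count negotiation, and I/O queue creation.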

6.5 Host Command Examples

Creating an I/O SQ, processing a completion, and a full read transaction flow are illustrated with packet captures and memory‑write/read TLPs.

7 NVMe Features

7.1 Firmware Update

Download firmware image using Firmware Image Download command.

Activate firmware with Firmware Activate command.

Reset controller.

Re‑initialize controller and re‑allocate I/O queues.
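
The download step usually takes several commands, because an image is sent in pieces. The sketch below splits an image into pieces and computes the two values each Firmware Image Download command carries, assuming the specification's encoding: a zero‑based number of dwords (NUMD) in Dword 10 and a dword offset (OFST) in Dword 11. The 4 KB piece size is an arbitrary choice for the example, not a requirement.

```python
def firmware_download_chunks(image_len: int, chunk_bytes: int = 4096):
    """Yield (numd, ofst, nbytes) for each Firmware Image Download command."""
    if image_len <= 0 or image_len % 4:
        raise ValueError("image length must be a positive multiple of 4")
    if chunk_bytes <= 0 or chunk_bytes % 4:
        raise ValueError("chunk size must be a positive multiple of 4")
    offset = 0
    while offset < image_len:
        nbytes = min(chunk_bytes, image_len - offset)
        numd = nbytes // 4 - 1   # number of dwords, zero based
        ofst = offset // 4       # offset into the image, in dwords
        yield numd, ofst, nbytes
        offset += nbytes

# A hypothetical 10 KB image sent in 4 KB pieces.
plan = list(firmware_download_chunks(10 * 1024))
```

Once every piece has been downloaded, the activation step and the reset that follows it make the new image take effect.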

7.2 Metadata Transfer

Metadata can be transferred either contiguously with each logical block (an extended LBA) or in a separate buffer. It can carry protection information such as a CRC guard, an application tag, and a reference tag.

7.3 End‑to‑End Data Protection

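Protection information consists of a 16‑bit Guard (a CRC computed over the logical block data), an Application Tag whose contents are opaque to the controller and defined by the host application, and a Reference Tag that ties the user data to its logical block address. Three protection modes are defined based on whether metadata is present and whether the PRACT bit is set.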
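
To show how the three fields fit together, the sketch below computes a guard and packs an 8‑byte protection information tuple. It assumes the guard is the T10 DIF CRC‑16 (polynomial 0x8BB7, initial value 0, no bit reflection) and that the tuple is laid out as a 2‑byte guard, 2‑byte application tag, and 4‑byte reference tag, most significant byte first. The tag values are invented.

```python
import struct

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, MSB first."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def protection_info(block: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Pack guard, application tag, and reference tag into 8 bytes."""
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, ref_tag)

# Hypothetical 512-byte block at LBA 5 with a made-up application tag.
block = bytes(512)
pi = protection_info(block, app_tag=0xABCD, ref_tag=5)
```

A receiver can recompute the guard over the data it got and compare the reference tag with the LBA it expected, which catches both corrupted and misdirected blocks.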
