In‑Depth Analysis of WiredTiger Storage Engine Architecture and Performance
This article examines WiredTiger’s novel data‑organization design, its in‑memory page and disk‑extent structures, skiplist‑based insert handling, MVCC update lists, compression mechanisms, and performance test results, providing practical configuration advice for MongoDB deployments.
WiredTiger (WT) is a modern storage engine for MongoDB that redesigns the traditional page‑based architecture to fully exploit multi‑core CPUs, large memory, and fast disks, achieving 7‑10× write performance improvements.
Traditional engines use a fixed page size (e.g., InnoDB 16 KB) with strict latch rules for page access, limiting concurrency. WT separates the in‑memory page (a loose data structure) from the on‑disk extent (a variable‑length block), allowing lock‑free multi‑core operations and flexible compression.
The WT data organization consists of two parts:
In‑memory page: a loosely structured page stored in RAM.
Disk extent: a serialized block stored on disk.
WT’s in‑memory page includes several page types (row internal, row leaf, column internal, column fix leaf, column var leaf). The article focuses on the row leaf page, which stores key/value pairs.
Key structures:
wt_row {
    uint64 kv_pos;  // position of the KV cell in the page_disk
}

Each row's kv_pos carries one of three flag formats:
CELL_FLAG (0x01): both key and value may be stored in an overflow cell.
K_FLAG (0x02): only the key position is stored; the value follows the key.
KV_FLAG (0x03): both key and value positions are stored.
WT stores newly inserted K/V pairs in a skiplist of wt_insert nodes, enabling O(log n) operations, lock‑free reads, and ordered range queries.
wt_insert {
    key_offset;  // offset of the key within key_data
    key_size;    // length of the key
    value;       // pointer to the MVCC list head (wt_update)
    next[];      // forward pointers, one per skiplist level
    key_data[];  // buffer storing the key bytes
}

Updates are stored in an MVCC list (wt_update) that is appended atomically using CAS, providing lock‑free reads.
wt_update {
    txnid;    // ID of the transaction that generated the update
    next;     // next (older) update in the list
    size;     // length of the value (0 indicates deletion)
    value[];  // buffer holding the value
}

When a page's size exceeds the configured limit, WT splits the page and creates overflow pages (a single K/V can reach ~4 GB). Overflow pages are cached in a skiplist for fast lookup.
Disk extents contain a page header, block header (with checksum), and the serialized cell data. The read path (in‑memory) loads the extent, verifies checksum, optionally decompresses, and reconstructs the in‑memory page. The write path (reconcile) serializes rows into cells, optionally compresses them, builds a new extent, and writes it to the B‑tree file.
Compression is plug‑in based; WT ships with compressors such as LZO, zlib, and Snappy, and custom compressors can be loaded via wiredtiger_open:
wiredtiger_open(db_path, NULL,
    "extensions=[/usr/local/lib/libwiredtiger_zlib.so]", &connection);

Collections can enable compression when created:
session->create(session, "table:mytable", "block_compressor=zlib");

Performance tests on a typical development machine (i7‑4710MQ, 4 GB RAM, 1 TB SATA) compared compressed and uncompressed extents using 16 concurrent threads. Results showed:
Disk space reduced from ~30 GB to ~2 GB for 200 M records when compressed.
Write throughput was similar for small data sizes, but compressed writes outperformed uncompressed writes (≈790 K TPS vs. 351 K TPS) for larger datasets.
Read performance favored uncompressed pages while the data fit within the OS page cache; once the cache was exceeded, compressed reads became faster due to smaller I/O.
The analysis concludes that WT’s distinct in‑memory and on‑disk structures enable high concurrency and compression benefits, but the extra copying between structures can increase memory pressure on low‑memory systems.
Practical configuration advice includes:
For read‑heavy, small tables (<2 GB), use small page sizes (16 KB) and disable compression.
For large read‑heavy tables, enable compression (e.g., zlib) while keeping modest page sizes.
For write‑heavy tables, increase leaf page size (up to 1 MB), enable compression, and limit OS cache usage.
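As a rough sketch, the advice above maps onto create‑time configuration strings like the following. The keys leaf_page_max and block_compressor are real WiredTiger options, but the table names and exact values here are illustrative, not tuned recommendations:

```
/* small, read-heavy table: small pages, no compression (the default) */
session->create(session, "table:small_hot", "leaf_page_max=16KB");

/* large, read-heavy table: modest pages plus block compression */
session->create(session, "table:big_ro",
    "leaf_page_max=32KB,block_compressor=zlib");

/* write-heavy table: large leaf pages plus compression */
session->create(session, "table:write_heavy",
    "leaf_page_max=1MB,block_compressor=zlib");
```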
Future work will explore WT’s B‑tree/LSM indexing and disk I/O modules.
High Availability Architecture
Official account for High Availability Architecture.