How RocketMQ Implements Random Indexing for Cloud‑Native Storage
This article explains RocketMQ's random indexing mechanism, detailing its on‑disk three‑segment hash table structure, the compact format conversion process, multi‑threaded write and query workflows, layered system design, crash‑recovery strategy, and comparisons with RocksDB and InnoDB storage engines.
Characteristics of Random Indexing in Message Systems
RocketMQ stores message indices by message ID or business keys (e.g., order number). Keeping these indices only in a database or in local files does not scale to massive write workloads, because local disk capacity bounds how much index data can be retained.
Disk Index Structure
Each index file consists of three segments arranged as a head‑insertion hash table:
IndexHeader : metadata such as magic code, start/end timestamps, number of used slots (hashSlotCount) and total index entries (indexCount).
Slots : a fixed‑size array; each slot holds the file offset of the head node of a singly‑linked list for entries that hash to the same slot.
IndexItems : records containing topicId, queueId, offset, size and other fields needed to locate the original message in the CommitLog.
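The three segments can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RocketMQ's actual on-disk layout: the field widths, the `MAGIC` value, the CRC32 hash, and the `IndexFileSketch` class are all hypothetical, and the header omits the timestamps for brevity. What it does show is the head-insertion step: a new item stores the slot's previous head as its `next` pointer, then the slot is updated to point at the new item.

```python
import struct
import zlib

HEADER_FMT = ">iii"      # magic, hashSlotCount, indexCount (timestamps omitted)
HEADER_SIZE = struct.calcsize(HEADER_FMT)
SLOT_SIZE = 4            # each slot: int32 file offset of the head item, 0 = empty
ITEM_FMT = ">iiiqii"     # keyHash, topicId, queueId, offset, size, next item offset
ITEM_SIZE = struct.calcsize(ITEM_FMT)
MAGIC = 0x1DE7           # hypothetical magic code


class IndexFileSketch:
    def __init__(self, slot_num=8, max_items=16):
        self.slot_num = slot_num
        self.max_items = max_items
        self.used_slots = 0
        self.item_count = 0
        self.items_start = HEADER_SIZE + slot_num * SLOT_SIZE
        self.buf = bytearray(self.items_start + max_items * ITEM_SIZE)

    @staticmethod
    def _hash(key):
        return zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF

    def _slot_pos(self, key_hash):
        return HEADER_SIZE + (key_hash % self.slot_num) * SLOT_SIZE

    def put(self, key, topic_id, queue_id, offset, size):
        if self.item_count >= self.max_items:
            return False  # file is full; the caller would roll to a new file
        key_hash = self._hash(key)
        slot_pos = self._slot_pos(key_hash)
        (old_head,) = struct.unpack_from(">i", self.buf, slot_pos)
        if old_head == 0:
            self.used_slots += 1
        item_pos = self.items_start + self.item_count * ITEM_SIZE
        # Head insertion: the new item points at the previous head of the chain.
        struct.pack_into(ITEM_FMT, self.buf, item_pos,
                         key_hash, topic_id, queue_id, offset, size, old_head)
        struct.pack_into(">i", self.buf, slot_pos, item_pos)
        self.item_count += 1
        struct.pack_into(HEADER_FMT, self.buf, 0,
                         MAGIC, self.used_slots, self.item_count)
        return True

    def get(self, key, max_count=32):
        key_hash = self._hash(key)
        (pos,) = struct.unpack_from(">i", self.buf, self._slot_pos(key_hash))
        results = []
        while pos != 0 and len(results) < max_count:
            h, topic_id, queue_id, offset, size, nxt = struct.unpack_from(
                ITEM_FMT, self.buf, pos)
            if h == key_hash:  # skip other keys that landed in the same slot
                results.append((topic_id, queue_id, offset, size))
            pos = nxt
        return results

    def header(self):
        return struct.unpack_from(HEADER_FMT, self.buf, 0)
```

Because of head insertion, `get` returns the most recently written entries first, which suits lookups that care about the latest messages for a key.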
Compact Format Conversion
The index module is write‑heavy and read‑light, so a small amount of read amplification is acceptable. Let t1 be the write cost, t2 the average time that elapses before an entry is queried, t_compact the time to compact a file, t_before the query latency against the uncompacted format and t_after the query latency against the compacted format. Because t_compact << t2, compaction can run asynchronously and usually finishes before the first query arrives, so it adds nothing to query latency. Since t_after < t_before, the end‑to‑end cost is lower with compaction:
t1 + t2 + t_before > t1 + t2 + t_after
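To make the latency difference concrete, here is a minimal sketch of one possible conversion. It assumes that the compact format stores each slot's entries contiguously, so that a slot holds a (start, count) pair instead of the head of a linked list; the text above does not spell out the layout, so treat this as an assumption. The names `insert`, `compact` and `lookup_compact` are hypothetical.

```python
def insert(slots, items, key_hash, payload):
    """Unsealed format: head insertion into a per-slot linked list."""
    slot = key_hash % len(slots)
    items.append((key_hash, payload, slots[slot]))  # next = previous head
    slots[slot] = len(items) - 1


def compact(slots, items):
    """Rewrite every chain as a contiguous run; each slot becomes (start, count)."""
    compact_slots, compact_items = [], []
    for head in slots:
        start = len(compact_items)
        idx = head
        while idx != -1:
            key_hash, payload, nxt = items[idx]
            compact_items.append((key_hash, payload))  # next pointer is dropped
            idx = nxt
        compact_slots.append((start, len(compact_items) - start))
    return compact_slots, compact_items


def lookup_compact(compact_slots, compact_items, key_hash):
    """One sequential range read instead of pointer chasing across the file."""
    start, count = compact_slots[key_hash % len(compact_slots)]
    return [p for h, p in compact_items[start:start + count] if h == key_hash]


slots, items = [-1, -1], []
for key_hash, payload in [(1, "a"), (2, "b"), (3, "c"), (1, "d")]:
    insert(slots, items, key_hash, payload)
compact_slots, compact_items = compact(slots, items)
```

Under this assumption a query against the compacted file needs one slot read and one contiguous item read, which matters most once the file lives in object storage where every random read is a separate request.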
Lifecycle of a Single Index File
An index file moves through three states:
unsealed : actively written.
compacted : closed to writes, ready for upload.
uploaded : stored in object storage.
When a file reaches its capacity it stops accepting writes and is marked compacted; it is then uploaded to object storage and eventually expires.
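The lifecycle can be read as a small one-way state machine. The sketch below is an illustration only: `FileState` and `IndexFileLifecycle` are hypothetical names, capacity is counted in entries, and expiry is not modeled.

```python
from enum import Enum


class FileState(Enum):
    UNSEALED = "unsealed"
    COMPACTED = "compacted"
    UPLOADED = "uploaded"


class IndexFileLifecycle:
    def __init__(self, capacity):
        self.capacity = capacity
        self.count = 0
        self.state = FileState.UNSEALED

    def append(self):
        # Only an unsealed file accepts writes.
        if self.state is not FileState.UNSEALED:
            raise RuntimeError("file no longer accepts writes")
        self.count += 1
        if self.count >= self.capacity:
            self.state = FileState.COMPACTED  # full: stop writes, ready for upload

    def mark_uploaded(self):
        # Upload is only legal after the file has been closed to writes.
        if self.state is not FileState.COMPACTED:
            raise RuntimeError("only a compacted file can be uploaded")
        self.state = FileState.UPLOADED
```

Keeping the transitions one-way means a file is never written and uploaded at the same time, so readers of an uploaded file never need to coordinate with writers.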
Storage Model for Multiple Index Files
Multiple index files are managed as a set, each with an independent lifecycle. New files are created when the current file is full; each file can be in any of the three states described above.
System Layered Design
Index Service Layer : provides indexing APIs, manages file lifecycles, and coordinates write, query and background tasks.
Index File Parsing Layer : parses individual index files and exposes KV‑style queries and format conversion.
Data Storage Layer : handles binary I/O to local disks, object storage, or databases.
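A minimal sketch of how the three layers might be separated follows. Every class name here is hypothetical, and the per-file format is a toy tab-separated line format standing in for the real hash-table layout; the point is only that the service layer never touches bytes and the storage layer never sees keys.

```python
from abc import ABC, abstractmethod


class DataStorage(ABC):
    """Data storage layer: raw binary I/O, unaware of the index format."""

    @abstractmethod
    def append(self, name, data): ...

    @abstractmethod
    def read(self, name): ...


class InMemoryStorage(DataStorage):
    """Stand-in backend; a disk or object-storage backend would share the interface."""

    def __init__(self):
        self._blobs = {}

    def append(self, name, data):
        self._blobs.setdefault(name, bytearray()).extend(data)

    def read(self, name):
        return bytes(self._blobs.get(name, b""))


class IndexFileParser:
    """Parsing layer: KV-style access to a single file (toy line format)."""

    def __init__(self, storage, name):
        self.storage, self.name, self.count = storage, name, 0

    def put(self, key, value):
        self.storage.append(self.name, f"{key}\t{value}\n".encode("utf-8"))
        self.count += 1

    def get(self, key):
        lines = self.storage.read(self.name).decode("utf-8").splitlines()
        return [v for k, v in (line.split("\t", 1) for line in lines) if k == key]


class IndexService:
    """Service layer: owns the set of files and rolls to a new one when full."""

    def __init__(self, storage, capacity):
        self.storage, self.capacity, self.files = storage, capacity, []

    def put(self, key, value):
        if not self.files or self.files[-1].count >= self.capacity:
            name = f"index-{len(self.files)}"
            self.files.append(IndexFileParser(self.storage, name))
        self.files[-1].put(key, value)

    def get(self, key):
        return [v for f in self.files for v in f.get(key)]
```

With this split, swapping the storage backend or changing the file format touches one class each, and the service-level API stays the same.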
High‑Availability Crash Recovery
On restart the system scans directories named after file states (e.g., writing, compact, upload), loads each index file into memory, and rebuilds an in‑memory skip‑list that tracks file locations and statuses.
Comparison with Other Storage Engines
RocksDB uses Log‑Structured Merge (LSM) trees with asynchronous compaction, which improves read performance but incurs significant write amplification.
MySQL InnoDB relies on B+‑tree structures and redo logs; it offers high write throughput but its index scalability is limited for very large tables.
RocketMQ’s append‑only, time‑ordered design enables simple hot‑cold separation and asynchronous format conversion, reducing overall latency and avoiding the heavy write amplification of LSM‑tree compaction.
Remaining Issues and Future Improvements
The current design lacks an efficient global maxCount limit. Queries may need to scan all index files before determining that the required number of results has been found, causing unnecessary I/O. Introducing a thread‑safe global counter would allow early termination once maxCount is reached.
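One way such a counter could work is sketched below. This is a proposal-level illustration, not existing RocketMQ code: `ResultBudget`, `query_file` and `query_all` are hypothetical names, and each "file" is simply a list of (key, value) pairs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ResultBudget:
    """Thread-safe counter shared by all per-file queries."""

    def __init__(self, max_count):
        self._remaining = max_count
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._remaining <= 0:
                return False
            self._remaining -= 1
            return True

    def exhausted(self):
        with self._lock:
            return self._remaining <= 0


def query_file(entries, key, budget):
    hits = []
    for k, v in entries:
        if budget.exhausted():
            break  # another file already filled the quota: stop scanning early
        if k == key and budget.try_acquire():
            hits.append(v)
    return hits


def query_all(files, key, max_count):
    budget = ResultBudget(max_count)
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = pool.map(lambda f: query_file(f, key, budget), files)
        return [v for part in parts for v in part]
```

Each hit must win a `try_acquire` before it is kept, so the combined result never exceeds `max_count` even when several files are scanned concurrently.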
Extending IndexItem via inheritance would enable the service to support additional systems without rewriting the core indexing logic.
Reference Documents
1. Lin, W., Yang, M., Zhang, L., & Zhou, L. (2008). PacificA: Replication in Log‑Based Distributed Storage Systems. https://www.microsoft.com/en-us/research/wp-content/uploads/2008/02/tr-2008-25.pdf
2. RocksDB Compactions. https://github.com/facebook/rocksdb/wiki/Compaction
3. Inside InnoDB: The InnoDB Storage Engine. https://dev.mysql.com/doc/refman/8.0/en/innodb-internals.html
Alibaba Cloud Native
