How Meituan’s MStore Achieves Scalable Storage‑Compute Separation in Cloud‑Native Environments
This article explains how Meituan’s storage team designed the MStore distributed storage platform to separate storage from compute, addressing the scaling, cost, and reliability challenges of monolithic architectures. It covers MStore’s cloud‑native components, data model, performance optimizations, and observability, as well as EBS, the block‑storage service built on top of it.
Background
Cloud‑native environments require storage‑compute separation to avoid data‑migration bottlenecks, reduce resource waste, and simplify development of diverse storage services.
MStore Architecture
MStore is a distributed storage foundation that abstracts common storage capabilities (block, file, object, table, database, big‑data) and exposes a POSIX‑like file API via an SDK.
Subsystems
RootServer : Cluster entry point that tracks all resources (MetaServers, ChunkServers, disks, pools).
MetaServer : Stores user‑data metadata (Blob‑to‑Chunk mapping, Chunk‑to‑ChunkServer mapping). Horizontally scalable and uses Raft for consistency.
ChunkServer : Handles data serialization, checksum, and serves read/write requests.
SDK : Library used by applications to access MStore through file‑system‑like APIs.
Blob and Chunk Model
A Blob is a file‑like object composed of 64 MiB Chunks. Two Blob types are provided:
LogBlob : Append‑only writes, used for sequential logging.
ExtentBlob : Supports random writes; data is eventually flushed from LogBlob to ExtentBlob.
Metadata and Resource Management
Metadata is split into:
Resource information (managed by RootServer): MetaServer groups, ChunkServer list, disk inventory, PhysicalPool and LogicalPool definitions.
User data information (managed by MetaServers): Blob‑Chunk mappings and placement policies.
MetaServers form a Raft cluster to guarantee high availability. PhysicalPools group homogeneous disks; LogicalPools expose QoS‑controlled partitions to users.
Star‑Shaped Write (Star Write)
The SDK contacts RootServer/MetaServer only when acquiring a new Chunk. Data is then written concurrently to three replicas using synchronous writes, reducing network hops and ensuring strong consistency. Write requests from multiple clients are merged into a single disk I/O to lower I/O pressure.
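The fan-out pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Replica` is an in-memory stand-in for a ChunkServer, `star_write` is a hypothetical name rather than an MStore SDK call, and the merging of writes from multiple clients is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


class Replica:
    """In-memory stand-in for one ChunkServer replica (hypothetical)."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.data = bytearray()

    def append(self, payload: bytes) -> bool:
        if not self.healthy:
            return False  # simulate a failed write
        self.data += payload
        return True


def star_write(replicas: list[Replica], payload: bytes) -> bool:
    """Send the payload to every replica in parallel.

    The write is acknowledged only if all replicas report success,
    which is what makes the write synchronous across replicas.
    """
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        acks = list(pool.map(lambda r: r.append(payload), replicas))
    return all(acks)
```

Because the client talks to all replicas directly, each write costs one network round trip instead of a chain of hops. Note that when `star_write` returns `False`, the healthy replicas in this sketch have already appended the payload; a real system needs a recovery step for that case, which is outside the scope of this example.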
Storage Format Header
Each disk write is prefixed with a header containing:
prev_size
curr_size
prev_crc
curr_crc
flags
self_cksum
The header enables atomic write boundaries, CRC‑based integrity checks, and fast recovery.
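As a rough illustration of how such a header can protect itself and its payload, the Python sketch below packs the six fields and verifies them on read. The field names come from the list above, but everything else is an assumption: the article does not give field widths, byte order, or the checksum algorithm, so this sketch uses six little-endian 32-bit integers and `zlib.crc32`, and it treats the field values as opaque integers.

```python
import struct
import zlib

# Assumed layout: five 32-bit fields followed by a 32-bit self checksum.
_BODY = struct.Struct("<5I")    # prev_size, curr_size, prev_crc, curr_crc, flags
_HEADER = struct.Struct("<6I")  # body + self_cksum
_NAMES = ("prev_size", "curr_size", "prev_crc", "curr_crc", "flags")


def pack_header(prev_size: int, curr_size: int, prev_crc: int,
                curr_crc: int, flags: int = 0) -> bytes:
    """Serialize the header; self_cksum covers the five preceding fields."""
    body = _BODY.pack(prev_size, curr_size, prev_crc, curr_crc, flags)
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_header(raw: bytes) -> dict:
    """Parse a header and reject it if its own checksum does not match."""
    *fields, self_cksum = _HEADER.unpack(raw[:_HEADER.size])
    if zlib.crc32(raw[:_BODY.size]) != self_cksum:
        raise ValueError("header checksum mismatch")
    return dict(zip(_NAMES, fields))
```

A reader that finds a header with a bad `self_cksum` can treat it as the boundary of a torn write, which is one way a header of this shape supports fast recovery.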
Versioning and Consistency
User data is versioned: the version number increments with each write, which guarantees ordering. Reads can request the latest version, and a data‑inspection service checks that every replica is valid.
Observability and Tracing
MStore integrates full monitoring, alerting, and a trace pipeline that records each request’s lifecycle across MStore and its dependencies. Traces are exported to Meituan’s Mtrace platform for performance analysis.
Run‑to‑Complete (RTC) Thread Model
Each request is processed by a single thread for its entire lifetime, eliminating inter‑thread coordination for most operations. Mutually exclusive requests are handled by the thread that first acquires the resource; independent requests run in parallel threads. C++ RAII callbacks chain asynchronous actions.
User‑Space (Bare‑Metal) Storage Engine
Profiling showed that file‑system overhead dominates latency. A user‑space engine built on SPDK (or io_uring) provides a lightweight, high‑throughput path. The engine defines:
SuperBlock – global disk metadata.
BlockTable – allocation map for blocks.
ChunkTable – allocation map for chunks.
Padding – reserved unused space.
Chunk creation allocates a region in ChunkTable; deletion returns it. Writes allocate blocks in BlockTable; deletion frees them. Blocks are stored in adjacency order within BlockTable.
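The allocate-and-return behavior of the two tables can be modeled with a simple allocation map. The sketch below is an assumption-laden toy: `SlotTable` is a hypothetical name, the first-fit search is chosen for brevity, and the on-disk layout of the real ChunkTable and BlockTable is not described in the article.

```python
class SlotTable:
    """Fixed-size allocation map; a stand-in for ChunkTable or BlockTable."""

    def __init__(self, capacity: int):
        self.used = [False] * capacity

    def allocate(self) -> int:
        """Return the index of the first free slot and mark it used."""
        for idx, taken in enumerate(self.used):
            if not taken:
                self.used[idx] = True
                return idx
        raise MemoryError("no free slot")

    def free(self, idx: int) -> None:
        """Return a slot to the table; freeing an unused slot is an error."""
        if not self.used[idx]:
            raise ValueError(f"slot {idx} is not allocated")
        self.used[idx] = False
```

In this model, creating a Chunk would call `allocate()` on the chunk table, each write would call `allocate()` on the block table, and deletion would call `free()` on the corresponding table.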
Performance Results
ChunkServer latency averages ~26 µs (≈11 µs spent in SPDK). Under comparable load, write throughput is nearly twice that of Ext4, with significantly lower latency. Read throughput also exceeds Ext4, though Ext4’s page cache yields slightly lower average read latency.
EBS Block Storage Built on MStore
EBS (Elastic Block Service) is the first product built on MStore. It consists of four subsystems:
BlockMaster : Metadata manager for block devices; allocates BlockServers.
BlockServer : Handles all I/O for a segment of a virtual disk.
Client : Block‑device client library.
SnapshotServer : Manages snapshots of block devices.
A virtual disk (Vdisk) is split into 64 GiB Segments, each served by a BlockServer.
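Since each Segment has its own BlockServer, an I/O request that crosses a Segment boundary has to be split before it is routed. A minimal Python sketch of that split follows; only the 64 GiB Segment size comes from the article, and `split_by_segment` is a hypothetical helper, not part of the EBS client.

```python
SEGMENT_SIZE = 64 * 1024**3  # 64 GiB, the Segment size stated in the article


def split_by_segment(offset: int, length: int) -> list[tuple[int, int, int]]:
    """Split a Vdisk byte range into per-Segment pieces.

    Returns a list of (segment_index, offset_in_segment, piece_length).
    """
    pieces = []
    end = offset + length
    while offset < end:
        seg, seg_off = divmod(offset, SEGMENT_SIZE)
        # Stop at the Segment boundary or at the end of the request.
        n = min(SEGMENT_SIZE - seg_off, end - offset)
        pieces.append((seg, seg_off, n))
        offset += n
    return pieces
```

For example, an 8 KiB request that starts 4 KiB before the end of Segment 0 yields two pieces, one for Segment 0 and one for Segment 1, each of which would go to a different BlockServer.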
Data Organization and Indexing
Each Segment contains multiple Entries, each mapped to an ExtentBlob. Writes are first logged to a Write‑Ahead Log (WAL) stored in a LogBlob, then flushed to Entries, turning random writes into sequential group commits.
The in‑memory index consists of:
BaseTable (immutable): Indexes data up to the latest checkpoint.
UpdateTable (mutable): Indexes data after the checkpoint.
Periodic checkpoints merge UpdateTable into BaseTable, simplifying read paths and reducing lock contention.
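The two-table index can be sketched as follows. This is a simplified model under assumptions: plain dictionaries stand in for the real index structures, the key is a logical block address, the value is an opaque location string, and the class and method names are hypothetical.

```python
class SegmentIndex:
    """Toy two-level index: an immutable base plus a mutable update table."""

    def __init__(self):
        self.base_table = {}    # replaced wholesale at each checkpoint
        self.update_table = {}  # receives every write after the checkpoint

    def put(self, lba: int, location: str) -> None:
        self.update_table[lba] = location

    def get(self, lba: int):
        # Newer data shadows older data, so check the update table first.
        if lba in self.update_table:
            return self.update_table[lba]
        return self.base_table.get(lba)

    def checkpoint(self) -> None:
        # Build a new base and swap it in; the old base is never mutated.
        merged = dict(self.base_table)
        merged.update(self.update_table)
        self.base_table = merged
        self.update_table = {}
```

Because the base table is never modified in place, readers of it need no lock, and only the small update table sees concurrent writes. That is one plausible reading of how the design reduces lock contention.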
WAL Flush Process
The flush follows a strict timeline:
oldest Log ≤ Dump start ≤ Dump end ≤ Checkpoint ≤ newest Log
Regions:
Log reclaim region : [oldest Log, Dump start) – can be reclaimed.
Dump region : [Dump start, Dump end) – active dump window.
BaseTable range : [oldest Log, Checkpoint) – indexes data before checkpoint (WAL + Entry).
UpdateTable range : [Checkpoint, newest Log) – indexes data after checkpoint (WAL only).
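The half-open ranges listed above translate directly into comparisons on log positions. The Python sketch below assumes positions are plain integers (for example, log sequence numbers), which the article does not specify; `WalTimeline` and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WalTimeline:
    oldest_log: int
    dump_start: int
    dump_end: int
    checkpoint: int
    newest_log: int

    def __post_init__(self):
        # Enforce oldest Log <= Dump start <= Dump end <= Checkpoint <= newest Log.
        points = [self.oldest_log, self.dump_start, self.dump_end,
                  self.checkpoint, self.newest_log]
        if points != sorted(points):
            raise ValueError("timeline invariant violated")

    def reclaimable(self, pos: int) -> bool:
        """True if pos lies in [oldest Log, Dump start)."""
        return self.oldest_log <= pos < self.dump_start

    def in_dump_window(self, pos: int) -> bool:
        """True if pos lies in [Dump start, Dump end)."""
        return self.dump_start <= pos < self.dump_end

    def index_table(self, pos: int) -> Optional[str]:
        """Name the table that indexes pos, or None if pos is out of range."""
        if self.oldest_log <= pos < self.checkpoint:
            return "BaseTable"
        if self.checkpoint <= pos < self.newest_log:
            return "UpdateTable"
        return None
```

Validating the ordering once at construction means every later range check can rely on the invariant instead of re-checking it.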
System Reliability
Raft ensures MetaServer consistency. The star‑write model reduces coordination overhead, and the RTC thread model eliminates most inter‑thread locking. When a three‑replica write fails, it is redirected quickly, so writes recover with minimal interruption.
Testing and Deployment
MStore includes a comprehensive test suite covering unit, integration, and fault‑injection tests to validate stability and performance before production rollout.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
