How Meituan’s MStore Achieves Scalable Storage‑Compute Separation in Cloud‑Native Environments

This article explains how Meituan’s storage team designed the MStore distributed storage platform to separate storage and compute, addressing scaling, cost, and reliability challenges of monolithic architectures, and details its cloud‑native components, data model, performance optimizations, observability, and the derived EBS block‑storage service.

ITPUB

May 10, 2023

Background

Cloud‑native environments require storage‑compute separation to avoid data‑migration bottlenecks, reduce resource waste, and simplify development of diverse storage services.

MStore Architecture

MStore is a distributed storage foundation that abstracts common storage capabilities (block, file, object, table, database, big‑data) and exposes a POSIX‑like file API via an SDK.

Subsystems

RootServer : Cluster entry point that tracks all resources (MetaServers, ChunkServers, disks, pools).

MetaServer : Stores user‑data metadata (Blob‑to‑Chunk mapping, Chunk‑to‑ChunkServer mapping). Horizontally scalable and uses Raft for consistency.

ChunkServer : Handles data serialization, checksum, and serves read/write requests.

SDK : Library used by applications to access MStore through file‑system‑like APIs.

Blob and Chunk Model

A Blob is an object similar to a file, composed of 64 MiB Chunks. Two Blob types are provided:

LogBlob : Append‑only writes, used for sequential logging.

ExtentBlob : Supports random writes; data is eventually flushed from LogBlob to ExtentBlob.
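
To make the Blob-to-Chunk relationship concrete, here is a minimal sketch of the offset arithmetic. It assumes a Blob is addressed as one contiguous byte range cut into fixed 64 MiB Chunks; the function names `locate` and `chunks_spanned` are hypothetical and not part of the MStore SDK.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB, the Chunk size stated in the article


def locate(blob_offset: int) -> tuple:
    """Return (chunk_index, offset_within_chunk) for a byte offset in a Blob.

    Assumes fixed-size Chunks laid out back to back (an illustration only).
    """
    if blob_offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(blob_offset, CHUNK_SIZE)


def chunks_spanned(blob_offset: int, length: int) -> range:
    """Return the Chunk indices touched by an I/O of `length` bytes."""
    if length <= 0:
        return range(0)
    first = blob_offset // CHUNK_SIZE
    last = (blob_offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)
```

An I/O that crosses a Chunk boundary touches two Chunks, which is why a client would need the Blob-to-Chunk mapping before it can issue such a request.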

Metadata and Resource Management

Metadata is split into:

Resource information (managed by RootServer): MetaServer groups, ChunkServer list, disk inventory, PhysicalPool and LogicalPool definitions.

User data information (managed by MetaServers): Blob‑Chunk mappings and placement policies.

MetaServers form a Raft cluster to guarantee high availability. PhysicalPools group homogeneous disks; LogicalPools expose QoS‑controlled partitions to users.

Star‑Shaped Write (Star Write)

The SDK contacts RootServer/MetaServer only when acquiring a new Chunk. Data is then written concurrently to three replicas using synchronous writes, reducing network hops and ensuring strong consistency. Write requests from multiple clients are merged into a single disk I/O to lower I/O pressure.
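
The fan-out part of a star write can be sketched as follows. This is an illustration under stated assumptions, not MStore's implementation: `Replica` is a hypothetical in-memory stand-in for a ChunkServer, and the write counts as successful only when every replica acknowledges it.

```python
from concurrent.futures import ThreadPoolExecutor


class Replica:
    """Hypothetical in-memory stand-in for one ChunkServer replica."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.data = bytearray()

    def append(self, payload: bytes) -> bool:
        # A real replica would persist the data before acknowledging.
        if not self.healthy:
            return False
        self.data += payload
        return True


def star_write(replicas, payload: bytes) -> bool:
    """Send the payload to all replicas in parallel.

    The client talks to each replica directly (the "star"), instead of
    chaining the write from one replica to the next.
    """
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        acks = list(pool.map(lambda r: r.append(payload), replicas))
    return all(acks)
```

Note that in this sketch a failed write leaves the healthy replicas holding data the failed one lacks; a real system needs a follow-up step to handle that, which the sketch does not model.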

Storage Format Header

Each disk write is prefixed with a header containing:

prev_size

curr_size

prev_crc

curr_crc

flags

self_cksum

The header enables atomic write boundaries, CRC‑based integrity checks, and fast recovery.
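
A header with these six fields could be packed and verified roughly as below. The field widths (six little-endian 32-bit integers), the use of CRC-32 from `zlib`, and the reading of `prev_*` as describing the data that preceded this write are all assumptions made for illustration; the article does not specify the on-disk layout.

```python
import struct
import zlib

# Assumed layout: prev_size, curr_size, prev_crc, curr_crc, flags (+ self_cksum)
_FIELDS = struct.Struct("<IIIII")
_HEADER = struct.Struct("<IIIIII")


def pack_header(prev: bytes, curr: bytes, flags: int = 0) -> bytes:
    """Build a header for writing `curr` after `prev`."""
    body = _FIELDS.pack(len(prev), len(curr),
                        zlib.crc32(prev), zlib.crc32(curr), flags)
    # self_cksum protects the header itself against torn or corrupt writes.
    return body + struct.pack("<I", zlib.crc32(body))


def verify(header: bytes, curr: bytes) -> bool:
    """Check the header's own checksum, then the payload size and CRC."""
    if len(header) != _HEADER.size:
        return False
    _, curr_size, _, curr_crc, _, self_cksum = _HEADER.unpack(header)
    if zlib.crc32(header[:_FIELDS.size]) != self_cksum:
        return False
    return curr_size == len(curr) and curr_crc == zlib.crc32(curr)
```

During recovery, a record whose header or payload fails `verify` can be treated as an incomplete write, which is one way a header like this supports atomic write boundaries.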

Versioning and Consistency

User data is versioned; the version number increments with each write, which guarantees ordering. Reads can request the latest version, and a data‑inspection service checks that every replica is valid.
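
One way to picture version-based ordering and replica inspection is the sketch below. It assumes each replica tracks a single integer version and accepts only the next version in sequence; `ReplicaState` and `stale_replicas` are hypothetical names, and the real inspection service likely checks more than version numbers.

```python
class ReplicaState:
    """Hypothetical per-replica state: current version plus payload."""

    def __init__(self):
        self.version = 0
        self.data = b""

    def apply(self, version: int, data: bytes) -> bool:
        # Accept only the immediately following version, so writes
        # cannot be applied out of order or with gaps.
        if version != self.version + 1:
            return False
        self.version = version
        self.data = data
        return True


def stale_replicas(replicas) -> list:
    """Return the indices of replicas that lag behind the newest version."""
    latest = max(r.version for r in replicas)
    return [i for i, r in enumerate(replicas) if r.version < latest]
```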

Observability and Tracing

MStore integrates full monitoring, alerting, and a trace pipeline that records each request’s lifecycle across MStore and its dependencies. Traces are exported to Meituan’s Mtrace platform for performance analysis.

Run‑to‑Complete (RTC) Thread Model

Each request is processed by a single thread for its entire lifetime, eliminating inter‑thread coordination for most operations. Mutually exclusive requests are handled by the thread that first acquires the resource; independent requests run in parallel threads. C++ RAII callbacks chain asynchronous actions.
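
The core idea, that one thread owns a request and its state from start to finish, can be sketched as below. Note the substitution: this sketch routes requests by hashing a key to a fixed worker, which is a simpler technique than the first-acquirer ownership the article describes, and it uses Python threads rather than C++. `RtcPool` and its methods are hypothetical names.

```python
import queue
import threading


class RtcPool:
    """Each worker owns a private queue and private state; no locks on state."""

    def __init__(self, n_workers: int):
        self.queues = [queue.Queue() for _ in range(n_workers)]
        self.state = [dict() for _ in range(n_workers)]  # one dict per thread
        self.threads = [
            threading.Thread(target=self._run, args=(i,), daemon=True)
            for i in range(n_workers)
        ]
        for t in self.threads:
            t.start()

    def _run(self, i: int) -> None:
        q, counters = self.queues[i], self.state[i]
        while True:
            key = q.get()
            if key is None:  # shutdown marker
                break
            # The whole request runs here, on this thread only.
            counters[key] = counters.get(key, 0) + 1

    def submit(self, key: str) -> None:
        # Requests for the same key always land on the same worker,
        # so conflicting requests are serialized without a mutex.
        self.queues[hash(key) % len(self.queues)].put(key)

    def shutdown(self) -> dict:
        for q in self.queues:
            q.put(None)
        for t in self.threads:
            t.join()
        merged = {}
        for s in self.state:
            merged.update(s)
        return merged
```

Requests for different keys may run in parallel on different workers, while requests for the same key never race, because only one thread ever touches that key's state.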

User‑Space (Bare‑Metal) Storage Engine

Profiling showed that file‑system overhead dominates latency. A user‑space engine built on SPDK (or io_uring) provides a lightweight, high‑throughput path. The engine defines:

SuperBlock – global disk metadata.

BlockTable – allocation map for blocks.

ChunkTable – allocation map for chunks.

Padding – reserved unused space.

Chunk creation allocates a region in ChunkTable; deletion returns it. Writes allocate blocks in BlockTable; deletion frees them. Blocks are stored in adjacency order within BlockTable.
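
The allocate-and-return behaviour of the two tables can be sketched with a simple slot map. The in-memory representation, the class names `SlotTable` and `Engine`, and the lowest-index-first policy are assumptions for illustration; the article does not describe the actual on-disk structures in this detail.

```python
class SlotTable:
    """Fixed-size allocation map: every slot is either free or owned."""

    def __init__(self, n_slots: int):
        self.owner = [None] * n_slots
        # Reversed so that pop() hands out the lowest free index first.
        self.free = list(range(n_slots - 1, -1, -1))

    def allocate(self, owner) -> int:
        if not self.free:
            raise MemoryError("table is full")
        slot = self.free.pop()
        self.owner[slot] = owner
        return slot

    def release(self, slot: int) -> None:
        if self.owner[slot] is None:
            raise ValueError("slot is already free")
        self.owner[slot] = None
        self.free.append(slot)


class Engine:
    """Hypothetical engine holding a ChunkTable and a BlockTable."""

    def __init__(self, n_chunks: int, n_blocks: int):
        self.chunk_table = SlotTable(n_chunks)
        self.block_table = SlotTable(n_blocks)
        self.blocks_of = {}  # chunk slot -> list of block slots

    def create_chunk(self, name: str) -> int:
        slot = self.chunk_table.allocate(name)
        self.blocks_of[slot] = []
        return slot

    def write_block(self, chunk_slot: int) -> int:
        block = self.block_table.allocate(("chunk", chunk_slot))
        self.blocks_of[chunk_slot].append(block)
        return block

    def delete_chunk(self, chunk_slot: int) -> None:
        # Deleting a chunk frees its blocks, then its ChunkTable slot.
        for block in self.blocks_of.pop(chunk_slot):
            self.block_table.release(block)
        self.chunk_table.release(chunk_slot)
```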

Performance Results

ChunkServer latency averages ~26 µs (≈11 µs spent in SPDK). Under comparable load, write throughput is nearly twice that of Ext4, with significantly lower latency. Read throughput also exceeds Ext4, though Ext4’s page cache yields slightly lower average read latency.

EBS Block Storage Built on MStore

EBS (Elastic Block Service) is the first product built on MStore. It consists of four subsystems:

BlockMaster : Metadata manager for block devices; allocates BlockServers.

BlockServer : Handles all I/O for a segment of a virtual disk.

Client : Block‑device client library.

SnapshotServer : Manages snapshots of block devices.

A virtual disk (Vdisk) is split into 64 GiB Segments, each served by a BlockServer.
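
Routing an I/O to the right BlockServer then reduces to integer division on the Vdisk offset. The sketch below assumes the client already holds a per-Segment owner list (presumably obtained from BlockMaster); the function names and the `bs-*` identifiers are hypothetical.

```python
SEGMENT_SIZE = 64 * 1024 ** 3  # 64 GiB, the Segment size stated in the article


def segment_count(vdisk_bytes: int) -> int:
    """Number of Segments needed to cover a Vdisk (ceiling division)."""
    return -(-vdisk_bytes // SEGMENT_SIZE)


def route(offset: int, segment_owners: list) -> tuple:
    """Return (block_server, segment_index, offset_within_segment)."""
    index, within = divmod(offset, SEGMENT_SIZE)
    return segment_owners[index], index, within
```

For example, a 200 GiB Vdisk needs four Segments, and an offset of 65 GiB falls 1 GiB into the second Segment.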

Data Organization and Indexing

Each Segment contains multiple Entries, each mapped to an ExtentBlob. Writes are first logged to a Write‑Ahead Log (WAL) stored in a LogBlob, then flushed to Entries, turning random writes into sequential group commits.

The in‑memory index consists of:

BaseTable (immutable): Indexes data up to the latest checkpoint.

UpdateTable (mutable): Indexes data after the checkpoint.

Periodic checkpoints merge UpdateTable into BaseTable, simplifying read paths and reducing lock contention.
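
The two-level lookup can be sketched as follows, assuming the index maps a logical block address to a location string and that a checkpoint swaps in a freshly merged, read-only BaseTable. `SegmentIndex` and the `wal:` location format are hypothetical.

```python
from types import MappingProxyType


class SegmentIndex:
    """Two-level index: read-only BaseTable plus mutable UpdateTable."""

    def __init__(self):
        self.base = MappingProxyType({})  # immutable view
        self.update = {}

    def put(self, lba: int, location: str) -> None:
        # New writes only ever touch the UpdateTable.
        self.update[lba] = location

    def get(self, lba: int):
        # Newer entries shadow older ones, so check UpdateTable first.
        if lba in self.update:
            return self.update[lba]
        return self.base.get(lba)

    def checkpoint(self) -> None:
        # Build a new BaseTable and swap it in; the old one is never mutated.
        merged = dict(self.base)
        merged.update(self.update)
        self.base = MappingProxyType(merged)
        self.update = {}
```

Because the BaseTable is never modified in place, readers of it need no lock, which is consistent with the reduced lock contention the article mentions.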

WAL Flush Process

The flush follows a strict timeline:

oldest Log ≤ Dump start ≤ Dump end ≤ Checkpoint ≤ newest Log

Regions:

Log reclaim region : [oldest Log, Dump start) – can be reclaimed.

Dump region : [Dump start, Dump end) – active dump window.

BaseTable range : [oldest Log, Checkpoint) – indexes data before checkpoint (WAL + Entry).

UpdateTable range : [Checkpoint, newest Log) – indexes data after checkpoint (WAL only).
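
The timeline and its half-open regions can be checked mechanically. The sketch below assumes log positions are plain integers (for example, log sequence numbers); `WalMarkers` and the short region names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WalMarkers:
    oldest_log: int
    dump_start: int
    dump_end: int
    checkpoint: int
    newest_log: int

    def __post_init__(self):
        # Enforce: oldest <= dump start <= dump end <= checkpoint <= newest
        seq = [self.oldest_log, self.dump_start, self.dump_end,
               self.checkpoint, self.newest_log]
        if seq != sorted(seq):
            raise ValueError("markers violate the flush timeline")

    def regions(self, pos: int) -> set:
        """Return the names of all regions that contain log position `pos`."""
        spans = {
            "reclaim": (self.oldest_log, self.dump_start),
            "dump": (self.dump_start, self.dump_end),
            "base": (self.oldest_log, self.checkpoint),
            "update": (self.checkpoint, self.newest_log),
        }
        return {name for name, (lo, hi) in spans.items() if lo <= pos < hi}
```

A position inside the reclaim or dump region is also inside the BaseTable range, since BaseTable covers everything from the oldest log up to the checkpoint.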

System Reliability

Raft ensures MetaServer consistency. The star‑write model reduces coordination overhead, and the RTC thread model eliminates most inter‑thread locking. When a three‑replica write fails, the write is quickly redirected, so clients recover with little delay.

Testing and Deployment

MStore includes a comprehensive test suite covering unit, integration, and fault‑injection tests to validate stability and performance before production rollout.
