Meituan Large‑Scale KV Storage: Challenges and Architectural Practices
The article details Meituan’s evolution of KV storage, analyzes scalability and availability challenges of both the in‑memory (Squirrel) and persistent (Cellar) systems, and presents concrete architectural solutions such as gossip optimization, fork‑less RDB, multi‑threading, bulkload, and cross‑region disaster recovery, while outlining future directions such as ZooKeeper removal and vector engine support.
1. KV Storage Evolution at Meituan
Meituan’s KV service supports trillions of daily requests with 99.995% availability. Early designs used client‑side consistent hashing with many Memcached instances, which suffered data loss on node failures and during capacity expansion.
Adopting Redis introduced a master‑slave topology with Sentinel failover, but consistent hashing still caused data loss during scaling.
In 2014 Meituan integrated Alibaba’s open‑source Tair, adding a central routing service and data migration to preserve data during expansion, yet faced issues like lack of distributed arbitration (risk of split‑brain) and limited data structures compared to Redis.
Recognizing that open‑source solutions could not fully meet Meituan’s scale, the team built two custom systems: Squirrel, a high‑throughput in‑memory KV store, and Cellar, a persistent large‑capacity store.
2. Large‑Scale KV Challenges
Two main challenges are scalability and availability. Horizontal scaling hits limits as node count grows: batch operations such as mget fan out to more nodes, amplifying long‑tail latency, and adding nodes wastes resources, so vertical scaling is needed for batch‑heavy workloads. At the same time, maintaining the availability of smaller clusters becomes increasingly difficult at this scale.
3. In‑Memory KV – Squirrel
3.1 Horizontal Scaling Issues
Gossip traffic grows quadratically with node count; a 900‑node cluster consumes ~12% CPU on gossip, crowding out request‑processing threads and causing timeouts.
3.2 Gossip Optimization
Implemented Merkle‑tree digests to cut gossip payload by more than 90%, added periodic full‑state synchronization to guard against hash collisions, and moved gossip handling to a dedicated heartbeat thread so request threads can read cluster state without taking locks.
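To make the mechanism concrete, here is a minimal Go sketch of the digest idea (Squirrel's implementation is in C, and the metadata fields and names here are illustrative assumptions): each node hashes its per‑peer metadata into leaf digests and a root digest, heartbeats carry only the root, and a full state exchange happens only when roots differ.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// NodeMeta is an illustrative slice of the per-node state that gossip
// messages would otherwise carry in full.
type NodeMeta struct {
	ID     string
	Addr   string
	Slots  string // e.g. "0-5460"
	Epoch  uint64
	Status string
}

// leafDigest hashes one node's metadata.
func leafDigest(m NodeMeta) [32]byte {
	return sha256.Sum256([]byte(fmt.Sprintf("%s|%s|%s|%d|%s",
		m.ID, m.Addr, m.Slots, m.Epoch, m.Status)))
}

// rootDigest combines the leaf digests (sorted by node ID for determinism)
// into a single root, Merkle-style. Heartbeats only need to carry this value.
func rootDigest(nodes []NodeMeta) [32]byte {
	sorted := append([]NodeMeta(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })
	h := sha256.New()
	for _, m := range sorted {
		leaf := leafDigest(m)
		h.Write(leaf[:])
	}
	var root [32]byte
	copy(root[:], h.Sum(nil))
	return root
}

func main() {
	local := []NodeMeta{
		{"n1", "10.0.0.1:7000", "0-5460", 3, "ok"},
		{"n2", "10.0.0.2:7000", "5461-10922", 3, "ok"},
	}
	remote := append([]NodeMeta(nil), local...)
	remote[1].Epoch = 4 // the peer has seen a newer epoch for n2

	if rootDigest(local) != rootDigest(remote) {
		fmt.Println("digest mismatch: request full state for the changed nodes")
	}
}
```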
3.3 Vertical Scaling Issues
Large‑memory nodes trigger costly fork‑based RDB snapshots (e.g., an 8 GB node incurs a ~500 ms pause), and scaling reads with replicas cannot keep up with write demand.
3.4 Fork‑less RDB
Created a snapshot mechanism that pauses hash‑table rehashing, iterates keys with a cursor, dumps each entry to a replication queue, and captures concurrent updates, eliminating fork‑induced pauses while still handling large keys.
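The Go sketch below illustrates only the cursor idea, under simplifying assumptions (a fixed bucket array, a plain mutex instead of the engine's real synchronization, and no large‑key chunking): updates to buckets the cursor has already passed are forwarded to the same replication queue, so the snapshot plus the deltas stay consistent without a fork.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numBuckets = 16

// Store is a toy hash table that produces a consistent snapshot without
// forking: a cursor walks the buckets, and concurrent updates to buckets the
// cursor has already passed are appended to the same replication queue.
type Store struct {
	mu      sync.Mutex
	buckets [numBuckets]map[string]string
	cursor  int      // next bucket to dump; -1 when no snapshot is running
	queue   []string // stands in for the replication/RDB output stream
}

func NewStore() *Store {
	s := &Store{cursor: -1}
	for i := range s.buckets {
		s.buckets[i] = map[string]string{}
	}
	return s
}

func bucketOf(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numBuckets
}

func (s *Store) Set(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	b := bucketOf(key)
	s.buckets[b][key] = val
	// If a snapshot is in progress and this bucket was already dumped,
	// forward the update so the snapshot plus deltas stay consistent.
	if s.cursor >= 0 && b < s.cursor {
		s.queue = append(s.queue, fmt.Sprintf("DELTA %s=%s", key, val))
	}
}

// Snapshot walks bucket by bucket under short lock sections, so there is no
// long fork-style pause. (The real engine also pauses rehashing and splits
// large keys into chunks; both are omitted here.)
func (s *Store) Snapshot() {
	s.mu.Lock()
	s.cursor = 0
	s.mu.Unlock()
	for i := 0; i < numBuckets; i++ {
		s.mu.Lock()
		for k, v := range s.buckets[i] {
			s.queue = append(s.queue, fmt.Sprintf("DUMP %s=%s", k, v))
		}
		s.cursor = i + 1
		s.mu.Unlock()
	}
	s.mu.Lock()
	s.cursor = -1
	s.mu.Unlock()
}

func main() {
	s := NewStore()
	s.Set("a", "1")
	s.Set("b", "2")
	s.Snapshot() // concurrent Sets during this call would become DELTA records
	fmt.Println(s.queue)
}
```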
3.5 Work‑Thread Multi‑Threading
Extended the community IO‑threading design by parallelizing the entire request‑processing path (a run‑to‑completion model), which reduces thread switching and widens the portion of each request handled in parallel, achieving a ~70% throughput gain over community IO‑threading and more than 3× over the single‑threaded design.
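As a rough illustration of run‑to‑completion (not Squirrel's C code), the Go sketch below lets the same goroutine read, parse, execute, and answer each request, with no handoff between an IO pool and a worker pool; the toy GET/SET protocol is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"sync"
)

var (
	mu   sync.RWMutex
	data = map[string]string{}
)

// execute runs one parsed command against the in-memory map.
func execute(line string) string {
	parts := strings.Fields(line)
	switch {
	case len(parts) == 3 && parts[0] == "SET":
		mu.Lock()
		data[parts[1]] = parts[2]
		mu.Unlock()
		return "OK"
	case len(parts) == 2 && parts[0] == "GET":
		mu.RLock()
		v := data[parts[1]]
		mu.RUnlock()
		return v
	}
	return "ERR"
}

// handle owns the connection for its whole lifetime: read -> parse ->
// execute -> reply all happen on the same goroutine, with no handoff.
func handle(c net.Conn) {
	defer c.Close()
	r := bufio.NewScanner(c)
	for r.Scan() {
		fmt.Fprintln(c, execute(r.Text()))
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening on", ln.Addr())
	for {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		go handle(c)
	}
}
```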
3.6 Availability Enhancements
Introduced witness nodes (inspired by Google Spanner) to achieve quorum with fewer physical machines, reducing cross‑datacenter resource requirements while preventing split‑brain scenarios.
3.7 Two‑Datacenter Disaster Recovery
Deployed witness nodes with weighted voting, so a large cluster maintains high availability while consuming roughly the resources of only two datacenters.
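A minimal sketch of the weighted‑quorum arithmetic follows, with an illustrative topology (two data‑bearing replicas per datacenter plus one lightweight witness); the actual election protocol and vote weights are Squirrel‑specific and not shown here.

```go
package main

import "fmt"

// Voter is a cluster member that participates in leader election.
// A witness carries voting weight but holds no user data.
type Voter struct {
	Name    string
	Weight  int
	Witness bool
}

// hasQuorum reports whether the reachable voters hold a strict majority of
// the total voting weight, which is what prevents split-brain when one
// datacenter is isolated.
func hasQuorum(all, reachable []Voter) bool {
	total, got := 0, 0
	for _, v := range all {
		total += v.Weight
	}
	for _, v := range reachable {
		got += v.Weight
	}
	return 2*got > total
}

func main() {
	// Two data-bearing replicas per datacenter plus one lightweight witness.
	cluster := []Voter{
		{"dcA-1", 1, false}, {"dcA-2", 1, false},
		{"dcB-1", 1, false}, {"dcB-2", 1, false},
		{"witness", 1, true},
	}
	// Datacenter A is lost; B plus the witness still hold 3/5 of the weight.
	survivors := cluster[2:]
	fmt.Println("can elect a leader:", hasQuorum(cluster, survivors)) // true
}
```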
3.8 Cross‑Region Disaster Recovery
Implemented an inter‑cluster synchronization service that pulls the RDB snapshot and incremental logs from the remote cluster and replays them as normal write requests, enabling bidirectional synchronization so services can write locally and read in either region.
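The pull‑and‑replay loop might look roughly like the following Go sketch; the RemoteCluster and LocalClient interfaces and the fake implementations are assumptions standing in for the real replication and client APIs.

```go
package main

import "fmt"

// Entry is a single key/value record pulled from the remote cluster.
type Entry struct{ Key, Val string }

// RemoteCluster stands in for the remote cluster's replication API.
type RemoteCluster interface {
	Snapshot() []Entry               // full RDB-style dump
	IncrementalLog(from int) []Entry // log entries after a given offset
}

// LocalClient stands in for the local cluster's ordinary client API:
// replayed entries look like normal write requests.
type LocalClient interface {
	Set(key, val string) error
}

// syncOnce pulls the snapshot and then the incremental log, replaying both
// through the local cluster's normal write path.
func syncOnce(remote RemoteCluster, local LocalClient) error {
	for _, e := range remote.Snapshot() {
		if err := local.Set(e.Key, e.Val); err != nil {
			return err
		}
	}
	for _, e := range remote.IncrementalLog(0) {
		if err := local.Set(e.Key, e.Val); err != nil {
			return err
		}
	}
	return nil
}

// Tiny fakes so the sketch runs end to end.
type fakeRemote struct{}

func (fakeRemote) Snapshot() []Entry               { return []Entry{{"user:1", "alice"}} }
func (fakeRemote) IncrementalLog(from int) []Entry { return []Entry{{"user:2", "bob"}} }

type fakeLocal map[string]string

func (l fakeLocal) Set(k, v string) error { l[k] = v; return nil }

func main() {
	local := fakeLocal{}
	if err := syncOnce(fakeRemote{}, local); err != nil {
		panic(err)
	}
	fmt.Println(local) // map[user:1:alice user:2:bob]
}
```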
3.9 Automatic Conflict Resolution
Adopted a last‑write‑wins strategy based on timestamps: newer timestamps overwrite older ones; when timestamps tie, the larger cluster ID wins. Additional mechanisms store recent timestamps, cluster IDs, and deleted keys to handle clock rollback, concurrent writes, and delete‑write races.
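The core of the last‑write‑wins rule fits in a few lines; the Go sketch below (field names are illustrative) shows only the timestamp‑then‑cluster‑ID comparison and leaves out the stored history used for clock rollback and delete handling.

```go
package main

import "fmt"

// Versioned carries the metadata last-write-wins resolution needs;
// the field names are illustrative, not Squirrel's actual layout.
type Versioned struct {
	Value     string
	Timestamp int64 // write time in milliseconds
	ClusterID uint32
}

// newerWins reports whether candidate should overwrite current:
// the higher timestamp wins, and a tie is broken by the larger cluster ID.
func newerWins(current, candidate Versioned) bool {
	if candidate.Timestamp != current.Timestamp {
		return candidate.Timestamp > current.Timestamp
	}
	return candidate.ClusterID > current.ClusterID
}

func main() {
	a := Versioned{"v-from-cluster-1", 1700000000123, 1}
	b := Versioned{"v-from-cluster-2", 1700000000123, 2}
	fmt.Println(newerWins(a, b)) // true: same timestamp, larger cluster ID
}
```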
4. Persistent KV – Cellar
4.1 Vertical Scaling Challenges
Cellar’s LSM‑Tree engine suffers from write amplification, so write throughput is lower than read throughput, and simply increasing the thread count raises inter‑thread synchronization overhead.
4.2 Bulkload Data Import
Bulkload routes large offline data imports through an S3‑like object store: clients generate sorted shard files locally and upload them, and storage nodes ingest the whole files directly, avoiding in‑memory sorting and easing compaction pressure. This speeds imports by about 5× (e.g., a 14‑hour job now finishes in under 3 hours).
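A simplified Go sketch of the client‑side half of bulkload, assuming a hash‑based shard function and a plain text file format (the real job writes the engine's own file format and uploads to the object store, both omitted here):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os"
	"path/filepath"
	"sort"
)

const numShards = 4

type Record struct{ Key, Val string }

// shardOf maps a key to its shard; a hash-based placement is assumed here.
func shardOf(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numShards
}

// buildShardFiles partitions records by shard and sorts each partition by
// key, producing files a storage node could ingest as-is, without in-memory
// sorting or heavy compaction on the server side.
func buildShardFiles(records []Record, dir string) error {
	parts := make([][]Record, numShards)
	for _, r := range records {
		s := shardOf(r.Key)
		parts[s] = append(parts[s], r)
	}
	for s, p := range parts {
		sort.Slice(p, func(i, j int) bool { return p[i].Key < p[j].Key })
		f, err := os.Create(filepath.Join(dir, fmt.Sprintf("shard-%d.data", s)))
		if err != nil {
			return err
		}
		for _, r := range p {
			fmt.Fprintf(f, "%s\t%s\n", r.Key, r.Val)
		}
		if err := f.Close(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	recs := []Record{{"user:42", "a"}, {"user:7", "b"}, {"item:9", "c"}}
	if err := buildShardFiles(recs, os.TempDir()); err != nil {
		panic(err)
	}
	fmt.Println("shard files written to", os.TempDir())
}
```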
4.3 Thread Scheduling Model Optimization
The original design used a single work‑thread pool, so slow offline imports could block fast online reads. It was re‑architected into four queues and four thread pools (fast‑read, slow‑read, fast‑write, slow‑write), isolating the request types.
Added dynamic thread sharing, in which idle threads from less‑busy pools poll the busy queues. Because this sharing increased CPU usage under high load, a dedicated scheduler thread now monitors load and a separate idle‑thread pool applies adjustments in real time, cutting the CPU overhead and boosting throughput by more than 30%.
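The isolation part of this design can be sketched as four queues, each drained by its own pool, as in the Go snippet below; the class names and pool sizes are illustrative, and the dynamic sharing plus the dedicated scheduler thread are omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Request classes map to four queues so slow bulkload writes cannot block
// fast online reads.
type Class int

const (
	FastRead Class = iota
	SlowRead
	FastWrite
	SlowWrite
	numClasses
)

type Request struct {
	Class Class
	Name  string
}

func main() {
	var queues [numClasses]chan Request
	workers := [numClasses]int{4, 2, 2, 1} // more threads for latency-critical reads
	var wg sync.WaitGroup

	for c := Class(0); c < numClasses; c++ {
		queues[c] = make(chan Request, 128)
		for i := 0; i < workers[c]; i++ {
			wg.Add(1)
			go func(c Class) { // each pool only drains its own queue
				defer wg.Done()
				for r := range queues[c] {
					time.Sleep(time.Millisecond) // stand-in for real work
					fmt.Println("done:", r.Name)
				}
			}(c)
		}
	}

	queues[SlowWrite] <- Request{SlowWrite, "bulkload-shard-3"}
	queues[FastRead] <- Request{FastRead, "get user:42"} // not stuck behind the bulkload

	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}
```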
4.4 RTC (Run‑to‑Completion) Model
When a read‑only request hits the in‑memory engine, the network thread now serves it directly, bypassing the work‑thread pipeline; this reduces CPU cache misses and improves read throughput by more than 30% at an 80% hit rate.
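A rough Go sketch of this read path: the network goroutine probes the in‑memory engine and replies inline on a hit, and only misses are handed off to the worker pipeline. The names and the channel‑based handoff are illustrative, not Cellar's actual structure.

```go
package main

import (
	"fmt"
	"sync"
)

type server struct {
	mu       sync.RWMutex
	memTable map[string]string
	misses   chan string // requests that must go through the slower disk path
}

// handleGet runs on the network goroutine itself: on a memory-engine hit it
// replies inline with no handoff (and thus fewer cache misses); only a miss
// falls back to the normal worker pipeline.
func (s *server) handleGet(key string) {
	s.mu.RLock()
	v, ok := s.memTable[key]
	s.mu.RUnlock()
	if ok {
		fmt.Println("hit, replied inline:", key, "=", v)
		return
	}
	s.misses <- key
}

func main() {
	s := &server{
		memTable: map[string]string{"user:1": "alice"},
		misses:   make(chan string, 16),
	}
	done := make(chan struct{})
	go func() { // stand-in for the disk-path worker pool
		for k := range s.misses {
			fmt.Println("miss, served from LSM engine:", k)
		}
		close(done)
	}()
	s.handleGet("user:1")
	s.handleGet("user:2")
	close(s.misses)
	<-done
}
```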
4.5 Lock‑Free Memory Engine
Replaced the lock‑protected hash map with a single‑writer, multiple‑reader lock‑free linked list and introduced RCU for asynchronous memory reclamation; the SlabManager now uses per‑size‑class locks. Together these changes yield read‑performance gains of more than 30%.
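The single‑writer, multiple‑reader pattern can be sketched in Go with atomic pointers, as below; Go's garbage collector stands in for the RCU‑style deferred reclamation a C++ engine needs, so this illustrates only the publication order, not Cellar's actual engine.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// node and list form a lock-free singly linked list: readers never take a
// lock, and the single writer publishes new nodes with atomic pointer stores.
type node struct {
	key, val string
	next     atomic.Pointer[node]
}

type list struct {
	head atomic.Pointer[node]
}

// insert is only ever called from one writer goroutine.
func (l *list) insert(key, val string) {
	n := &node{key: key, val: val}
	n.next.Store(l.head.Load())
	l.head.Store(n) // publish: readers see either the old or the new head
}

// get can be called from any number of reader goroutines concurrently.
func (l *list) get(key string) (string, bool) {
	for n := l.head.Load(); n != nil; n = n.next.Load() {
		if n.key == key {
			return n.val, true
		}
	}
	return "", false
}

func main() {
	var l list
	l.insert("user:1", "alice")
	l.insert("user:2", "bob")
	if v, ok := l.get("user:1"); ok {
		fmt.Println(v)
	}
}
```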
4.6 Availability Challenges
Cellar also employs cross‑region sync, embedding replication logic within storage nodes to simplify deployment and reduce operational cost.
4.7 Automatic Conflict Resolution
Cellar uses Hybrid Logical Clock (HLC) timestamps so that timestamps issued within a cluster remain monotonic even under clock skew; when timestamps from different clusters still tie, the cluster ID decides the winner.
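A compact HLC sketch in Go (not Cellar's code): the physical part follows the wall clock, the logical counter keeps locally issued timestamps monotonic even if the wall clock stalls or rolls back, and Update folds in timestamps observed from remote writes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// HLC is a minimal hybrid logical clock.
type HLC struct {
	mu      sync.Mutex
	wall    int64 // last physical component, in milliseconds
	logical int64
}

// Now issues a timestamp for a local write.
func (c *HLC) Now() (int64, int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	pt := time.Now().UnixMilli()
	if pt > c.wall {
		c.wall, c.logical = pt, 0
	} else {
		c.logical++ // wall clock stalled or rolled back: keep increasing
	}
	return c.wall, c.logical
}

// Update merges a timestamp observed from a remote write so local timestamps
// stay ahead of everything this node has already seen.
func (c *HLC) Update(wall, logical int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	pt := time.Now().UnixMilli()
	switch {
	case pt > c.wall && pt > wall:
		c.wall, c.logical = pt, 0
	case wall > c.wall:
		c.wall, c.logical = wall, logical+1
	case wall == c.wall && logical > c.logical:
		c.logical = logical + 1
	default:
		c.logical++
	}
}

func main() {
	var clk HLC
	clk.Update(time.Now().UnixMilli()+5, 0) // a remote clock is 5 ms ahead
	w, l := clk.Now()
	fmt.Println(w, l) // the local timestamp is still strictly after the remote one
}
```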
5. Future Plans and Industry Trends
Remove ZooKeeper dependencies: replace with internal configuration/notification service for Squirrel and Raft‑based metadata store for Cellar.
Integrate vector engine capabilities to support large‑model training and inference workloads.
Explore cloud‑native deployment and orchestration to lower operational costs.
Investigate kernel‑bypass technologies (io_uring, DPDK, SPDK) to further boost I/O throughput.
Evaluate hardware accelerators such as compression‑offload SSDs and RDMA networking for latency reduction.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle-services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
