How Meituan‑Dianping Built a 100% High‑Availability Core Transaction System
This article analyzes the rapid growth challenges of Meituan‑Dianping's core payment flow, explains key availability metrics such as MTBF and MTTR, and presents a comprehensive set of architectural, operational, and tooling strategies—including dependency decoupling, timeout tuning, circuit breaking, and full‑link stress testing—to achieve stable, fault‑tolerant transactions.
Background
Every system has core metrics; for payment processing, accuracy and efficiency are paramount. Meituan‑Dianping’s intelligent payment handles 100% of transaction traffic, making stability the top priority.
Problem Trigger
Daily order volume surged from tens of thousands in early 2017 to over 7 million by year‑end. Over the same period, payment channels multiplied, the processing chain grew longer, and the product range diversified (POS, QR codes, mini‑boxes, etc.). The system struggled to keep up: incidents became frequent, and upgrades to upstream or downstream services triggered cascading “butterfly effects”.
Problem Analysis – Availability Metrics
High availability is measured by downtime; Meituan uses the OCTO governance platform to compute availability. Key metrics include:
Mean Time Between Failures (MTBF): average time the system runs normally between one failure and the next.
Mean Time To Repair (MTTR): average time to recover from a failure.
For core transactions, the ideal is zero failures; when failures do occur, the scope of their impact matters as well as their duration.
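The two metrics combine into a single availability figure through the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The article does not say how OCTO computes its number, so the sketch below is only an illustration of that textbook relationship; the function names and sample values are assumptions.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given MTBF and MTTR in hours."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(avail: float) -> float:
    """Downtime per 365-day year implied by an availability fraction."""
    return (1.0 - avail) * 365 * 24 * 60


# Illustrative values only: one hour of repair for every 999 hours of uptime.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"availability: {a:.3%}, downtime per year: {yearly_downtime_minutes(a):.1f} min")
```

The formula shows why the article attacks both sides: fewer incidents raise MTBF, and faster recovery lowers MTTR.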
Solution Overview
1. Reduce Incident Frequency
1.1 Eliminate, Weaken, and Control Dependencies (STAR Method)
Situation: Design system A to process POS payments, apply discounts, and handle loyalty points.
Task: Identify explicit and implicit requirements.
Action: Separate the payment flow into a dedicated “payment sub‑system” and isolate other functions (refund, settlement, data sync, order view) as independent services that only read order data.
Result: The payment sub‑system no longer depends on the other services, achieving dependency elimination and control.
1.2 Keep Transactions Free of External Calls
Move RPC/HTTP/message‑queue/cache operations out of database transactions to avoid long‑running transactions that can exhaust connection pools and cause system‑wide stalls. Monitor large transactions and prefer annotation‑based transaction management over XML.
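A minimal sketch of this rule, using Python's built-in sqlite3 module as a stand-in database. The `call_payment_channel` function, the `orders` table, and the status values are hypothetical; the point is only the ordering: the slow external call finishes before the transaction opens, so the transaction holds a connection just long enough to write.

```python
import sqlite3


def call_payment_channel(order_id: str) -> str:
    """Hypothetical stand-in for a slow RPC/HTTP call to a payment channel."""
    return "PAID"


def pay_order(conn: sqlite3.Connection, order_id: str) -> str:
    # 1. The external call runs first, while no transaction is open.
    status = call_payment_channel(order_id)
    # 2. The transaction wraps only the database write, so it stays short.
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "UPDATE orders SET status = ? WHERE id = ?", (status, order_id)
        )
    return status


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES ('o-1', 'CREATED')")
conn.commit()
pay_order(conn, "o-1")
```

Had the channel call been placed inside the `with conn:` block, every slow or hanging channel response would pin a database connection for its full duration.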
1.3 Set Reasonable Timeouts and Retries
Measure the 99th‑percentile response time of each downstream service and set the caller's timeout about 50% above it; for third parties with volatile latency, base the timeout on the 95th‑percentile response time instead.
Limit retries (default three for critical services).
Improper timeout/retry settings can cause thread pool exhaustion and cascade failures (service avalanche).
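The timeout and retry rules above can be sketched as follows. This is an illustration under assumptions: it uses the nearest-rank percentile definition, reads "three retries" as three extra attempts after the first call, and retries only on `TimeoutError`. None of the function names come from the article.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def suggested_timeout_ms(latencies_ms, volatile_third_party=False):
    """p99 plus 50%, or plain p95 for a third party with volatile latency."""
    if volatile_third_party:
        return percentile(latencies_ms, 95)
    return percentile(latencies_ms, 99) * 1.5


def call_with_retries(func, max_retries=3):
    """Call func, retrying on timeout at most max_retries extra times."""
    last_error = None
    for _ in range(max_retries + 1):  # first attempt plus the retries
        try:
            return func()
        except TimeoutError as exc:
            last_error = exc
    raise last_error
```

Capping the retry count matters because every retry multiplies load on a downstream service that is already slow, which is one way the avalanche described above starts.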
1.4 Optimize Slow Queries
Separate real‑time, near‑real‑time, and offline queries; use Elasticsearch for non‑real‑time queries.
Read‑write splitting (master for writes, replicas for reads).
Limit table indexes to ≤4 per table and avoid massive tables (>10 million rows).
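Read‑write splitting can be illustrated with a small router that sends writes to the master and spreads reads across replicas. This is a simplified sketch: the class name, the round-robin policy, and the "statement starts with SELECT" test are all assumptions, and real routers also have to account for replication lag when a read must see a just-committed write.

```python
import itertools


class ReadWriteRouter:
    """Pick a database target per statement: master for writes, replicas for reads."""

    def __init__(self, master, replicas):
        self.master = master
        # Round-robin over replicas; fall back to the master if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.master


router = ReadWriteRouter("master", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM orders WHERE id = 1"))
print(router.route("UPDATE orders SET status = 'PAID' WHERE id = 1"))
```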
1.5 Implement Circuit Breaking
Use Hystrix, Meituan’s Rhino, or manual circuit breaking to fail fast when downstream services are unavailable, preventing downstream failures from propagating upstream.
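To make the fail-fast behavior concrete, here is a minimal hand-rolled circuit breaker. It is not the Hystrix or Rhino implementation: it opens after a fixed number of consecutive failures (Hystrix works from a failure rate over a rolling window), and the class name, thresholds, and injectable clock are assumptions made for the sketch.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("failing fast; downstream presumed unavailable")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a timeout, which frees their threads and gives the downstream service room to recover.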
1.6 “Don’t Self‑Destruct” Principles
Seven practices: use mature technologies, keep responsibilities single‑purpose, standardize processes, automate operations, provide capacity redundancy (≥2×), continuously refactor, and promptly patch security vulnerabilities.
1.7 “Don’t Get Killed by Others” – Rate Limiting
Apply token‑bucket, leaky‑bucket, or counter algorithms (e.g., Guava’s RateLimiter) and use Meituan’s OCTO throttling for Thrift interfaces. Hystrix or Rhino can also enforce custom limits.
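Of the algorithms listed, the token bucket is the easiest to show in a few lines. The sketch below is a generic, single-threaded illustration and not Guava's RateLimiter or OCTO's throttling; the class name, parameters, and injectable clock are assumptions.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def try_acquire(self, tokens=1):
        # Refill according to elapsed time, never beyond capacity.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False  # caller should reject or queue the request


bucket = TokenBucket(rate_per_s=100, capacity=200)
if not bucket.try_acquire():
    print("request rejected: rate limit exceeded")
```

A production limiter would also need a lock around `try_acquire` so that concurrent request threads cannot spend the same token twice.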
Fault Isolation
Isolation takes three forms: physical server isolation (internal vs. external), thread‑pool isolation (Hystrix command pools), and semaphore isolation. The latter two cap concurrency per dependency, so a single failing dependency cannot take down the whole system.
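Semaphore isolation is simple enough to sketch directly. The wrapper below gives each dependency a fixed number of permits and rejects calls beyond that limit instead of letting them pile up. The class name and the fallback-callable convention are assumptions for illustration, not the Hystrix API.

```python
import threading


class SemaphoreIsolation:
    """Cap the number of concurrent calls to one dependency."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        # Non-blocking acquire: when all permits are taken, reject immediately.
        if not self._sem.acquire(blocking=False):
            return fallback()
        try:
            return func()  # runs in the caller's own thread
        finally:
            self._sem.release()


coupon_service = SemaphoreIsolation(max_concurrent=10)
result = coupon_service.call(lambda: "coupon applied", lambda: "no coupon")
```

Because the call runs in the caller's thread, this approach limits concurrency but cannot enforce a timeout on a call that is already in flight.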
Fast Fault Recovery
Discovery: Pre‑emptive load testing, fault drills, and real‑time monitoring/alerts.
Full‑Link Online Stress Testing: Replay production logs with data masking, use shadow tables, mock external calls, and employ Meituan’s pTest tool to validate capacity, bottlenecks, and protection mechanisms.
Rapid Localization: Define concise log standards and leverage component‑level monitoring for quick root‑cause analysis.
Rapid Resolution: Integrate detection, localization, and remediation tools into a unified platform to avoid context‑switching between disparate systems.
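The shadow-table and data-masking ideas from the stress-testing item can be sketched as below. This is not how pTest works internally, which the article does not describe; the `_shadow` suffix, the `is_stress_test` flag, and the masking rule are all hypothetical choices made for the example.

```python
import sqlite3


def table_for(base_table: str, is_stress_test: bool) -> str:
    """Send replayed stress-test traffic to a shadow copy of the table."""
    return f"{base_table}_shadow" if is_stress_test else base_table


def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 digits, hide the rest."""
    if len(phone) < 7:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def record_order(conn, order_id: str, phone: str, is_stress_test: bool) -> None:
    # The table name is built from a fixed base name, never from user input.
    table = table_for("orders", is_stress_test)
    stored_phone = mask_phone(phone) if is_stress_test else phone
    with conn:
        conn.execute(
            f"INSERT INTO {table} (id, phone) VALUES (?, ?)", (order_id, stored_phone)
        )


conn = sqlite3.connect(":memory:")
for name in ("orders", "orders_shadow"):
    conn.execute(f"CREATE TABLE {name} (id TEXT PRIMARY KEY, phone TEXT)")
record_order(conn, "real-1", "13812345678", is_stress_test=False)
record_order(conn, "replay-1", "13812345678", is_stress_test=True)
```

Keeping replayed traffic in shadow tables lets the test exercise the real write path at production scale without polluting real order data.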
Tool Introduction
Hystrix
Implements the circuit‑breaker pattern with thread‑pool and semaphore isolation. Thread‑pool isolation runs commands in separate threads, while semaphore isolation limits concurrent calls within the caller thread.
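The thread‑pool variant can be approximated with Python's standard library, as a rough analogy rather than a port of Hystrix. The class name, pool size, and timeout are assumptions. Two differences from Hystrix are worth stating: `ThreadPoolExecutor` queues excess work instead of rejecting it, and a timed-out call keeps running in its worker thread until it finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class ThreadPoolIsolation:
    """Run calls to one dependency in a dedicated, bounded thread pool."""

    def __init__(self, max_workers: int, timeout_s: float):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._timeout_s = timeout_s

    def call(self, func, fallback):
        future = self._pool.submit(func)  # executes in a pool thread, not the caller's
        try:
            return future.result(timeout=self._timeout_s)
        except FutureTimeout:
            future.cancel()  # only has an effect if the task has not started yet
            return fallback()

    def shutdown(self):
        self._pool.shutdown(wait=True)


channel = ThreadPoolIsolation(max_workers=4, timeout_s=0.2)
print(channel.call(lambda: "channel response", lambda: "fallback"))
channel.shutdown()
```

Unlike semaphore isolation, the caller can give up after the timeout while the slow call is still running, at the cost of an extra thread hand-off on every call.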
Rhino
Meituan’s in‑house fault‑tolerance component built on CAT monitoring, offering dynamic configuration (e.g., forced circuit breaking, failure‑rate adjustment) and supporting both circuit breaking and rate limiting.
Summary & Outlook
The core transaction high‑availability effort is currently in the “first stage” of understanding past practices and learning from them. The next stage will focus on continuous improvement, deeper reliability insights, and further architectural refinements.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
