
How Meituan‑Dianping Built a 100% High‑Availability Core Transaction System

This article analyzes the rapid growth challenges of Meituan‑Dianping's core payment flow, explains key availability metrics such as MTBF and MTTR, and presents a comprehensive set of architectural, operational, and tooling strategies—including dependency decoupling, timeout tuning, circuit breaking, and full‑link stress testing—to achieve stable, fault‑tolerant transactions.

Meituan Technology Team

Background

Every system has core metrics; for payment processing, accuracy and efficiency are paramount. The core transaction system carries 100% of the traffic for Meituan‑Dianping's intelligent payment business, so its stability is the top priority.

Problem Trigger

Daily order volume grew from tens of thousands in early 2017 to over 7 million by year‑end. Over the same period the number of payment channels increased, the processing chain lengthened, and the product range widened (POS, QR codes, mini‑boxes, etc.). The system struggled to keep up: incidents became frequent, and upgrades to upstream or downstream services set off cascading “butterfly effects”.

Problem Analysis – Availability Metrics

High availability is measured by downtime; Meituan uses the OCTO governance platform to compute availability. Key metrics include:

Mean Time Between Failures (MTBF): average time the system runs normally between one failure and the next.

Mean Time To Repair (MTTR): average time to recover from a failure.

For core transactions, the ideal is zero failures. When a failure does occur, what matters is how quickly it is repaired and how much of the system it affects (its scope).
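The relationship between the two metrics can be made concrete with the standard steady‑state formula, availability = MTBF / (MTBF + MTTR). The sketch below is only an illustration of that formula; how OCTO actually computes availability is not described here, and the function names and sample figures are assumptions.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


# Illustrative figures only: one failure roughly every 999 hours, one hour to repair.
a = availability(mtbf_hours=999, mttr_hours=1)
downtime = yearly_downtime_minutes(a)
```

The formula shows the two levers the rest of the article pulls: failing less often raises MTBF, and recovering faster lowers MTTR.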

Solution Overview

1. Reduce Incident Frequency

1.1 Eliminate, Weaken, and Control Dependencies (STAR Method)

Situation: Design system A to process POS payments, apply discounts, and handle loyalty points.

Task: Identify explicit and implicit requirements.

Action: Separate the payment flow into a dedicated “payment sub‑system” and isolate other functions (refund, settlement, data sync, order view) as independent services that only read order data.

Result: The payment sub‑system no longer depends on the other services, achieving dependency elimination and control.

1.2 Keep Transactions Free of External Calls

Move RPC, HTTP, message‑queue, and cache operations out of database transactions. A remote call inside a transaction keeps the transaction (and its database connection) open for as long as the call takes, so slow calls can exhaust the connection pool and stall the whole system. Monitor for large transactions, and prefer annotation‑based transaction management over XML configuration.
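A minimal sketch of the pattern, using Python's built‑in sqlite3 purely for illustration: the transaction contains only the local write, and the remote call happens after the commit. The `orders` table and the `notify_downstream` helper are hypothetical stand‑ins, not part of the original system.

```python
import sqlite3


def notify_downstream(order_id: int) -> str:
    """Hypothetical stand-in for an RPC/HTTP/MQ call, which may be slow in practice."""
    return f"notified:{order_id}"


def pay_order(conn: sqlite3.Connection, order_id: int) -> str:
    # 1. Keep the transaction short: only local database writes go inside it.
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE orders SET status = 'PAID' WHERE id = ?", (order_id,))
    # 2. The remote call runs after the commit, so a slow downstream
    #    service no longer holds a database transaction open.
    return notify_downstream(order_id)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'CREATED')")
conn.commit()
result = pay_order(conn, 1)
```

One trade‑off of this ordering is that the notification can fail after the payment has been committed, so a real system would need some retry or compensation step for that case.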

1.3 Set Reasonable Timeouts and Retries

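Measure the 99th‑percentile (TP99) response time of each downstream service and set the caller's timeout about 50% above it. For third‑party services with volatile latency, the 95th percentile can be used as the baseline instead.

Limit the number of retries; for critical services the default is three.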

Improper timeout/retry settings can cause thread pool exhaustion and cascade failures (service avalanche).
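The timeout rule and the retry cap above can be sketched as follows. The nearest‑rank percentile method, the function names, and the sample latencies are assumptions made for illustration; "three retries" is read here as one initial attempt plus up to three retries.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggested_timeout_ms(latencies_ms, pct=99, headroom=0.5):
    """Timeout = chosen percentile latency plus ~50% headroom."""
    return percentile(latencies_ms, pct) * (1 + headroom)


def worst_case_wait_ms(timeout_ms, max_retries=3):
    """Longest time one caller thread can be blocked: every attempt times out."""
    return timeout_ms * (1 + max_retries)


def call_with_retries(fn, max_retries=3):
    """Call fn, retrying only on TimeoutError, at most max_retries extra times."""
    last_error = None
    for _ in range(1 + max_retries):
        try:
            return fn()
        except TimeoutError as exc:
            last_error = exc
    raise last_error


latencies = list(range(1, 101))  # illustrative samples: 1 ms .. 100 ms
timeout = suggested_timeout_ms(latencies)
blocked = worst_case_wait_ms(timeout, max_retries=3)
```

`worst_case_wait_ms` makes the avalanche risk visible: each retry multiplies the time a caller thread can stay occupied, which is why both the timeout and the retry count need an upper bound.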

1.4 Optimize Slow Queries

Separate real‑time, near‑real‑time, and offline queries; serve non‑real‑time queries from Elasticsearch.

Split reads from writes: send writes to the master and reads to replicas.

Keep each table to at most four indexes and avoid very large tables (more than 10 million rows).

1.5 Implement Circuit Breaking

Use Hystrix, Meituan’s Rhino, or manual circuit breaking to fail fast when downstream services are unavailable, preventing downstream failures from propagating upstream.
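To show the fail‑fast idea in isolation, here is a deliberately small circuit breaker. It is not how Hystrix or Rhino are implemented: it trips on a count of consecutive failures rather than an error rate over a rolling window, it is not thread‑safe, and all class and parameter names are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback=None):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast without touching the downstream service.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately instead of waiting on a timeout, which is what keeps a downstream outage from tying up upstream threads.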

1.6 “Don't Self‑Destruct” Principles

Seven practices: use mature technologies, keep responsibilities single‑purpose, standardize processes, automate operations, provide capacity redundancy (≥2×), continuously refactor, and promptly patch security vulnerabilities.

1.7 “Don't Get Killed by Others” – Rate Limiting

Apply token‑bucket, leaky‑bucket, or counter algorithms (e.g., Guava’s RateLimiter) and use Meituan’s OCTO throttling for Thrift interfaces. Hystrix or Rhino can also enforce custom limits.
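As an illustration of the token‑bucket algorithm named above, the following sketch refills tokens at a fixed rate up to a burst capacity and rejects requests when the bucket is empty. It is a simplified, single‑threaded model written for this article, not Guava's RateLimiter or OCTO's throttling; the class, method, and parameter names are assumptions.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        if rate_per_s <= 0 or capacity <= 0:
            raise ValueError("rate_per_s and capacity must be positive")
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so short bursts are allowed
        self._last = clock()

    def try_acquire(self, tokens=1):
        """Return True and consume tokens if available; never blocks."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, but never above the burst capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A caller would typically check `try_acquire()` at the service entry point and return a "busy" response when it is False, so excess traffic is shed before it reaches the payment path.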

2. Fault Isolation

Isolation limits how far a failure can spread. Physical isolation keeps internal and external services on separate servers. Thread‑pool isolation (as in Hystrix command pools) and semaphore isolation cap the concurrency each dependency may consume, so a single failing dependency cannot take down the whole system.
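Semaphore isolation is the simpler of the two concurrency limits to sketch: each dependency gets a fixed number of permits, and a call that cannot get one is rejected immediately rather than queued. The class below is an illustrative assumption, not Hystrix's implementation.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when the dependency's concurrency limit is already reached."""


class SemaphoreBulkhead:
    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        # Non-blocking acquire: reject at once instead of piling up waiting threads.
        if not self._sem.acquire(blocking=False):
            raise BulkheadFullError("too many concurrent calls to this dependency")
        try:
            return fn()  # runs in the caller's own thread
        finally:
            self._sem.release()
```

Giving each downstream dependency its own `SemaphoreBulkhead` means a slow dependency can occupy at most its own permits, leaving the remaining threads free for other work.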

3. Fast Fault Recovery

Discovery: Pre‑emptive load testing, fault drills, and real‑time monitoring/alerts.

Full‑Link Online Stress Testing: Replay production logs with data masking, use shadow tables, mock external calls, and employ Meituan’s pTest tool to validate capacity, bottlenecks, and protection mechanisms.

Rapid Localization: Define concise log standards and leverage component‑level monitoring for quick root‑cause analysis.

Rapid Resolution: Integrate detection, localization, and remediation tools into a unified platform to avoid context‑switching between disparate systems.
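The shadow‑table technique mentioned under full‑link stress testing can be sketched as a routing step in the data layer: requests flagged as stress‑test traffic are written to a parallel table so they never mix with real orders. How pTest actually marks and routes traffic is not described here; the `shadow_` prefix, the `is_stress_test` flag, and the `orders` schema are all assumptions for illustration.

```python
import sqlite3

SHADOW_PREFIX = "shadow_"  # hypothetical naming convention
KNOWN_TABLES = {"orders"}  # whitelist, since table names cannot be bound as SQL parameters


def table_for(base_table: str, is_stress_test: bool) -> str:
    """Pick the real table for production traffic, the shadow table for test traffic."""
    if base_table not in KNOWN_TABLES:
        raise ValueError(f"unknown table: {base_table}")
    return SHADOW_PREFIX + base_table if is_stress_test else base_table


def save_order(conn, order_id: int, amount_cents: int, is_stress_test: bool) -> None:
    table = table_for("orders", is_stress_test)
    with conn:  # commit on success, roll back on error
        conn.execute(
            f"INSERT INTO {table} (id, amount_cents) VALUES (?, ?)",
            (order_id, amount_cents),
        )


conn = sqlite3.connect(":memory:")
for name in ("orders", "shadow_orders"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
save_order(conn, 1, 1999, is_stress_test=False)  # real traffic
save_order(conn, 2, 500, is_stress_test=True)    # replayed stress-test traffic
```

Because the business code path is the same for both kinds of traffic, the stress test exercises the real logic while the production tables stay clean.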

Tool Introduction

Hystrix

Implements the circuit‑breaker pattern with thread‑pool and semaphore isolation. Thread‑pool isolation runs commands in separate threads, while semaphore isolation limits concurrent calls within the caller thread.

Rhino

Meituan’s in‑house fault‑tolerance component built on CAT monitoring, offering dynamic configuration (e.g., forced circuit breaking, failure‑rate adjustment) and supporting both circuit breaking and rate limiting.

Summary & Outlook

The core transaction high‑availability effort is currently in the “first stage” of understanding past practices and learning from them. The next stage will focus on continuous improvement, deeper reliability insights, and further architectural refinements.
