How Meituan Achieved Near‑Zero Downtime for Its Account Service

This article details Meituan's practical approaches to boosting account service reliability, covering MTBF/MTTR metrics, business‑level monitoring, flexible availability with circuit‑breaker patterns, cross‑region active‑active deployment, data synchronization techniques, and the measurable performance gains achieved.

ITPUB

Jun 5, 2018

Introduction

Every internet company maintains its own account system, which is a critical asset for measuring metrics such as DAU, MAU, and retention, as well as for building user profiles. As Meituan's business grew rapidly, the demand for high availability of the account service increased, prompting a series of engineering practices to improve reliability.

Reliability Metrics

Two key indicators are used to evaluate system availability:

MTBF (Mean Time Between Failures) – the average time the system runs without failure.

MTTR (Mean Time To Recovery) – the average time needed to recover after a failure.

These metrics translate into the familiar “nines” of availability: availability is MTBF / (MTBF + MTTR). Improving availability therefore means either extending MTBF or reducing MTTR.
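To make the relationship between the two metrics concrete, here is a minimal sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The function names and the sample numbers are illustrative assumptions, not figures from Meituan.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(avail: float) -> float:
    """Expected downtime per (365-day) year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical service: one failure every 30 days on average.
base = availability(mtbf_hours=720, mttr_hours=1.0)
# Same failure rate, but recovery is ten times faster.
faster_recovery = availability(mtbf_hours=720, mttr_hours=0.1)

print(f"base: {base:.5f}, faster recovery: {faster_recovery:.5f}")
```

The second case shows why cutting MTTR is often the cheaper lever: the failure rate is unchanged, yet availability rises because each incident costs less time.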

1. Business‑Level Monitoring

Early fault detection relies on comprehensive business monitoring. Unlike generic monitoring, business monitoring tracks specific metrics such as login counts, success rates, failure categories, user regions, app versions, browser types, referers, and data‑center locations.

Because the monitoring dimensions are numerous, Meituan stores logs in Elasticsearch. Each metric generates a baseline curve; deviations beyond a threshold trigger alerts.
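The baseline check can be sketched as follows. This is only an illustration: it assumes the baseline for a time slot is the mean of the same slot on previous days and that the threshold is a relative deviation. The article does not say how Meituan actually builds its baseline curves, and the function name `deviates` is hypothetical.

```python
from statistics import mean


def deviates(baseline: list[float], current: float, threshold: float = 0.3) -> bool:
    """Return True when `current` differs from the baseline mean by more
    than `threshold`, expressed as a fraction of the baseline."""
    expected = mean(baseline)
    if expected == 0:
        # No traffic expected at all: any traffic counts as a deviation.
        return current != 0
    return abs(current - expected) / expected > threshold


# Hypothetical login counts for the same minute on the previous three days.
history = [100.0, 110.0, 90.0]
print(deviates(history, 50.0))   # a 50% drop exceeds the 30% threshold
print(deviates(history, 105.0))  # a 5% rise does not
```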

To speed up root‑cause analysis, each alert now includes dimension‑level analysis, showing whether traffic spikes are due to promotions, attacks, or service issues, and reducing alert fatigue.

2. Flexible Availability (Graceful Degradation)

The goal is to keep core authentication and query services running even when downstream dependencies fail. Strategies include:

Service decomposition and resource isolation for downstream calls.

Fallback to alternative caches (e.g., using Tair when Redis becomes unstable).

Circuit‑breaker patterns via Hystrix or Meituan’s custom middleware Rhino, which predicts failure based on recent error rates and fails fast, achieving millisecond‑level degradation.

Metrics show a clear improvement in fault‑tolerance and reduced TP999 latency after applying these techniques.
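The fail-fast behaviour of a circuit breaker can be illustrated with a small sketch. This is not Hystrix's or Rhino's API. The class name, parameters, and reset policy are assumptions chosen to show the idea: track recent outcomes, open the circuit when the error rate is too high, and return a fallback without calling the dependency while the circuit is open.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window=20, min_calls=10, max_error_rate=0.5,
                 cooldown_s=5.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.min_calls = min_calls
        self.max_error_rate = max_error_rate
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.opened_at = None  # None means the circuit is closed

    def _error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback()  # open: fail fast, skip the dependency
            # Cooldown elapsed: close the circuit and start a fresh window.
            self.opened_at = None
            self.results.clear()
        try:
            result = fn()
        except Exception:
            self.results.append(False)
            if (len(self.results) >= self.min_calls
                    and self._error_rate() >= self.max_error_rate):
                self.opened_at = self.clock()
            return fallback()
        self.results.append(True)
        return result
```

A caller would wrap each downstream request, for example `breaker.call(lambda: fetch_profile(user_id), lambda: cached_profile(user_id))`, where both function names are hypothetical. Production breakers usually add a half-open state that admits only a few trial requests; this sketch simply reopens to all traffic after the cooldown.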

For critical login paths, a counter‑based flag system tracks the health of each entry point; when a node is down, the counter increments, and the UI displays one of 32 possible fallback messages, guiding users to alternative login methods.

3. Active‑Active Multi‑Region Deployment

Beyond graceful degradation, Meituan implements cross‑city active‑active redundancy to further extend MTBF. The design follows three principles:

If one region fails, the other provides full service.

Both regions serve traffic simultaneously.

Both adhere to BASE (Basically Available, Soft state, Eventual consistency) for data.

3.1 Architecture Design

Meituan evaluated set‑based sharding but chose a primary‑replica database layout, which suits the account service’s read‑heavy workload (a read‑to‑write ratio of about 350:1). For Redis, simple master‑slave replication across two regions was ruled out because of “snowball” sync failures, so a dual‑master setup with custom synchronization is used instead. DNS smart routing, same‑city SLB policies, and proximity‑aware RPC complete the design.

3.2 Data Synchronization

Data is reliably transferred via the internal MQ platform Mafka (Kafka‑like). To preserve order per key, a consistent hashing algorithm maps each key to a specific partition, guaranteeing ordered processing within that partition.
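The key-to-partition mapping can be sketched with a small hash ring. This illustrates consistent hashing in general, not Mafka's partitioner: the `HashRing` class, the virtual-node count, and the use of MD5 as the hash are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, partitions: int, vnodes: int = 64):
        # Place several virtual nodes per partition on the ring so that
        # keys spread evenly across partitions.
        points = sorted(
            (_hash(f"partition-{p}#{v}"), p)
            for p in range(partitions)
            for v in range(vnodes)
        )
        self._hashes = [h for h, _ in points]
        self._owners = [p for _, p in points]

    def partition_for(self, key: str) -> int:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._hashes)
        return self._owners[idx]


ring = HashRing(partitions=8)
# Every message for the same key lands in the same partition, so a single
# consumer of that partition sees the key's updates in order.
print(ring.partition_for("user:42") == ring.partition_for("user:42"))
```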

For cross‑region write conflicts, Meituan adopts a Raft‑style protocol: a single leader handles writes, followers replicate them, and if the leader is lost a follower is promoted. Each write carries a monotonically increasing version number (a 64‑bit integer); when two writes conflict, the one with the larger version wins, which ensures eventual consistency.
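The "larger version wins" rule can be sketched as below. The `Versioned` type and the `apply_write` function are hypothetical names; the only detail taken from the article is that each write carries a monotonically increasing integer version. The leader-election side of the protocol is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    version: int  # monotonically increasing; a 64-bit integer in the article


def apply_write(store: dict[str, Versioned], key: str, incoming: Versioned) -> bool:
    """Apply `incoming` only if it is newer than what the store holds.
    Returns True when the store changed."""
    current = store.get(key)
    if current is not None and current.version >= incoming.version:
        return False  # stale or duplicate write: ignore it
    store[key] = incoming
    return True


# Two regions receive the same two writes in opposite order.
w1 = Versioned("alice@old.example", version=7)
w2 = Versioned("alice@new.example", version=8)

region_a: dict[str, Versioned] = {}
region_b: dict[str, Versioned] = {}
for w in (w1, w2):
    apply_write(region_a, "user:1:email", w)
for w in (w2, w1):
    apply_write(region_b, "user:1:email", w)

print(region_a == region_b)  # both regions converge on version 8
```

Because the comparison depends only on the version and not on arrival order, replaying or reordering messages leaves both regions in the same final state.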

Cache synchronization is optimized by avoiding delete operations. When data changes, the cache entry is overwritten with set (add‑or‑replace) instead of being deleted; when the cache is loaded after a miss, add (add‑if‑absent) is used, so a stale value read earlier cannot overwrite a newer one. This preserves strong consistency without extra storage. The internal Databus component streams DB change logs to update caches efficiently.
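The difference between the two cache operations is easiest to see in a race between a reader and a writer. The in-memory `Cache` class below is a stand-in written for this example, with `set` and `add` semantics modelled on memcached-style commands. It is not the Redis, Tair, or Databus API.

```python
class Cache:
    """Tiny in-memory stand-in for a remote cache."""

    def __init__(self):
        self._data = {}

    def set(self, key, value) -> None:
        """Add or replace: always stores the value."""
        self._data[key] = value

    def add(self, key, value) -> bool:
        """Add if absent: stores the value only when the key is missing."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)


cache = Cache()

# 1. A reader misses the cache and reads the current row from the database.
stale_from_db = "nickname=old"

# 2. Before the reader writes it back, a writer updates the database and
#    refreshes the cache with set (instead of deleting the key).
cache.set("user:1", "nickname=new")

# 3. The reader now tries to populate the cache. With add, the stale value
#    is rejected because a newer entry is already present.
loaded = cache.add("user:1", stale_from_db)

print(loaded, cache.get("user:1"))
```

Had step 3 used set, the stale value would have silently replaced the fresh one, which is the inconsistency the add-if-absent rule is meant to prevent.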

Additional safeguards include a periodic scan task that compares data between regions and repairs inconsistencies.
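A scan-and-repair pass might look like the sketch below. The article does not describe how the task decides which copy is correct. This example assumes each record is stored as a `(version, value)` tuple and that the higher version wins, and the function name `repair` is hypothetical.

```python
def repair(region_a: dict, region_b: dict) -> int:
    """Compare two regions' records and copy the newer one over the older.
    Records are (version, value) tuples. Returns the number of keys fixed."""
    repaired = 0
    for key in region_a.keys() | region_b.keys():
        a, b = region_a.get(key), region_b.get(key)
        if a == b:
            continue
        # Pick whichever side holds the higher version; a missing record loses.
        winner = max((r for r in (a, b) if r is not None), key=lambda r: r[0])
        region_a[key] = winner
        region_b[key] = winner
        repaired += 1
    return repaired


beijing = {"user:1": (3, "alice"), "user:2": (5, "bob")}
shanghai = {"user:1": (3, "alice"), "user:2": (4, "bobby"), "user:3": (1, "carol")}

fixed = repair(beijing, shanghai)
print(fixed, beijing == shanghai)
```

In a real deployment the scan would run on a schedule and page through the keyspace in batches rather than loading both regions into memory.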

Results and Conclusion

After rollout, average latency and TP99/TP999 latency dropped by at least 80%. During a real network outage, the account read service remained fully available, preventing larger business impact.

High availability requires continuous investment: monthly disaster‑recovery drills, vigilant code reviews, and a mindset that treats reliability as a core design principle. Even a tiny bug can cause a major outage, so every change must be carefully considered.
