High‑Availability Practices for Account Services at Meituan/Dianping

Meituan/Dianping ensures its critical account service stays online by combining real‑time business monitoring, circuit‑breaker‑driven graceful degradation, and active‑active cross‑region deployment with isolated dependencies, versioned data sync, and automated cache updates, dramatically extending MTBF while cutting MTTR and latency.

Meituan Technology Team

May 31, 2018

In any internet company, regardless of its main business, the account system is a priceless asset. It is used to measure key metrics such as DAU, MAU and retention, and it provides a foundation for user profiling and downstream services.

System availability is evaluated by two indicators: MTBF (Mean Time Between Failures), the average time a system runs between failures, and MTTR (Mean Time To Recovery), the average time needed to recover from a failure. Together, these two metrics determine the familiar “nines” of availability.
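The standard steady‑state relation between the two metrics and availability can be written out as follows. The numbers in the worked example are illustrative only and are not figures from the account service.

```latex
\[
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
\]
% Illustrative values: MTBF = 999 h, MTTR = 1 h
\[
A = \frac{999}{999 + 1} = 0.999 \quad \text{(three nines)}
\]
% Halving MTTR to 0.5 h with the same MTBF:
\[
A = \frac{999}{999 + 0.5} \approx 0.9995
\]
```

Both levers appear in the same fraction: extending MTBF and shortening MTTR each push availability toward 1, which is why the sections below target them separately.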

1. Business Monitoring

To reduce MTTR, failures must be detected as early as possible. Business‑level monitoring focuses on the health of business metrics, such as login traffic, success/failure rates, user regions, app versions, browser types, referrer sources, and data‑center locations. Because the monitoring dimensions are numerous and change frequently, the team stores logs in Elasticsearch.

Each monitored metric has a baseline derived from its historical curve; when the current value deviates from that baseline by more than a threshold, an alarm is triggered. Each alarm is enriched with dimensional analysis so that the cause (traffic surge, attack, logging delay, or service issue) can be pinpointed quickly and alarm fatigue is avoided.
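A minimal sketch of baseline alarming with a per‑dimension breakdown might look like the following. The article does not say how the baseline or threshold is computed, so the mean plus or minus k standard deviations rule, the function names, and the event fields (`success`, `app_version`) are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, current, k=3.0):
    """Return True when `current` is more than k standard deviations
    away from the mean of `history` (assumed baseline rule).
    `history` needs at least two samples for stdev()."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Flat history: any change at all counts as a deviation.
        return current != baseline
    return abs(current - baseline) > k * spread


def breakdown(events, dimension):
    """Count failed events per value of one dimension, most frequent first.
    Each event is a dict with a boolean 'success' field (hypothetical schema)."""
    counts = {}
    for event in events:
        if not event["success"]:
            value = event[dimension]
            counts[value] = counts.get(value, 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

When an alarm fires, running `breakdown` over the same time window for each dimension (region, app version, data center) shows whether the failures are concentrated in one slice or spread evenly, which is the kind of dimensional analysis described above.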

2. Flexible Availability

Flexible availability aims to extend MTBF by isolating failures of downstream services. Critical authentication and query services are split and their dependencies are isolated. For non‑critical paths, graceful degradation is applied. For example, when Redis becomes unstable, the service switches to an internal cache middleware (Tair) as a fallback.

The team uses Hystrix or an in‑house middleware (Rhino) to implement circuit‑breaker degradation driven by recent failure rates, achieving millisecond‑level failover and significantly improving fault tolerance.
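The idea of failure‑rate‑driven degradation can be sketched in a few lines. This is not the Hystrix or Rhino API; the class name, parameters, and default values are hypothetical, and the half‑open trial logic is one common design choice rather than something the article specifies.

```python
import time
from collections import deque


class CircuitBreaker:
    """Failure-rate circuit breaker over a sliding window of recent calls."""

    def __init__(self, window=20, min_calls=10, failure_threshold=0.5,
                 cooldown_s=5.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.min_calls = min_calls
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.opened_at = None  # None means the breaker is closed

    def _should_open(self):
        if len(self.results) < self.min_calls:
            return False
        failure_rate = self.results.count(False) / len(self.results)
        return failure_rate >= self.failure_threshold

    def call(self, primary, fallback):
        half_open = False
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback()  # open: skip the primary entirely
            half_open = True       # cooldown elapsed: allow one trial call
        try:
            value = primary()
        except Exception:
            self.results.append(False)
            if half_open or self._should_open():
                self.opened_at = self.clock()  # (re)open the breaker
            return fallback()
        self.results.append(True)
        if half_open:
            # Trial succeeded: close the breaker and start a fresh window.
            self.opened_at = None
            self.results.clear()
        return value
```

In the Redis example above, `primary` would be the Redis read and `fallback` the Tair read. While the breaker is open, callers go straight to the fallback without waiting on the unstable dependency, which is what keeps failover fast.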

For critical paths, the system reduces impact by showing alternative login options when a node fails, using per‑entry counters and flags to drive UI messages (32 possible messages).

3. Multi‑Active Across Regions

To further extend MTBF, the team implements cross‑city active‑active deployment. The design follows three principles: (1) if one region fails, the other provides full service; (2) both regions serve traffic simultaneously; (3) both follow the BASE model (Basically Available, Soft state, Eventual consistency) and settle for eventual consistency.

The database adopts a one‑primary, multiple‑replica model to handle the read‑heavy workload (350:1 reads to writes). Redis cannot use a simple master‑slave mode across regions because of “snowball” effects, so a dual‑master setup with custom synchronization is used.

Data synchronization relies on an internal MQ platform (Mafka, similar to Kafka). Keys are hashed to partitions to guarantee ordering per key. To resolve write conflicts, a version‑number scheme (a long integer) is stored with each value; the larger version overwrites the smaller, ensuring consistency.
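The two mechanisms in this paragraph, per‑key partitioning and larger‑version‑wins writes, can be sketched as follows. This is a simplified in‑memory model, not Mafka's client API; the CRC32 hash, the class and function names, and the choice to drop equal‑version writes are assumptions for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a fixed partition so every update for that key
    travels through the same ordered queue (hash choice is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class VersionedStore:
    """Key-value store where each value carries an integer version."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def apply(self, key, version, value) -> bool:
        """Apply a replicated write; return False if it was discarded."""
        current = self._data.get(key)
        if current is not None and current[0] >= version:
            return False  # stale or duplicate write: keep the newer value
        self._data[key] = (version, value)
        return True

    def get(self, key):
        entry = self._data.get(key)
        return None if entry is None else entry[1]
```

Because a write is accepted only when its version is larger than the stored one, two regions that receive the same set of writes in different orders still end up holding the same value for each key.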

Cache synchronization is optimized by using SET (add‑or‑replace) instead of DELETE, and ADD (add‑if‑absent) for loading, achieving strong consistency without extra storage. The internal Databus component publishes DB change logs to drive cache updates.
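A small in‑memory model shows why pairing SET for updates with ADD for loads stops a slow reader from overwriting fresh data. The `Cache` class mimics add‑or‑replace and add‑if‑absent semantics; it and the two handler functions are hypothetical stand‑ins, not the Databus or cache client APIs.

```python
class Cache:
    """Toy cache with SET (add-or-replace) and ADD (add-if-absent)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def add(self, key, value) -> bool:
        if key in self._data:
            return False  # key already present: leave it untouched
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)


def on_db_change(cache, key, new_value):
    """Handle a change-log event: write the new value directly
    instead of deleting the key."""
    cache.set(key, new_value)


def load_on_miss(cache, key, value_read_from_db):
    """Populate the cache after a miss. ADD is a no-op if a change event
    has already written a value, so a stale DB read cannot win."""
    cache.add(key, value_read_from_db)
    return cache.get(key)
```

Consider a reader that misses the cache and reads an old row from the database, after which a change event writes the new value with SET. The reader's late ADD then finds the key present and does nothing. With a DELETE‑then‑reload scheme, the same interleaving could leave the stale value in the cache.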

After deployment, average latency and TP99/TP999 latency dropped by more than 80%, and during a real network outage the account read service remained fully available.

Summary

Achieving high availability requires continuous investment, regular disaster‑recovery drills, and meticulous engineering practices. Even a tiny bug can trigger a large‑scale outage, so every line of code and configuration must be carefully reviewed. High availability should become a mindset for all developers.
