
How Meituan Dianping Built a Reliable MySQL Group Replication HA Architecture

This article details Meituan Dianping's practical experience deploying MySQL Group Replication (MGR) for CMDB high availability, covering background, MGR fundamentals, configuration limits, parameter tuning, architecture design, deployment timeline, typical issues, a custom Python client, and daily operational practices.

dbaplus Community

Background

MySQL Group Replication (MGR) became generally available in MySQL 5.7.17, offering built‑in high availability without external components. Meituan Dianping needed a self‑contained HA solution for its CMDB, because the CMDB could not rely on the existing MHA‑based architecture.

About MGR

MGR is implemented as a MySQL plugin that handles conflict detection and Paxos‑based communication. It synchronises nodes via the binary log (ROW format) and therefore feels familiar to DBAs accustomed to traditional master‑slave setups. Although it resembles Percona XtraDB Cluster, MGR still relies on binlog replication.

Solution Design

The team selected the single‑primary mode of MGR to avoid known multi‑primary issues. In this mode, when the primary fails the cluster automatically elects a new primary and applications redirect writes accordingly.

Key Considerations

MGR limitations (InnoDB‑only, mandatory primary key, ROW binlog format, no SAVEPOINT, etc.)

Required pre‑deployment checks: eliminate MyISAM tables, add missing primary keys, and remove SAVEPOINT usage.

Operational concerns such as network jitter, backup tool compatibility, and Online DDL support.
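The pre‑deployment checks above can be sketched as a small audit function. This is a minimal illustration, not the team's tooling: the input shape (dicts with schema, name, engine, and has_primary_key) is an assumption, and in practice the rows would probably be collected from information_schema.tables and information_schema.table_constraints.

```python
def find_violations(tables):
    """Return (table, reason) pairs for tables that break MGR's requirements.

    `tables` is an iterable of dicts with the hypothetical keys
    schema, name, engine, and has_primary_key.
    """
    problems = []
    for t in tables:
        full_name = f"{t['schema']}.{t['name']}"
        # MGR only supports InnoDB tables.
        if t["engine"].lower() != "innodb":
            problems.append((full_name, "non-InnoDB engine: " + t["engine"]))
        # Every replicated table needs a primary key.
        if not t["has_primary_key"]:
            problems.append((full_name, "missing primary key"))
    return problems


# Made-up sample metadata, for illustration only.
sample = [
    {"schema": "cmdb", "name": "hosts", "engine": "InnoDB", "has_primary_key": True},
    {"schema": "cmdb", "name": "audit_log", "engine": "MyISAM", "has_primary_key": True},
    {"schema": "cmdb", "name": "tags", "engine": "InnoDB", "has_primary_key": False},
]

for table, reason in find_violations(sample):
    print(table, "->", reason)
```

SAVEPOINT usage lives in application code rather than in table metadata, so it would need a separate search through the codebase or the query log.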

Parameter Tuning

Important parameters were adjusted based on testing:

group_replication_unreachable_majority_timeout: the default of 0 (no timeout) is unsafe; a finite timeout prevents nodes from staying UNREACHABLE indefinitely.

group_replication_compression_threshold: lowering the threshold reduced large‑transaction‑induced UNREACHABLE states.
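As a rough illustration of applying these two settings, the sketch below renders SET GLOBAL statements from a dictionary. The article does not publish the values the team chose, so the numbers here (30 seconds and 100000 bytes) are placeholders only, and the helper name is hypothetical. Whether a change takes effect without restarting Group Replication may depend on the server version, so check the reference manual before relying on it.

```python
def render_set_statements(settings):
    """Turn {variable: integer} into a sorted list of SET GLOBAL statements."""
    statements = []
    for name, value in sorted(settings.items()):
        if not name.startswith("group_replication_"):
            raise ValueError(f"unexpected variable: {name}")
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} needs a non-negative integer, got {value!r}")
        statements.append(f"SET GLOBAL {name} = {value};")
    return statements


# Placeholder values, NOT the values used in the article.
EXAMPLE_SETTINGS = {
    # Seconds a member in a minority partition waits before giving up (0 = wait forever).
    "group_replication_unreachable_majority_timeout": 30,
    # Messages larger than this many bytes are compressed.
    "group_replication_compression_threshold": 100000,
}

for stmt in render_set_statements(EXAMPLE_SETTINGS):
    print(stmt)
```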

Final Architecture

A three‑data‑center, three‑node MGR cluster was deployed as the core HA layer, with an additional master‑slave cluster for disaster recovery. This hybrid design balances MGR’s novelty with proven fallback mechanisms.

Deployment Timeline

Since 2018, three critical systems (process, reporting, and CMDB) have been migrated to MGR.

Typical Problems and Resolutions

1. Large Transactions

During reporting system rollout, intermittent UNREACHABLE states were observed. Analysis linked the issue to large transactions exceeding the network transmission window. Reducing group_replication_compression_threshold and limiting transaction size eliminated the problem.
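One common way to limit transaction size is to split a bulk write into small batches and commit each batch separately. The sketch below assumes a hypothetical table name and batch size; the article does not say how the team actually capped transaction size.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def delete_in_batches(table, ids, batch_size=1000):
    """Yield (sql, params) pairs; each pair is meant to run as its own transaction.

    `table` must be a trusted identifier, because it is interpolated into the SQL.
    The row ids are passed as parameters using %s placeholders.
    """
    for batch in chunked(ids, batch_size):
        placeholders = ", ".join(["%s"] * len(batch))
        yield f"DELETE FROM {table} WHERE id IN ({placeholders})", batch


# Hypothetical table and ids, for illustration only.
for sql, params in delete_in_batches("report_rows", [1, 2, 3, 4, 5], batch_size=2):
    print(sql, params)
```

Each yielded statement would be executed and committed on its own, so no single transaction has to carry the whole change set across the group.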

2. HANG During START/STOP

During a node‑down drill, Nginx hung because queries to performance_schema.replication_group_members blocked for ~10 seconds while MGR was stopping. Adding a timeout to these queries resolved the issue.
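The idea of bounding the membership query can be sketched with a generic deadline wrapper. This is an assumption about one possible approach, not the fix the team shipped: the function names are hypothetical and the slow query is simulated with a sleep. Where the database driver offers its own read timeout, that is usually the better option because it also releases the connection.

```python
import concurrent.futures
import time

# Shared worker pool for health-check calls.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, timeout_s, fallback=None):
    """Run fn() in a worker thread and return `fallback` if it exceeds timeout_s.

    Exceptions raised by fn() propagate to the caller. On timeout the worker
    thread keeps running in the background; the caller only stops waiting.
    """
    future = _executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback


def slow_membership_query():
    """Stand-in for a query against replication_group_members that blocks."""
    time.sleep(0.5)
    return ["node-a", "node-b", "node-c"]


# The caller gives up after 50 ms and treats the node as having no known members.
print(call_with_deadline(slow_membership_query, 0.05, fallback=[]))
```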

3. Data‑Center Failure

A bandwidth reduction in one data center triggered simultaneous MGR and master‑slave failovers. The root cause was DNS‑based connection logic that delayed address updates. The team migrated all CMDB access to an internal “Smart Client”.

Smart Client

Meituan Dianping developed an internal Python library built on MySQLdb that provides automatic primary election and read/write splitting for MGR, simplifying client‑side integration.
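The Smart Client itself is internal and its code is not shown in the article. The sketch below only illustrates the general idea of primary discovery and read/write splitting; every function name is hypothetical. It assumes membership rows come from performance_schema.replication_group_members and that the primary's member ID is read from the group_replication_primary_member status variable, as in MySQL 5.7.

```python
# Queries a client might run on any reachable node (MySQL 5.7 style).
MEMBERS_SQL = (
    "SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE "
    "FROM performance_schema.replication_group_members"
)
PRIMARY_SQL = (
    "SELECT VARIABLE_VALUE FROM performance_schema.global_status "
    "WHERE VARIABLE_NAME = 'group_replication_primary_member'"
)


def split_members(members, primary_id):
    """Separate ONLINE members into (primary, replicas)."""
    online = [m for m in members if m["MEMBER_STATE"] == "ONLINE"]
    primary = next((m for m in online if m["MEMBER_ID"] == primary_id), None)
    replicas = [m for m in online if m["MEMBER_ID"] != primary_id]
    return primary, replicas


def pick_endpoint(is_write, primary, replicas, request_no=0):
    """Send writes to the primary; spread reads over replicas round-robin.

    Reads fall back to the primary when no replica is ONLINE.
    """
    if is_write or not replicas:
        if primary is None:
            raise RuntimeError("no ONLINE primary available")
        return primary["MEMBER_HOST"], primary["MEMBER_PORT"]
    chosen = replicas[request_no % len(replicas)]
    return chosen["MEMBER_HOST"], chosen["MEMBER_PORT"]


# Made-up membership rows, for illustration only.
rows = [
    {"MEMBER_ID": "a", "MEMBER_HOST": "db1", "MEMBER_PORT": 3306, "MEMBER_STATE": "ONLINE"},
    {"MEMBER_ID": "b", "MEMBER_HOST": "db2", "MEMBER_PORT": 3306, "MEMBER_STATE": "ONLINE"},
    {"MEMBER_ID": "c", "MEMBER_HOST": "db3", "MEMBER_PORT": 3306, "MEMBER_STATE": "RECOVERING"},
]
primary, replicas = split_members(rows, primary_id="a")
print(pick_endpoint(True, primary, replicas))
print(pick_endpoint(False, primary, replicas, request_no=5))
```

Refreshing the membership rows periodically, rather than resolving a DNS name, is what would let such a client follow a failover quickly.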

Daily Operations

Operational monitoring focuses on two signals: node state (any state other than ONLINE is flagged) and replication “lag”, which is approximated by tracking the number of pending transactions. GTID set differences between the primary and the replicas are also observed.
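A monitoring rule along these lines might look like the sketch below. The input shape and the threshold of 100 pending transactions are assumptions for illustration. In practice the state could be read from MEMBER_STATE in performance_schema.replication_group_members and the pending count from COUNT_TRANSACTIONS_IN_QUEUE in performance_schema.replication_group_member_stats, polled on each node.

```python
def check_group(members, max_pending=100):
    """Return alert strings for members that are not ONLINE or are falling behind.

    `members` is a list of dicts with the hypothetical keys host, state, pending.
    """
    alerts = []
    for m in members:
        if m["state"] != "ONLINE":
            # Any non-ONLINE state is flagged, regardless of queue depth.
            alerts.append(f"{m['host']}: state is {m['state']}")
        elif m["pending"] > max_pending:
            alerts.append(f"{m['host']}: {m['pending']} transactions pending")
    return alerts


# Made-up snapshot, for illustration only.
snapshot = [
    {"host": "db1", "state": "ONLINE", "pending": 0},
    {"host": "db2", "state": "ONLINE", "pending": 250},
    {"host": "db3", "state": "RECOVERING", "pending": 0},
]

for alert in check_group(snapshot):
    print(alert)
```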

Conclusion

Extensive production testing shows that MGR can serve as a stable, reliable HA solution for DBAs, though it is less suited for write‑intensive workloads. The experience provides valuable insights for designing MySQL high‑availability architectures.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
