How Meituan Scaled Its Payment Gateway from Simple to Flexible High‑Availability Architecture
This article chronicles the evolution of Meituan's payment channel gateway: the early "usable" stage, the availability improvements that followed, and the flexible, fault‑tolerant architecture that achieved 99.99% API uptime and record‑high transaction throughput.
Stage 1: Usable Phase
When traffic was low, the gateway performed three basic functions: initiating payments, receiving success notifications, and refunding to the original account. The design emphasized speed and simplicity, allowing rapid integration of new third‑party payment channels.
Stage 2: Available Phase
Rapid growth introduced new challenges:
All business logic ran on a single physical deployment unit, causing cross‑impact (e.g., refund failures dragging down payment processing).
Database load increased, leading to occasional instability.
Payment and refund status relied heavily on asynchronous notifications from third‑party channels, making the system vulnerable to external failures.
Solutions applied:
Service Splitting by Business Type: Separate physical units for payment and refund services, prioritizing payment traffic.
Master‑Slave Database Architecture: Added slaves for read scaling, forced master writes for strong consistency, and used Zebra middleware for load balancing and disaster recovery.
Active Status Synchronization: Periodic batch queries complemented asynchronous notifications; an internal API handled manual reconciliation for remaining cases.
After these changes, core service availability exceeded 99.9%.
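The active status synchronization described above can be sketched roughly as follows. This is a minimal illustration, not Meituan's implementation: the Order class, the status constants, the query_channel callback, and the max_attempts limit are all hypothetical names introduced here. The idea is that a periodic batch pass queries the channel for every order still pending, applies results idempotently (so a late notification and a query result cannot conflict), and hands orders that remain unresolved to manual reconciliation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical status values; the article does not name them.
PENDING, SUCCESS, FAILED = "PENDING", "SUCCESS", "FAILED"


@dataclass
class Order:
    order_id: str
    status: str = PENDING
    query_attempts: int = 0


def apply_status(order: Order, new_status: str) -> bool:
    """Idempotently apply a status from a notification or an active query.

    Only a pending order may move to a final state, so duplicate or late
    updates are ignored. Returns True if the order changed state.
    """
    if order.status != PENDING or new_status == PENDING:
        return False
    order.status = new_status
    return True


def sync_pending(
    orders: Iterable[Order],
    query_channel: Callable[[str], str],
    max_attempts: int = 3,
) -> List[Order]:
    """Run one batch pass over pending orders.

    Returns the orders that are still pending after max_attempts queries,
    which would be handed to manual reconciliation.
    """
    needs_manual = []
    for order in orders:
        if order.status != PENDING:
            continue
        order.query_attempts += 1
        try:
            status = query_channel(order.order_id)
        except Exception:
            status = PENDING  # a failed query means "still unknown"
        apply_status(order, status)
        if order.status == PENDING and order.query_attempts >= max_attempts:
            needs_manual.append(order)
    return needs_manual
```

In practice a scheduler would call sync_pending at a fixed interval, and the notification handler would call the same apply_status function so both paths share one state transition rule.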
Stage 3: Flexible Availability Phase
Further growth exposed additional risks:
New team members often chose familiar but sub‑optimal HTTP client libraries, leading to repeated integration bugs.
Even with business‑type isolation, a single payment channel could monopolize the shared RPC connection pool, causing cascading failures.
Third‑party channel failures were invisible to the gateway, harming user experience.
External services could still access the gateway's database, threatening stability (the so‑called “green‑hat” issue).
Refund workflows lacked unified collection, classification, and monitoring, resulting in unresolved cases and customer complaints.
Key engineering responses:
Unified Channel Integration Framework
Abstracted request assembly, execution, response parsing, and retry logic, hiding low‑level HTTP/Socket details. For bank channels requiring a front‑end proxy, Netty‑based connection pools and a simple round‑robin load balancer were added. The framework also provided parsers for binary/XML/JSON, certificate loaders, and cryptographic utilities.
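One way to picture such a framework is a template-method base class that owns execution and retry while each channel supplies only request assembly and response parsing. The sketch below is an assumption-laden illustration: ChannelClient, JsonChannelClient, ChannelError, and the transport callable are hypothetical names, and the real framework (with Netty pools, certificate loading, and cryptography) is far richer.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict


class ChannelError(Exception):
    """Raised when a channel call still fails after all retries."""


class ChannelClient(ABC):
    """Shared call flow: assemble, execute with retry, parse."""

    def __init__(self, transport: Callable[[bytes], bytes], max_retries: int = 2):
        # transport hides the HTTP/socket details; here it is any callable
        # that sends bytes and returns the raw response bytes (hypothetical).
        self._transport = transport
        self._max_retries = max_retries

    @abstractmethod
    def build_request(self, params: Dict[str, Any]) -> bytes:
        """Assemble the channel-specific wire format."""

    @abstractmethod
    def parse_response(self, raw: bytes) -> Dict[str, Any]:
        """Parse the channel-specific response."""

    def call(self, params: Dict[str, Any]) -> Dict[str, Any]:
        payload = self.build_request(params)
        last_error = None
        # One initial attempt plus max_retries retries on transport errors.
        for _ in range(self._max_retries + 1):
            try:
                return self.parse_response(self._transport(payload))
            except (ConnectionError, TimeoutError) as exc:
                last_error = exc
        attempts = self._max_retries + 1
        raise ChannelError(f"channel call failed after {attempts} attempts") from last_error


class JsonChannelClient(ChannelClient):
    """Example channel that speaks JSON; XML or binary would be siblings."""

    def build_request(self, params: Dict[str, Any]) -> bytes:
        return json.dumps(params, sort_keys=True).encode("utf-8")

    def parse_response(self, raw: bytes) -> Dict[str, Any]:
        return json.loads(raw.decode("utf-8"))
```

With this shape, integrating a new channel means writing two small methods rather than choosing an HTTP client. Note that blind retries are only safe for idempotent operations such as status queries; payment initiation would need an idempotency key or a query-before-retry step.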
Fail‑Fast Mechanism for Faulty Channels
Each payment request follows a fail‑fast path defined by channel → payment method → bank. Two switches control rapid failure:
Static on/off switch configured manually.
Dynamic switch based on historical health metrics (total requests, failures, timeouts) and a state machine with three states: closed (all pass), half_open (partial pass), open (all fail).
The mechanism added only 1–5 ms latency (≈1‑2% of total request time) and proved effective during real‑world channel outages.
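The two switches and the three-state machine can be sketched as a small circuit breaker. Everything concrete here is an assumption made for illustration: the class name ChannelBreaker, the thresholds, the probe ratio used in half_open, and the cumulative counters (a production version would more likely use a sliding time window). Failures and timeouts are counted together as unsuccessful requests.

```python
import random
import time
from typing import Dict, Tuple

CLOSED, HALF_OPEN, OPEN = "closed", "half_open", "open"


class ChannelBreaker:
    """Fail-fast switch for one (channel, payment method, bank) path."""

    def __init__(self, failure_ratio=0.5, min_requests=20, open_seconds=30.0,
                 probe_ratio=0.1, probe_successes=5,
                 clock=time.monotonic, rng=random.random):
        self.failure_ratio = failure_ratio      # trip when failed/total reaches this
        self.min_requests = min_requests        # ignore tiny samples
        self.open_seconds = open_seconds        # how long to reject everything
        self.probe_ratio = probe_ratio          # share of traffic let through in half_open
        self.probe_successes = probe_successes  # successful probes needed to close
        self.clock = clock
        self.rng = rng
        self.manual_off = False                 # static switch, set by an operator
        self._enter(CLOSED)

    def _enter(self, state: str) -> None:
        self.state = state
        self.total = 0
        self.failed = 0
        self.changed_at = self.clock()

    def allow(self) -> bool:
        """Decide whether a request may be sent to the channel."""
        if self.manual_off:
            return False
        if self.state == OPEN and self.clock() - self.changed_at >= self.open_seconds:
            self._enter(HALF_OPEN)
        if self.state == OPEN:
            return False
        if self.state == HALF_OPEN:
            return self.rng() < self.probe_ratio
        return True

    def record(self, ok: bool) -> None:
        """Report the outcome of a request (failures and timeouts are not ok)."""
        if self.state == OPEN:
            return  # late results from before the trip are ignored
        self.total += 1
        if not ok:
            self.failed += 1
        if self.state == HALF_OPEN:
            if not ok:
                self._enter(OPEN)
            elif self.total >= self.probe_successes:
                self._enter(CLOSED)
        elif (self.total >= self.min_requests
              and self.failed / self.total >= self.failure_ratio):
            self._enter(OPEN)


_breakers: Dict[Tuple[str, str, str], ChannelBreaker] = {}


def breaker_for(channel: str, method: str, bank: str) -> ChannelBreaker:
    """Return the breaker for one fail-fast path, creating it on first use."""
    key = (channel, method, bank)
    if key not in _breakers:
        _breakers[key] = ChannelBreaker()
    return _breakers[key]
```

A caller would check breaker_for(...).allow() before contacting the channel, fail immediately when it returns False, and call record() with the outcome otherwise. Keying breakers by the full path keeps one failing bank from disabling an otherwise healthy channel.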
End‑to‑End Payment Chain Monitoring
Implemented per‑second metrics for total successful payments and payment success rate, which trigger email/SMS alerts and let operators degrade services manually during traffic spikes.
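As a rough illustration of this kind of monitoring, the sketch below keeps one counter bucket per second and evaluates the success rate over a short trailing window. The class name SuccessRateMonitor, the window length, and the thresholds are assumptions for the example; delivery of the email/SMS alert itself is left to the caller.

```python
from typing import Dict, List, Optional, Tuple


class SuccessRateMonitor:
    """Per-second success/total counters with a trailing-window alert check."""

    def __init__(self, window_seconds: int = 60, min_rate: float = 0.95,
                 min_samples: int = 50):
        self.window_seconds = window_seconds
        self.min_rate = min_rate        # alert when the rate drops below this
        self.min_samples = min_samples  # avoid alerting on tiny samples
        self._buckets: Dict[int, List[int]] = {}  # second -> [successes, total]

    def record(self, ts: float, ok: bool) -> None:
        """Count one payment result at Unix-style timestamp ts."""
        bucket = self._buckets.setdefault(int(ts), [0, 0])
        if ok:
            bucket[0] += 1
        bucket[1] += 1

    def snapshot(self, now: float) -> Tuple[int, int, Optional[float]]:
        """Return (successes, total, rate) for the trailing window."""
        cutoff = int(now) - self.window_seconds
        for sec in [s for s in self._buckets if s <= cutoff]:
            del self._buckets[sec]  # drop buckets that left the window
        successes = sum(b[0] for b in self._buckets.values())
        total = sum(b[1] for b in self._buckets.values())
        rate = successes / total if total else None
        return successes, total, rate

    def should_alert(self, now: float) -> bool:
        """True when enough traffic was seen and the success rate is too low."""
        _, total, rate = self.snapshot(now)
        return total >= self.min_samples and rate < self.min_rate
```

A background task could call should_alert once per second and send the notification when it flips to True. The min_samples guard matters at night or on low-volume channels, where a single failure would otherwise look like a large drop.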
Database Access Isolation
Collaborated with DBAs to revoke direct DB access from non‑gateway services and replace it with controlled APIs, improving stability and capacity planning.
Refund Case Consolidation
Aggregated abnormal refund cases across systems, introduced core refund success‑rate metrics (daily, next‑day, 7‑day), and prepared for a unified refund‑chain optimization.
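The article does not define the three refund metrics, so the sketch below assumes one plausible reading: the share of refund requests that reached success on the same calendar day, by the end of the next day, and within seven days of the request. The function name refund_success_rates and the tuple layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional, Tuple

# (requested_at, succeeded_at); succeeded_at is None while unresolved.
Refund = Tuple[datetime, Optional[datetime]]


def refund_success_rates(
    refunds: Iterable[Refund],
    horizons_days: Tuple[int, ...] = (0, 1, 7),
) -> Dict[int, float]:
    """Return, per horizon in days, the fraction of refunds completed in time.

    Assumed definition: a refund counts for horizon N if it succeeded on or
    before the calendar date N days after the request date.
    """
    items = list(refunds)
    rates: Dict[int, float] = {}
    for days in horizons_days:
        done = 0
        for requested_at, succeeded_at in items:
            if succeeded_at is None:
                continue
            deadline = requested_at.date() + timedelta(days=days)
            if succeeded_at.date() <= deadline:
                done += 1
        rates[days] = done / len(items) if items else 0.0
    return rates
```

Refunds that are missing from the longest horizon are the candidates for the abnormal-case list, which is what makes these rates useful for tracking unresolved cases rather than just reporting averages.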
These practices lifted overall gateway availability to 99.99%, handled peak traffic during major sales events, and achieved record‑high TPS for third‑party payment requests while maintaining stable core APIs.
Experience and Summary
Key takeaways include:
Keep the core principle of “big system, small parts, simple design”.
Rapid detection, recovery, and resolution are essential for long‑term reliability.
Technical excellence alone isn’t enough; disciplined processes and people are critical.
High traffic and concurrency present both challenges and opportunities for engineers.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
