How Meituan Scaled Its Payment Gateway from Simple to Flexible High‑Availability Architecture

This article traces the evolution of Meituan's payment channel gateway: the early "usable" stage, the availability improvements that followed, and the flexible, fault‑tolerant architecture that reached 99.99% gateway availability and record‑high transaction throughput.

ITPUB

Jun 13, 2016

Stage 1: Usable Phase

When traffic was low, the gateway performed three basic functions—initiate payment, receive success notifications, and refund to the original account. The design emphasized speed and simplicity, allowing rapid integration of new third‑party payment channels.

Stage 2: Available Phase

Rapid growth introduced new challenges:

All business logic ran on a single physical deployment unit, causing cross‑impact (e.g., refund failures dragging down payment processing).

Database load increased, leading to occasional instability.

Payment and refund status relied heavily on asynchronous notifications from third‑party channels, making the system vulnerable to external failures.

Solutions applied:

Service Splitting by Business Type: Payment and refund services were moved to separate physical units, with payment traffic given priority.

Master‑Slave Database Architecture: Slaves were added for read scaling, reads that need strong consistency were forced to the master, and the Zebra middleware handled load balancing and disaster recovery.

Active Status Synchronization: Periodic batch queries to the channels complemented asynchronous notifications; an internal API handled manual reconciliation for the remaining cases.
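The active status synchronization described in the list above can be sketched as a batch job that re-queries every order still waiting for a notification. This is a minimal illustration, not Meituan's implementation: `Order`, `sync_pending_orders`, `query_channel`, and the `max_attempts` cut-off are all hypothetical names and parameters chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

PENDING, SUCCESS, FAILED = "PENDING", "SUCCESS", "FAILED"


@dataclass
class Order:
    order_id: str
    status: str = PENDING
    query_attempts: int = 0


def sync_pending_orders(
    orders: Iterable[Order],
    query_channel: Callable[[str], str],
    max_attempts: int = 3,
) -> List[Order]:
    """Run one batch pass over the orders.

    Every order still PENDING is queried at the channel. Orders that stay
    unresolved after max_attempts passes are returned so they can be handed
    to manual reconciliation.
    """
    needs_manual: List[Order] = []
    for order in orders:
        if order.status != PENDING:
            continue  # already settled by a notification or an earlier pass
        order.query_attempts += 1
        try:
            result = query_channel(order.order_id)
        except Exception:
            result = PENDING  # query failed; the status is still unknown
        if result in (SUCCESS, FAILED):
            order.status = result
        elif order.query_attempts >= max_attempts:
            needs_manual.append(order)
    return needs_manual
```

A scheduler would call `sync_pending_orders` at a fixed interval, so a lost or delayed notification only postpones the final status instead of leaving the order stuck.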

After these changes, core service availability exceeded 99.9%.
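The master‑slave split from the solutions above comes down to a routing rule. The sketch below assumes that writes and consistency‑sensitive reads go to the master while other reads rotate across the slaves. It does not show Zebra's actual API; `ReadWriteRouter`, `route`, and `force_master` are hypothetical names.

```python
import itertools
from typing import Sequence


class ReadWriteRouter:
    """Pick a database node name for each statement."""

    def __init__(self, master: str, replicas: Sequence[str]):
        self.master = master
        # Rotate across replicas; None means no replica is configured.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str, force_master: bool = False) -> str:
        # Simplification for the sketch: anything that is not a SELECT
        # is treated as a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if not is_read or force_master or self._replicas is None:
            return self.master
        return next(self._replicas)
```

A caller that has just written a payment record and must read it back immediately would pass `force_master=True`, which avoids reading stale data from a lagging slave.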

Stage 3: Flexible Availability Phase

Further growth exposed additional risks:

New team members often chose familiar but sub‑optimal HTTP client libraries, leading to repeated integration bugs.

Even with business‑type isolation, a single payment channel could monopolize the shared RPC connection pool, causing cascading failures.

Third‑party channel failures were invisible to the gateway, harming user experience.

External services could still access the gateway's database, threatening stability (the so‑called “green‑hat” issue).

Refund workflows lacked unified collection, classification, and monitoring, resulting in unresolved cases and customer complaints.

Key engineering responses:

Unified Channel Integration Framework

Abstracted request assembly, execution, response parsing, and retry logic, hiding low‑level HTTP/Socket details. For bank channels requiring a front‑end proxy, Netty‑based connection pools and a simple round‑robin load balancer were added. The framework also provided parsers for binary/XML/JSON, certificate loaders, and cryptographic utilities.
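One way to picture such a framework is a shared pipeline in which each channel supplies only its own request assembly and response parsing. The sketch below is an assumption about the shape of that abstraction, not the actual Meituan code: `ChannelClient`, `JsonChannel`, and the injected `transport` callable are hypothetical, and the transport stands in for the HTTP/Socket layer the framework hides.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Optional


class ChannelClient(ABC):
    """Shared pipeline: assemble -> execute (with retry) -> parse."""

    def __init__(self, transport: Callable[[bytes], bytes], max_retries: int = 2):
        self._transport = transport      # hides HTTP/socket details
        self._max_retries = max_retries  # assumed to be >= 0

    @abstractmethod
    def assemble(self, params: Dict[str, Any]) -> bytes:
        """Build the channel-specific request payload."""

    @abstractmethod
    def parse(self, raw: bytes) -> Dict[str, Any]:
        """Turn the channel-specific response into a plain dict."""

    def call(self, params: Dict[str, Any]) -> Dict[str, Any]:
        payload = self.assemble(params)
        last_error: Optional[Exception] = None
        # Retry only on transport errors. A real gateway would also need
        # idempotent requests before retrying a payment call.
        for _ in range(self._max_retries + 1):
            try:
                raw = self._transport(payload)
            except ConnectionError as exc:
                last_error = exc
                continue
            return self.parse(raw)
        assert last_error is not None
        raise last_error


class JsonChannel(ChannelClient):
    """Example channel that speaks JSON."""

    def assemble(self, params: Dict[str, Any]) -> bytes:
        return json.dumps(params, sort_keys=True).encode("utf-8")

    def parse(self, raw: bytes) -> Dict[str, Any]:
        return json.loads(raw.decode("utf-8"))
```

With this shape, adding a channel means writing one small subclass, and the choice of HTTP client or connection pool is made once inside the transport rather than by each engineer.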

Fail‑Fast Mechanism for Faulty Channels

Each payment request follows a fail‑fast path defined by channel → payment method → bank. Two switches control rapid failure:

Static on/off switch configured manually.

Dynamic switch based on historical health metrics (total requests, failures, timeouts) and a state machine with three states: closed (all pass), half_open (partial pass), open (all fail).

The mechanism added only 1–5 ms latency (≈1‑2% of total request time) and proved effective during real‑world channel outages.
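The dynamic switch is a circuit‑breaker pattern, and its three states can be sketched as follows. The thresholds, the tumbling window, the cool‑down period, and the names `ChannelBreaker`, `allow`, and `record` are assumptions made for illustration; the article does not give the real trip conditions.

```python
import random
import time

CLOSED, HALF_OPEN, OPEN = "closed", "half_open", "open"


class ChannelBreaker:
    """closed: all pass; half_open: partial pass; open: all fail."""

    def __init__(self, window=20, failure_ratio=0.5, open_seconds=30.0,
                 probe_ratio=0.1, clock=time.monotonic, rng=random.random):
        self.window = window                # requests per evaluation window
        self.failure_ratio = failure_ratio  # trip when bad/total reaches this
        self.open_seconds = open_seconds    # cool-down before probing again
        self.probe_ratio = probe_ratio      # share admitted while half_open
        self._clock = clock
        self._rng = rng
        self.state = CLOSED
        self.total = 0
        self.bad = 0
        self._opened_at = 0.0

    def allow(self) -> bool:
        """Return True if the request may be sent to the channel."""
        if self.state == OPEN:
            if self._clock() - self._opened_at < self.open_seconds:
                return False  # fail fast without touching the channel
            self._enter(HALF_OPEN)  # cool-down elapsed, start probing
        if self.state == HALF_OPEN:
            return self._rng() < self.probe_ratio
        return True

    def record(self, ok: bool) -> None:
        """Report an outcome; ok=False covers failures and timeouts."""
        if self.state == HALF_OPEN:
            self._enter(CLOSED if ok else OPEN)
            return
        if self.state == OPEN:
            return  # late result from a request admitted before the trip
        self.total += 1
        if not ok:
            self.bad += 1
        if self.total >= self.window:
            if self.bad / self.total >= self.failure_ratio:
                self._enter(OPEN)
            else:
                self.total = self.bad = 0  # healthy window, start a new one

    def _enter(self, state: str) -> None:
        self.state = state
        self.total = self.bad = 0
        if state == OPEN:
            self._opened_at = self._clock()
```

Keeping one breaker per (channel, payment method, bank) key would match the fail‑fast path described above, so a failing bank can be cut off without disabling the whole channel. The static switch then reduces to a manual override checked before `allow()`.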

End‑to‑End Payment Chain Monitoring

Implemented per‑second metrics for the number of successful payments and the payment success rate, with email/SMS alerts when they degrade and the option to degrade service manually during traffic spikes.
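A success‑rate metric at one‑second granularity can be kept with simple per‑second buckets. The sketch below is illustrative only: `SuccessRateMonitor`, the alert threshold, and the minimum sample count are assumptions, and a production version would also evict old buckets and push alerts rather than wait to be polled.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class SuccessRateMonitor:
    """Count payment outcomes in one-second buckets."""

    def __init__(self, min_rate: float = 0.95, min_samples: int = 10):
        self.min_rate = min_rate        # alert below this success rate
        self.min_samples = min_samples  # ignore seconds with too little traffic
        # second -> [successes, total]
        self._buckets: Dict[int, List[int]] = defaultdict(lambda: [0, 0])

    def record(self, timestamp: float, ok: bool) -> None:
        bucket = self._buckets[int(timestamp)]
        bucket[1] += 1
        if ok:
            bucket[0] += 1

    def success_rate(self, second: int) -> Optional[float]:
        success, total = self._buckets.get(second, (0, 0))
        return success / total if total else None

    def should_alert(self, second: int) -> bool:
        success, total = self._buckets.get(second, (0, 0))
        return total >= self.min_samples and success / total < self.min_rate
```

The minimum sample count keeps a single failed payment in a quiet second from paging anyone.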

Database Access Isolation

Collaborated with DBAs to revoke direct DB access from non‑gateway services and replace it with controlled APIs, improving stability and capacity planning.

Refund Case Consolidation

Aggregated abnormal refund cases across systems, introduced core refund success‑rate metrics (daily, next‑day, 7‑day), and prepared for a unified refund‑chain optimization.
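The refund success‑rate metrics can be computed from the request and completion dates of each refund. The sketch below assumes the three metrics mean the share of refunds completed on the same day, by the next day, and within seven days of the request; that reading, along with the name `refund_success_rates`, is an assumption rather than something the article states.

```python
from datetime import date
from typing import Dict, Iterable, Optional, Tuple

# Metric name -> maximum number of calendar days from request to completion.
WINDOWS = {"same_day": 0, "next_day": 1, "seven_day": 7}


def refund_success_rates(
    refunds: Iterable[Tuple[date, Optional[date]]],
) -> Dict[str, float]:
    """Each refund is (requested_on, completed_on); completed_on is None
    while the refund is still unresolved."""
    records = list(refunds)
    if not records:
        return {name: 0.0 for name in WINDOWS}
    rates: Dict[str, float] = {}
    for name, max_days in WINDOWS.items():
        done = sum(
            1
            for requested, completed in records
            if completed is not None and (completed - requested).days <= max_days
        )
        rates[name] = done / len(records)
    return rates
```

Refunds that never appear in the seven‑day figure are the unresolved cases that a consolidated view is meant to surface.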

These practices lifted overall gateway availability to 99.99%, handled peak traffic during major sales events, and achieved record‑high TPS for third‑party payment requests while maintaining stable core APIs.

Experience and Summary

Key takeaways include:

Keep the core principle of “big system, small parts, simple design”.

Rapid detection, recovery, and resolution are essential for long‑term reliability.

Technical excellence alone isn’t enough; disciplined processes and people are critical.

High traffic and concurrency present both challenges and opportunities for engineers.

