How Dazhong Dianping Scaled Its Payment Gateway: Backend Architecture and Fail‑Fast Lessons

As its business grew rapidly, Dazhong Dianping’s payment gateway evolved through three stages ("usable", "available", and "flexible available"). Service splitting, master‑slave databases, fail‑fast mechanisms, and comprehensive monitoring raised availability to 99.99% and let the gateway handle peak traffic during major sales events.

ITFLY8 Architecture Home

Rapid business growth demanded a payment system that could iterate quickly while maintaining scalability, availability, and strong data consistency. This article outlines the evolution of Dazhong Dianping’s payment channel gateway, sharing insights and practices from its development.

1. Usable Stage

In the early low‑traffic phase, the gateway handled three simple tasks: initiating payment requests, receiving payment‑success notifications, and processing refunds. The priority was integrating new third‑party channels quickly and with little effort. The architecture diagram (Fig. 1) illustrates this stage.

2. Available Stage

As more third‑party channels were added, three issues emerged: (1) all services ran in a single physical deployment, so a problem in one service affected the others; (2) database load grew and threatened stability; (3) relying solely on asynchronous notifications from channels gave users a poor experience when a channel failed.

To address (1), services were split into multiple physical units. Two strategies were considered: splitting by channel or by business type. Given traffic patterns, the team chose business‑type splitting, separating payment and refund services.

For (2), the team adopted a master‑slave database setup. Additional slaves absorbed read queries, reads that require strong consistency were forced to the master, and the Zebra middleware handled load balancing and disaster recovery.
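The read/write routing rule described here can be sketched in a few lines. This is only an illustration of the idea, not Zebra's actual API: the class name `ReadWriteRouter`, the `force_master` flag, and the host names are all assumptions made for the example.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class ReadWriteRouter:
    """Picks a database node per statement (hypothetical names throughout)."""

    master: str
    slaves: list
    _cycle: object = field(init=False, repr=False)

    def __post_init__(self):
        # Round-robin over the slaves; fall back to the master if there are none.
        self._cycle = itertools.cycle(self.slaves or [self.master])

    def route(self, is_write: bool, force_master: bool = False) -> str:
        # Writes, and reads that must see the latest committed data,
        # always go to the master.
        if is_write or force_master:
            return self.master
        # Ordinary reads are spread across the slaves.
        return next(self._cycle)


router = ReadWriteRouter(master="db-master", slaves=["db-slave-1", "db-slave-2"])
```

A caller would pass `force_master=True` for a read that cannot tolerate replication lag, for example checking an order's status immediately after writing it.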

For (3), the team added proactive status polling and batch synchronization, plus an internal API for manual reconciliation of unsynced cases.
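Proactive status polling might look roughly like the sketch below. The function name, the status constants, the `query_channel` callable, and the delay schedule are assumptions for illustration; the article does not describe the real polling intervals.

```python
import time

PENDING, SUCCESS, FAILED = "PENDING", "SUCCESS", "FAILED"


def poll_payment_status(query_channel, order_id, delays=(1, 2, 4, 8), sleep=time.sleep):
    """Ask the channel for the order status instead of waiting for its callback.

    query_channel is a hypothetical callable: order_id -> status string.
    """
    status = query_channel(order_id)
    for delay in delays:
        if status != PENDING:
            break
        # Wait a little longer before each new query.
        sleep(delay)
        status = query_channel(order_id)
    # A status that is still PENDING here is left to batch synchronization
    # or manual reconciliation.
    return status


# Simulated channel that confirms the payment on the third query.
responses = iter([PENDING, PENDING, SUCCESS])
slept = []
result = poll_payment_status(lambda oid: next(responses), "order-1", sleep=slept.append)
```

Injecting `sleep` keeps the sketch testable without real waiting.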

After these improvements, core service availability exceeded 99.9%. The architecture diagram (Fig. 2) shows the evolved system.

3. Flexible Available Stage

Further growth exposed new challenges: as the team expanded, engineers integrated channels in inconsistent ways; channels interfered with one another through shared RPC connection pools; third‑party channel failures were not visible; external systems accessed the database directly, putting it at risk; and refund monitoring was insufficient.

To mitigate channel integration risks, the team built a unified gateway framework that abstracts request assembly, execution, response parsing, and retry logic. It hides HTTP/Socket details and provides extensibility points. For bank channels that require a front‑end proxy, Netty‑based connection pools and round‑robin load balancing were added. The framework also includes parsers for binary/XML/JSON, keystore/truststore loaders, and encryption/signature utilities. The framework flow diagram (Fig. 3) illustrates this design.
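One way to picture such a framework is a template class in which each channel supplies only request assembly and response parsing, while execution and retry are shared. The sketch below is a simplified assumption of that shape: `ChannelGateway`, `JsonChannel`, the `transport` callable, and the retry count are hypothetical, and the real framework is Netty‑based rather than written in Python.

```python
import json
from abc import ABC, abstractmethod


class ChannelGateway(ABC):
    """Template for one third-party channel (all names are hypothetical)."""

    max_attempts = 3

    def __init__(self, transport):
        # transport: callable taking request bytes and returning response bytes.
        # It stands in for the HTTP/socket layer that the framework hides.
        self.transport = transport

    @abstractmethod
    def assemble(self, order: dict) -> bytes:
        """Build the channel-specific request body."""

    @abstractmethod
    def parse(self, raw: bytes) -> dict:
        """Turn the channel-specific response into a common result."""

    def execute(self, order: dict) -> dict:
        payload = self.assemble(order)
        last_error = None
        for _ in range(self.max_attempts):
            try:
                return self.parse(self.transport(payload))
            except ConnectionError as exc:
                # Assumes the channel call is idempotent, so a retry is safe.
                last_error = exc
        raise last_error


class JsonChannel(ChannelGateway):
    def assemble(self, order):
        return json.dumps({"order_id": order["id"], "amount": order["amount"]}).encode()

    def parse(self, raw):
        return json.loads(raw.decode())


calls = []


def flaky_transport(payload: bytes) -> bytes:
    # Simulated transport: the first call fails, the second succeeds.
    calls.append(payload)
    if len(calls) == 1:
        raise ConnectionError("simulated network failure")
    return b'{"status": "SUCCESS"}'


result = JsonChannel(flaky_transport).execute({"id": "o1", "amount": 100})
```

A new channel only needs its own `assemble` and `parse`, which is the kind of consistency a shared framework is meant to enforce across a growing team.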

To prevent channel interference, a fail‑fast mechanism was introduced. Each payment request follows a fail‑fast path with static (manual on/off) and dynamic (health‑based) circuit breakers. The dynamic breaker tracks request statistics and transitions among closed, half‑open, and open states via a state machine (Fig. 4).
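The combination of a static switch and a dynamic, state‑machine breaker can be sketched as follows. The thresholds, the cooldown, and the class name `ChannelBreaker` are assumptions for illustration. The sketch also simplifies two things a production breaker would handle: statistics are cumulative rather than taken over a sliding window, and the half‑open state does not limit how many probe requests pass.

```python
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"


class ChannelBreaker:
    """Per-channel breaker with a manual switch and a health-based state machine."""

    def __init__(self, failure_threshold=0.5, min_requests=10,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.manual_off = False  # static breaker: operators flip this by hand
        self.state = CLOSED
        self.failure_threshold = failure_threshold
        self.min_requests = min_requests
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.successes = 0
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.manual_off:
            return False
        if self.state == OPEN:
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                self.state = HALF_OPEN  # cooldown over: let probe traffic through
                return True
            return False  # fail fast without touching the channel
        return True

    def record(self, success: bool) -> None:
        if self.state == OPEN:
            return  # ignore late results from requests sent before opening
        if self.state == HALF_OPEN:
            # A probe result decides whether the channel has recovered.
            self._transition(CLOSED if success else OPEN)
            return
        if success:
            self.successes += 1
        else:
            self.failures += 1
        total = self.successes + self.failures
        if total >= self.min_requests and self.failures / total >= self.failure_threshold:
            self._transition(OPEN)

    def _transition(self, new_state: str) -> None:
        self.state = new_state
        self.successes = self.failures = 0
        if new_state == OPEN:
            self.opened_at = self.clock()


# Usage with a fake clock so the cooldown can be simulated.
now = [0.0]
breaker = ChannelBreaker(min_requests=4, cooldown_seconds=30.0, clock=lambda: now[0])
for ok in (True, False, False, False):
    breaker.record(ok)
state_after_failures = breaker.state
```

Because each channel gets its own breaker, a failing channel is rejected immediately instead of holding shared connections while requests time out.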

Fail‑fast adds only 1–5 ms of latency (≈1–2% of total payment time). After tuning, it has run stably in production, and during a recent channel outage it effectively isolated the failure.

For end‑to‑end payment monitoring, the system tracks the per‑second success rate and request count, triggers email/SMS alerts, and allows manual degradation during peaks. This speeds up fault response.
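A minimal sketch of per‑second success‑rate monitoring is shown below. The thresholds, the class name `PaymentMonitor`, and the `alert` callback (standing in for email/SMS delivery) are assumptions, not details from the original system.

```python
from collections import defaultdict


class PaymentMonitor:
    """Counts payments per second and alerts when the success rate drops."""

    def __init__(self, min_success_rate=0.9, min_samples=20, alert=print):
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples
        self.alert = alert  # hypothetical hook for email/SMS delivery
        self.buckets = defaultdict(lambda: [0, 0])  # second -> [successes, total]

    def record(self, timestamp: float, success: bool) -> None:
        bucket = self.buckets[int(timestamp)]
        bucket[0] += int(success)
        bucket[1] += 1

    def check(self, second: int):
        successes, total = self.buckets.get(second, (0, 0))
        if total < self.min_samples:
            return None  # too few samples to judge this second
        rate = successes / total
        if rate < self.min_success_rate:
            self.alert(f"second {second}: success rate {rate:.0%} over {total} payments")
        return rate


# Simulated traffic: 20 payments within one second, 5 of them failing.
alerts = []
monitor = PaymentMonitor(min_success_rate=0.9, min_samples=20, alert=alerts.append)
for i in range(20):
    monitor.record(100.0 + i / 100, success=(i >= 5))
rate = monitor.check(100)
```

The minimum sample count avoids alerting on seconds with very little traffic, where a single failure would dominate the rate.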

Direct database access by external systems was revoked and replaced with API gateways, which improves database stability and makes capacity planning easier.

Refund anomalies are now collected, categorized, and monitored via core metrics (same‑day, next‑day, 7‑day success rates), with further optimization planned.
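The three refund metrics could be computed along the following lines. The sketch assumes that each metric is cumulative (a refund completed on the same day also counts toward the next‑day and 7‑day rates) and that windows are measured in calendar days; the article does not define them precisely. The function name and the sample data are made up for illustration.

```python
from datetime import date


def refund_success_rates(refunds, windows=(0, 1, 7)):
    """Share of refunds completed within each window, in calendar days.

    refunds: list of (requested, completed) date pairs;
    completed is None while the refund is still outstanding.
    """
    total = len(refunds)
    rates = {}
    for days in windows:
        done = sum(
            1
            for requested, completed in refunds
            if completed is not None and (completed - requested).days <= days
        )
        rates[days] = done / total if total else 0.0
    return rates


# Made-up sample: four refunds requested on the same day.
sample = [
    (date(2024, 1, 1), date(2024, 1, 1)),  # completed the same day
    (date(2024, 1, 1), date(2024, 1, 2)),  # completed the next day
    (date(2024, 1, 1), date(2024, 1, 5)),  # completed within 7 days
    (date(2024, 1, 1), None),              # still outstanding
]
rates = refund_success_rates(sample)
```

Here `rates[0]`, `rates[1]`, and `rates[7]` correspond to the same‑day, next‑day, and 7‑day success rates.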

Overall, the gateway’s availability rose to 99.99%, handling record‑high TPS during the 917 promotion and maintaining stability even when some third‑party channels failed, thanks to automatic detection, recovery, and coordinated channel switching.

4. Lessons Learned

Maintain the core principle: split and decouple, keeping large systems small and simple.

Expect issues; focus on rapid detection, recovery, and resolution.

High availability depends not only on technology but also on disciplined engineering practices.

High traffic and concurrency present both challenges and opportunities for engineers.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

ITFLY8 Architecture Home
