Improving Application Availability: Practices, Monitoring, and Fault‑Tolerance in a Large‑Scale Payment System
The article describes how a high‑traffic payment platform achieves 99.999% availability by avoiding single points of failure, applying fail‑fast principles, implementing resource limits, building real‑time monitoring and alerting, and automating fault detection, routing, and recovery to ensure continuous 7×24 operation.
Background
Internet and enterprise applications often require uninterrupted 7×24 service, with availability targets ranging from three nines (99.9%) to five nines (99.999%). For a payment platform whose feature set and data volume grow continuously, maintaining high availability is challenging.
The platform, referred to as "Fuqianla", aims for 99.999% availability when external dependencies (network, third‑party payment gateways, banks) are stable.
Problems
Typical incidents observed include:
A new developer forgot to set a timeout for a third‑party channel call, blocking the message queue behind it.
A newly added module, deployed across multiple environments with two nodes each, exhausted the database connection pool.
Third‑party timeouts tied up all worker threads, leaving none for other transactions.
A traffic spike tripped a third party's DDoS protection, taking several channels offline at once.
A 32‑bit sequence‑number field overflowed under high transaction volume.
These issues are common yet easy to overlook, and they must be prevented systematically.
Solution
3.1 Avoid Failures
3.1.1 Design Fault‑Tolerant Systems
Implement dynamic routing so that if one payment channel fails, the request is automatically rerouted to another channel, ensuring user‑perceived success.
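This rerouting can be sketched as a priority-ordered list of channels, filtered by health and tried until one accepts the order. The `Channel` interface and its methods below are assumptions for illustration, not the platform's real API:

```java
import java.util.List;
import java.util.Optional;

// Minimal sketch of channel failover routing: skip channels flagged
// unhealthy by monitoring, then try the rest in priority order.
public class ChannelRouter {
    public interface Channel {
        String name();
        boolean isHealthy();          // fed by the monitoring system
        boolean submit(String order); // true = accepted by the third party
    }

    private final List<Channel> channelsByPriority;

    public ChannelRouter(List<Channel> channelsByPriority) {
        this.channelsByPriority = channelsByPriority;
    }

    /** Returns the channel that accepted the order, or empty if all failed. */
    public Optional<Channel> route(String order) {
        for (Channel c : channelsByPriority) {
            if (!c.isHealthy()) continue;           // reroute around failures
            if (c.submit(order)) return Optional.of(c);
        }
        return Optional.empty();
    }
}
```

From the user's point of view, a failed channel is invisible as long as any healthy channel accepts the order.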
Provide OOM protection similar to Tomcat's by reserving memory for the application in advance and catching OutOfMemoryError.
3.1.2 Apply the Fail‑Fast Principle
Terminate the main flow immediately when a critical error occurs, rather than allowing downstream impact.
If loading queue configuration fails at startup, the JVM exits.
If a transaction exceeds the 40‑second response window, the front‑end stops waiting and notifies the merchant.
If a Redis call takes longer than 50 ms, the operation is abandoned to keep latency within acceptable bounds.
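The 50 ms budget above can be approximated with a bounded wait on a `Future`: run the lookup on a worker thread, and if it overruns the budget, abandon it and fall back so the main flow's latency stays bounded. The helper below is a minimal sketch (names and thresholds assumed), not the platform's actual implementation:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a fail-fast lookup with a hard time budget.
public class BoundedLookup {
    // Daemon threads so an abandoned lookup cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    public static String getWithBudget(Callable<String> lookup, long budgetMs, String fallback) {
        Future<String> f = POOL.submit(lookup);
        try {
            return f.get(budgetMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);  // fail fast: stop waiting on the slow call
            return fallback;
        } catch (Exception e) {
            return fallback; // lookup itself failed: degrade, don't block
        }
    }
}
```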
3.1.3 Build Self‑Protecting Systems
Isolate third‑party dependencies (databases, external APIs) to prevent cascading failures.
Examples include:
Splitting message queues per business, merchant, and payment type to avoid cross‑impact.
Limiting resource usage: connection pools, memory consumption, thread creation, and concurrency per third‑party limits.
Using thread pools instead of unbounded thread creation.
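A bounded pool of this kind can be built with `ThreadPoolExecutor` directly: cap both the worker count and the queue, and reject overflow immediately rather than queueing without limit. This is a generic sketch, not the platform's actual configuration:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: cap threads AND queued work so one slow dependency cannot
// exhaust the process; excess submissions fail fast instead of piling up.
public class BoundedPool {
    public static ThreadPoolExecutor create(int workers, int queueCap) {
        return new ThreadPoolExecutor(
                workers, workers,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCap),    // bounded queue
                new ThreadPoolExecutor.AbortPolicy()); // reject when full
    }
}
```

Rejected submissions surface as `RejectedExecutionException`, which the caller can translate into a degraded response instead of letting load accumulate unseen.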
3.2 Detect Failures Quickly
3.2.1 Real‑Time Alerting
Alerting must be second‑level, cover all business functions, provide severity levels, and support push (SMS, email) and pull (dashboard) delivery.
3.2.2 Data‑Point Collection
Each module writes key metrics to Redis; a central analysis service aggregates these points and triggers alerts without affecting transaction latency.
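An in-memory stand-in for this pattern is sketched below: modules bump named counters off the critical path, and the analysis service reads aggregated snapshots on its own schedule. In production the shared store would be Redis rather than a local map; the class and method names here are assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch of data-point collection: recording a metric is a cheap,
// contention-friendly counter bump, so it never slows transactions.
public class MetricPoints {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Called by business modules on the hot path. */
    public void record(String metric) {
        counters.computeIfAbsent(metric, k -> new LongAdder()).increment();
    }

    /** Called periodically by the analysis service. */
    public long snapshot(String metric) {
        LongAdder a = counters.get(metric);
        return a == null ? 0 : a.sum();
    }
}
```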
3.2.3 Analysis System
The analysis system processes real‑time data, identifies critical alarm points (e.g., network anomalies, order timeouts, transaction success rates), and distinguishes between alarm‑triggering and monitoring‑only events.
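The alarm-versus-monitoring split can be illustrated with a simple threshold check on a window of transaction outcomes (the class, levels, and threshold below are illustrative assumptions, not the platform's rules):

```java
// Sketch: classify a metric window as ALARM (page someone) or
// MONITOR (record only) against a per-metric success-rate floor.
public class SuccessRateCheck {
    public enum Level { ALARM, MONITOR }

    public static Level classify(long success, long total, double minRate) {
        if (total == 0) return Level.MONITOR;  // no traffic: nothing to page about
        double rate = (double) success / total;
        return rate < minRate ? Level.ALARM : Level.MONITOR;
    }
}
```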
3.2.4 Non‑Business Monitoring
Operational metrics such as JVM GC, heap usage, thread stacks, network traffic, host health, storage I/O, and middleware status are collected via agents, Zabbix, and rsyslog.
3.2.5 Log Recording & Analysis
Every transaction generates ~30 log lines; logs are aggregated with rsyslog, parsed by the analysis system, stored in a database, and visualized for operators. A sample log line (fields delimited by `||`, sensitive values masked with `X`):

```
2016-07-22 18:15:00.512||pool-73-thread-4||Channel adapter||Channel adapter - after third-party send||CEX16XXXXXXX5751||16201XXXX337||||||04||9000||[Settlement platform message] Processing||0000105||98XX543210||GHT||03||11||2016-07-22 18:15:00.512||张张||||01||tunnelQuery||true||||Pending||||10.100.140.101||8cff785d-0d01-4ed4-b771-cb0b1faa7f95||10.999.140.101||O001||||0.01||||||||http://10.100.444.59:8080/regression/notice||240||2016-07-20 19:06:13.000xxxxxxx
```

Log visualization shows the full order trace and allows download of raw request/response data.
3.2.6 24/7 Monitoring Room
Operators receive alerts via SMS/email and dashboards; a dedicated monitoring center operates around the clock to ensure system stability.
3.3 Respond to Failures Promptly
3.3.1 Automatic Recovery
When a third‑party becomes unstable, the system automatically reroutes traffic to healthy channels.
3.3.2 Service Degradation
If a merchant’s traffic spikes, the platform throttles that merchant’s requests to protect overall service, while keeping core functions available.
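Per-merchant throttling of this sort can be sketched as a fixed-window counter: each merchant gets a request budget per window, over-budget requests are rejected, and a timer resets the window. The names and the windowing scheme are assumptions for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of service degradation: cap each merchant's requests per window
// so one merchant's spike cannot starve the rest of the platform.
public class MerchantThrottle {
    private final int limitPerWindow;
    private final Map<String, Integer> used = new HashMap<>();

    public MerchantThrottle(int limitPerWindow) {
        this.limitPerWindow = limitPerWindow;
    }

    /** Returns true if the request is admitted within the current window. */
    public synchronized boolean tryAcquire(String merchantId) {
        int n = used.getOrDefault(merchantId, 0);
        if (n >= limitPerWindow) return false;  // degrade: reject this merchant
        used.put(merchantId, n + 1);
        return true;
    }

    /** Called by a scheduler at each window boundary. */
    public synchronized void resetWindow() {
        used.clear();
    }
}
```

A production limiter would typically use a sliding window or token bucket, but the isolation property is the same: the cap applies per merchant, so other merchants are unaffected.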
Q&A
Audience questions covered a RabbitMQ hardware failure, the separation of development and operations, language choices (mostly Java), handling of third‑party dependencies, payment‑timeout handling, routing logic, the automatic repair workflow, traffic management during promotions, log storage strategy, and the relationship between system and performance monitoring.
Conclusion
By combining fault‑avoidance design, fail‑fast handling, resource limiting, real‑time monitoring, automated alerting, and rapid remediation, the payment platform achieves near‑five‑nine availability while supporting massive transaction volumes.