How to Build a High‑Availability Payment System with Smart Routing
This article explains how a fintech payment platform achieves high availability and optimal channel selection by using decision‑tree routing, sliding‑window negative‑feedback, pressure‑detection services, and component fallback strategies such as RabbitMQ with Redis, supporting millions of daily transactions.
Introduction
In internet finance, the payment system connects financial companies with payment channels. It therefore needs rich routing strategies, efficient processing, and high availability to deliver satisfactory financial services.
Our system has integrated over twenty payment channels and more than thirty payment products, supporting various online payment methods to meet diverse user needs.
The core challenges include quickly selecting the optimal channel for each request, balancing success rate and cost, ensuring system availability, and mitigating the impact of channel anomalies.
Our solution evaluates channel availability, response latency, fees, and success rates, automatically reducing the weight of unavailable channels to select the most suitable one.
Characteristics of the Payment System
Modern payment systems must consider real‑time channel availability, stability, and business constraints, not just cost and success rate, to provide better service experiences.
We incorporate abnormal scenarios into routing strategies, supporting tens of millions of daily repayments.
How does the system handle channel anomalies and system pressure at massive transaction scales? The following sections illustrate key functionalities.
High Availability Guarantees
When a third‑party channel fails, the usual responses are degradation and circuit breaking, but these take the channel out of service and can hurt user experience. Instead, we adjust channel routing weights with a sliding‑window algorithm that tracks each channel's anomaly ratio and anomaly count. The result is fed back to the routing engine as negative feedback, lowering the weight of a problematic channel without fully disabling it.
1. Precise and Efficient Payment Routing
We designed a multi‑scenario routing algorithm based on a channel decision‑tree. The algorithm builds a tree model from conditions specific to the payment method, then traverses the tree to sort candidate channels.
The decision‑tree offers flexible node creation and zero‑code configuration for new strategies, and sorting n candidate channels costs O(n log n), which keeps routing fast.
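The traversal-then-sort idea can be sketched as follows. This is a minimal illustration, not the platform's actual model: the types `PayRequest`, `Channel`, `DecisionNode`, and `ChannelRouter`, and all of their fields, are hypothetical. It assumes each tree node pairs a condition on the request with a channel ordering, and that hard constraints (bank support, limits) are applied before sorting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// All names below are hypothetical; they are not taken from the original system.
record PayRequest(String method, String bank, long amountCents) {}

record Channel(String name, Set<String> banks, long limitCents, double feeRate, double healthScore) {}

// A tree node: a condition on the request plus the channel ordering to use when it matches.
class DecisionNode {
    private final Predicate<PayRequest> condition;
    private final Comparator<Channel> ordering;
    private final List<DecisionNode> children = new ArrayList<>();

    DecisionNode(Predicate<PayRequest> condition, Comparator<Channel> ordering) {
        this.condition = condition;
        this.ordering = ordering;
    }

    DecisionNode add(DecisionNode child) {
        children.add(child);
        return this;
    }

    // Walk down to the deepest node whose condition matches and return its ordering.
    Comparator<Channel> resolve(PayRequest request) {
        for (DecisionNode child : children) {
            if (child.condition.test(request)) {
                return child.resolve(request);
            }
        }
        return ordering;
    }
}

class ChannelRouter {
    // Apply hard constraints first, then sort the surviving candidates (O(n log n)).
    static List<Channel> route(DecisionNode root, PayRequest request, List<Channel> candidates) {
        Comparator<Channel> ordering = root.resolve(request);
        return candidates.stream()
                .filter(c -> c.banks().contains(request.bank()))
                .filter(c -> c.limitCents() >= request.amountCents())
                .sorted(ordering)
                .toList();
    }
}
```

In this sketch a new strategy is just another node added to the tree, which is one way the "zero-code configuration" property could be realized if nodes were loaded from configuration rather than built in code.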
The routing strategy includes manually configured channel constraints (e.g., bank support, limits) and automatic weight adjustments based on channel health. The negative‑feedback algorithm, inspired by Sentinel’s sliding‑window rate limiter, divides a time window into equal slices, each with its own counter, enabling precise anomaly statistics.
For example, a 10‑second window (a LeapArray, in Sentinel’s terms) is divided into 5 slices of 2 seconds each, and each slice records total requests and error counts per channel, payment method, and settlement entity. During routing, the engine looks up the current slice ID, aggregates the slices still inside the window, and compares the metrics against thresholds.
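A simplified version of such a sliced counter is sketched below. It follows the general idea of a LeapArray but is not Sentinel's implementation; the class name `SlidingWindowStats` and its methods are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the window length is assumed to be an exact multiple of the slice count.

```java
import java.util.Arrays;

// Simplified sliding-window counter in the spirit of Sentinel's LeapArray.
// Illustrative sketch only; class and method names are hypothetical.
class SlidingWindowStats {
    private final long windowMillis;
    private final long sliceMillis;
    private final long[] sliceStart; // start time of the slice held in each slot, -1 if unused
    private final long[] total;
    private final long[] errors;

    SlidingWindowStats(long windowMillis, int sliceCount) {
        this.windowMillis = windowMillis;
        this.sliceMillis = windowMillis / sliceCount;
        this.sliceStart = new long[sliceCount];
        this.total = new long[sliceCount];
        this.errors = new long[sliceCount];
        Arrays.fill(sliceStart, -1L);
    }

    // Map a timestamp to its slot, resetting the slot if it still holds an older slice.
    private int slotFor(long nowMillis) {
        int slot = (int) ((nowMillis / sliceMillis) % sliceStart.length);
        long start = nowMillis - nowMillis % sliceMillis;
        if (sliceStart[slot] != start) {
            sliceStart[slot] = start;
            total[slot] = 0;
            errors[slot] = 0;
        }
        return slot;
    }

    synchronized void record(long nowMillis, boolean isError) {
        int slot = slotFor(nowMillis);
        total[slot]++;
        if (isError) {
            errors[slot]++;
        }
    }

    synchronized long totalInWindow(long nowMillis) {
        return sum(total, nowMillis);
    }

    synchronized long errorsInWindow(long nowMillis) {
        return sum(errors, nowMillis);
    }

    // Add up only the slices that started within the last window.
    private long sum(long[] counters, long nowMillis) {
        long result = 0;
        for (int i = 0; i < counters.length; i++) {
            if (sliceStart[i] >= 0 && nowMillis - sliceStart[i] < windowMillis) {
                result += counters[i];
            }
        }
        return result;
    }
}
```

With `new SlidingWindowStats(10_000L, 5)` this gives the 10-second window with five 2-second slices from the example. One instance would be kept per statistics key (channel, payment method, settlement entity), for instance in a concurrent map; that keying scheme is an assumption here.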
Two threshold types are used: absolute error count (e.g., >10 errors in the window) and error proportion (e.g., >10% errors). The final weight score is calculated as 10 − (10 × n / m), where n is error count and m is total requests; higher scores give higher channel priority.
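The threshold check and the score formula can be written out as a small helper. The text does not say how the thresholds and the score interact, so this sketch makes an explicit assumption: a channel keeps the full score of 10 until one of the two thresholds is breached, and only then is penalized by 10 − (10 × n / m). The class name `ChannelScore` and the handling of an empty window are also assumptions.

```java
// Hypothetical helper; the threshold values are the examples quoted in the text.
class ChannelScore {
    static final long MAX_ERRORS = 10;           // absolute error count threshold
    static final double MAX_ERROR_RATIO = 0.10;  // error proportion threshold

    // True when either threshold is exceeded within the window.
    static boolean breached(long errors, long total) {
        if (total == 0) {
            return false; // assumption: no traffic means no evidence of a problem
        }
        return errors > MAX_ERRORS || (double) errors / total > MAX_ERROR_RATIO;
    }

    // score = 10 - (10 * n / m), applied only after a threshold breach (assumption).
    static double score(long errors, long total) {
        if (!breached(errors, total)) {
            return 10.0;
        }
        return 10.0 - 10.0 * errors / total;
    }
}
```

For example, 20 errors out of 100 requests breaches both thresholds and yields 10 − (10 × 20 / 100) = 8, while 5 errors out of 100 breaches neither and keeps the score at 10 under this assumption.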
2. System Adaptive Pressure Regulation
We built a pressure‑detection service that monitors runtime pressure of machines and components (JVM, CPU, MySQL connections, RabbitMQ queue depth, etc.) and computes a composite pressure index.
When the index exceeds configured thresholds, the payment system automatically applies rate‑limiting and reports the index to callers, reducing load at the source. Persistent high pressure triggers alerts for manual investigation.
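The article does not describe how the composite index is computed, so the following is only one plausible shape: a weighted average of per-component utilization values normalized to the range 0 to 1, with the permitted request rate scaled down linearly once the index passes a threshold. The class `PressureIndex`, the metric names, the weights, and the linear scaling rule are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical composite pressure index: a weighted average of normalized metrics.
class PressureIndex {
    private final Map<String, Double> weights = new LinkedHashMap<>();

    PressureIndex weight(String metric, double weight) {
        weights.put(metric, weight);
        return this;
    }

    // utilization maps metric name to a value in [0, 1]; missing metrics count as 0.
    double compute(Map<String, Double> utilization) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            double value = utilization.getOrDefault(entry.getKey(), 0.0);
            double clamped = Math.min(1.0, Math.max(0.0, value));
            weighted += entry.getValue() * clamped;
            totalWeight += entry.getValue();
        }
        return totalWeight == 0.0 ? 0.0 : weighted / totalWeight;
    }

    // Above the threshold (which must be below 1.0), shrink the permitted rate
    // linearly, never dropping below 1 request per second.
    static int allowedQps(int baseQps, double index, double threshold) {
        if (index <= threshold) {
            return baseQps;
        }
        double overload = (index - threshold) / (1.0 - threshold);
        return (int) Math.max(1, Math.round(baseQps * (1.0 - overload)));
    }
}
```

A caller that receives the reported index could use the same `allowedQps` rule on its side to slow down before requests are rejected, which is the "reducing load at the source" behavior the text describes.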
3. Component Degradation to Improve High Availability
We use RabbitMQ as the asynchronous messaging backbone for three reasons:
Financial stability requires message persistence, which RabbitMQ supports.
Business complexity demands delayed‑queue features.
It is a mature product with solid overall performance and stability.
When RabbitMQ fails, we fall back to Redis: messages for normal queues are stored in a Redis list, and messages for delayed queues in a sorted set. After multiple RabbitMQ retries fail, the message body is saved in Redis and consumed from there. The fallback is designed for fast failure detection, dynamic rate adjustment, bounded memory use, and idempotent consumption.
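The retry-then-fallback flow can be sketched without any broker or Redis client. In the sketch below, `Publisher` is a hypothetical abstraction over the RabbitMQ producer, and two in-memory collections stand in for the Redis structures: a deque for the list (as with LPUSH and RPOP) and a priority queue ordered by due time for the sorted set (as with ZADD using the due timestamp as score). Idempotent consumption and rate adjustment are not covered.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical abstraction over the broker producer; not the RabbitMQ client API.
interface Publisher {
    void publish(String message, long delayMillis) throws Exception;
}

class FallbackSender {
    record Delayed(long dueAtMillis, String message) {}

    private final Publisher primary;
    private final int maxAttempts;
    // In-memory stand-in for a Redis list holding normal messages.
    final Deque<String> normalQueue = new ArrayDeque<>();
    // In-memory stand-in for a Redis sorted set scored by due time.
    final PriorityQueue<Delayed> delayedQueue =
            new PriorityQueue<>(Comparator.comparingLong(Delayed::dueAtMillis));

    FallbackSender(Publisher primary, int maxAttempts) {
        this.primary = primary;
        this.maxAttempts = maxAttempts;
    }

    // Returns true if the broker accepted the message, false if it went to the fallback store.
    boolean send(String message, long delayMillis, long nowMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                primary.publish(message, delayMillis);
                return true;
            } catch (Exception e) {
                // Broker unavailable: retry until the attempts are used up.
            }
        }
        if (delayMillis > 0) {
            delayedQueue.add(new Delayed(nowMillis + delayMillis, message));
        } else {
            normalQueue.addFirst(message);
        }
        return false;
    }

    // Remove and return delayed messages whose due time has passed.
    List<String> pollDue(long nowMillis) {
        List<String> due = new ArrayList<>();
        while (!delayedQueue.isEmpty() && delayedQueue.peek().dueAtMillis() <= nowMillis) {
            due.add(delayedQueue.poll().message());
        }
        return due;
    }
}
```

Scoring delayed messages by their due timestamp is what makes a sorted set a natural fit for delayed queues: a consumer only has to read the entries whose score is at or below the current time.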
Most core queues now support automatic failover to Redis within three seconds, enhancing overall availability.
Current Status
The system reliably handles over 15 million daily payment requests, with peak concurrency exceeding 2,000. Horizontal scaling, automatic pressure regulation, and proactive degradation can increase capacity by 5‑10× under traffic spikes.
Through negative‑feedback routing, pressure detection, and component degradation, the system automatically lowers the weight of abnormal channels and applies rate‑limiting under load. This has delivered over 99.9% availability, a 30% increase in payment success rate, and the prevention of millions of failed repayments.
