How We Built an Automated Payment Channel Management System with Redis and Prometheus
To handle growing payment traffic and unreliable third‑party gateways, the team at Zhuanzhuan designed an automated payment‑channel management platform that uses a custom Redis‑based time‑series store, Prometheus monitoring, and a sliding‑window failure‑rate algorithm to detect, alert, and eventually auto‑switch faulty channels.
Background
As business volume increased, Zhuanzhuan integrated many payment channels, but third‑party stability varied, causing frequent failures that were only detected after alerts or user complaints. Relying solely on manual maintenance was insufficient for a core payment system, prompting the need for an automated payment‑channel management solution.
Design Goals
Monitor multiple channels and entities simultaneously.
Rapid fault detection and root‑cause localization.
Minimize false positives and false negatives.
Enable automatic channel failover.
Technology Selection
The team evaluated existing circuit‑breaker libraries (e.g., Hystrix) and found them unsuitable because they operate at the interface level and cannot handle channel‑ or merchant‑level degradation, nor allow custom traffic probing during failover.
For time-series storage, the team compared popular databases and narrowed the choice to Prometheus or a custom Redis-based solution. Prometheus is simple and reliable, but it does not guarantee exact data, which is unacceptable for a decision as sensitive as channel switching. Redis is already familiar to Java backend developers, so a Redis-based store carries lower learning and maintenance costs.
Architecture Design
Payment requests first pass through a routing layer that selects usable channels. After a channel is chosen, the gateway processes the order and reports the third‑party response via MQ to the monitoring system.
The monitoring system stores incoming data in Redis, then a calculation module aggregates failure rates per channel and triggers alerts based on configured rules. Redis data is periodically backed up to MySQL, and stale entries are cleaned to control size.
For visual inspection, aggregated metrics are also pushed to Prometheus and displayed on Grafana dashboards.
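The cleanup of stale entries mentioned above can be pictured with a small sketch. It is not the team's implementation: plain Python containers stand in for Redis, and the function name `prune_item` and the one-hour retention are assumptions made for illustration. Against a real Redis instance the same idea would roughly map to a ZRANGEBYSCORE to find old timestamps, a DEL on the matching hashes, and a ZREMRANGEBYSCORE on the timestamp index.

```python
def prune_item(time_store, field_store, item, now, retention=3600):
    """Drop per-second entries older than `retention` seconds for one item.

    time_store:  {item: [timestamps]}        (stands in for the zset)
    field_store: {(item, timestamp): counts} (stands in for the hashes)
    The retention value is a hypothetical default, not taken from the article.
    """
    cutoff = now - retention
    timestamps = time_store.get(item, [])
    stale = [ts for ts in timestamps if ts < cutoff]
    for ts in stale:
        # Remove the per-second counters first, then the index entry.
        field_store.pop((item, ts), None)
    time_store[item] = [ts for ts in timestamps if ts >= cutoff]
    return len(stale)


# Usage: two old seconds are removed, the recent one survives.
time_store = {"a": [100, 200, 5000]}
field_store = {("a", 100): {"fail": 1}, ("a", 200): {"success": 3},
               ("a", 5000): {"success": 7}}
removed = prune_item(time_store, field_store, "a", now=5000)
```

Deleting the counters before the index entry means an interrupted run leaves, at worst, timestamps without counters, and the next run can still find and remove them.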
Implementation Details
1. Data Structure
The Redis schema mimics time‑series concepts:
tags (set) : stores monitored dimensions, e.g., merchant IDs.
times (zset) : stores timestamps (seconds) per merchant, enabling range queries ordered by score.
fields (hash) : stores per‑second success/failure counts and error reasons.
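To make the three structures concrete, here is a minimal sketch of how one reported payment result might be written into them. It uses the key layout from the article's storage example. The `InMemoryStore` class is a hypothetical stand-in that mimics three redis-py style calls (`sadd`, `zadd`, `hincrby`) so the example runs without a server, and `record_result` is an illustrative name, not the team's code.

```python
from collections import defaultdict

PREFIX = "routeAlarm"  # key prefix taken from the article's storage example


class InMemoryStore:
    """Hypothetical stand-in exposing the three redis-py style calls used below."""

    def __init__(self):
        self.sets = defaultdict(set)
        self.zsets = defaultdict(dict)  # key -> {member: score}
        self.hashes = defaultdict(lambda: defaultdict(int))

    def sadd(self, key, *members):
        self.sets[key].update(members)

    def zadd(self, key, mapping):
        self.zsets[key].update(mapping)

    def hincrby(self, key, field, amount=1):
        self.hashes[key][field] += amount
        return self.hashes[key][field]


def record_result(store, item, ts, success, reason=None):
    """Record one third-party response for a monitored item at second `ts`."""
    # tags: register the monitored dimension.
    store.sadd(f"{PREFIX}:alarmitems", item)
    # times: index the second so it can be found by score range later.
    store.zadd(f"{PREFIX}:alarmitem:timeStore:{item}", {str(ts): ts})
    # fields: bump the per-second counters, plus the error reason on failure.
    field_key = f"{PREFIX}:alarmitem:fieldStore:{item}:{ts}"
    store.hincrby(field_key, "success" if success else "fail")
    if not success and reason:
        store.hincrby(field_key, reason)
    return field_key


store = InMemoryStore()
item = "微信-打款-100000111"
record_result(store, item, 1657164225, True)
key = record_result(store, item, 1657164225, False, "balance_not_enough")
```

Because each call only adds to a set, upserts a sorted-set member, or increments a hash field, repeated reports for the same second accumulate rather than overwrite one another.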
Example storage layout (the item name 微信-打款-100000111 reads as WeChat, payout, and what appears to be a merchant ID):
1. set
   key: routeAlarm:alarmitems
   value: 微信-打款-100000111
          微信-打款-100000112
          ...
2. zset
   key: routeAlarm:alarmitem:timeStore:微信-打款-100000111
   score/value: 1657164225, 1657164226, ...
3. hash
   key: routeAlarm:alarmitem:fieldStore:微信-打款-100000111:1657164225
   success: 10
   fail: 5
   balance_not_enough: 3
   third_error: 2
   ...
2. Core Algorithm
A hybrid of local counting and a sliding window is used. Each second records success and failure counts; the algorithm computes aggregate counts over a configurable window (e.g., 1 minute with a 10‑second sampling interval) to derive per‑channel failure rates.
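Monitoring frequency and window size both affect accuracy. A window that is too short contains too few samples, so a handful of failures can trigger a false positive. A window that is too long, or a check that runs too rarely, can dilute or miss short spikes, leading to false negatives.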
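The windowed aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the production algorithm: per-second counts are held in a plain dict instead of Redis, the function names are hypothetical, and the 10% alert threshold is invented for the example. The 60-second window and 10-second step mirror the example values given in the text.

```python
def window_failure_rate(per_second, now, window=60):
    """Sum per-second (success, fail) counts over the last `window` seconds.

    Returns (total_requests, failure_rate) for the seconds in
    [now - window + 1, now].
    """
    success = fail = 0
    for ts in range(now - window + 1, now + 1):
        s, f = per_second.get(ts, (0, 0))
        success += s
        fail += f
    total = success + fail
    rate = fail / total if total else 0.0
    return total, rate


def scan(per_second, start, end, window=60, step=10, threshold=0.1):
    """Slide the window forward every `step` seconds and collect breaches.

    `threshold` is a hypothetical alert rule, not a value from the article.
    """
    breaches = []
    for now in range(start, end + 1, step):
        total, rate = window_failure_rate(per_second, now, window)
        if total and rate >= threshold:
            breaches.append((now, rate))
    return breaches


# 60 healthy seconds followed by 10 seconds in which every request fails.
per_second = {ts: (10, 0) for ts in range(1000, 1060)}
per_second.update({ts: (0, 10) for ts in range(1060, 1070)})
breaches = scan(per_second, start=1059, end=1069)
```

In this data the check at second 1059 sees no failures, while the check at 1069 sees 100 failures out of 600 requests (about 16.7%) and is reported as a breach.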
3. Handling Low‑Traffic Channels
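For channels with very few transactions, a single failure inside the window can push the failure rate to an extreme value. Instead of alerting immediately, the system enlarges the window (doubling it, up to ten times the original size) to gather more samples before deciding whether to raise an alert. This avoids alarms based on tiny samples while ensuring that rare but critical failures are not ignored.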
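One way to read the window-expansion rule is sketched below. The details are assumptions rather than the team's implementation: the function names, the `min_samples` cutoff that decides when a window is "too thin", and the 50% threshold are all hypothetical, and "doubling up to ten-fold" is interpreted here as doubling the window until it is capped at ten times the base size.

```python
def count_window(per_second, now, window):
    """Return (success, fail) totals for the last `window` seconds."""
    success = fail = 0
    for ts in range(now - window + 1, now + 1):
        s, f = per_second.get(ts, (0, 0))
        success += s
        fail += f
    return success, fail


def should_alert(per_second, now, base_window=60, min_samples=20,
                 threshold=0.5, max_factor=10):
    """Decide whether to alert, widening the window when traffic is sparse.

    Returns (alert, window_used). `min_samples` and `threshold` are
    hypothetical tuning knobs, not values from the article.
    """
    window = base_window
    limit = base_window * max_factor
    while True:
        success, fail = count_window(per_second, now, window)
        total = success + fail
        if total >= min_samples or window >= limit:
            break
        # Too few samples: double the window, capped at ten times the base.
        window = min(window * 2, limit)
    if total == 0:
        return False, window
    return fail / total >= threshold, window


now = 2000
# A channel whose only request in the last ten minutes failed.
quiet = {now: (0, 1)}
# The same failure, but with 30 earlier successes just outside the base window.
busy = {now: (0, 1)}
busy.update({ts: (1, 0) for ts in range(1900, 1930)})

quiet_result = should_alert(quiet, now)
busy_result = should_alert(busy, now)
```

With these inputs the quiet channel still alerts once the window has reached its cap, because its only request failed, while the second channel is cleared as soon as the doubled window brings in enough successful samples.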
Final Effects
Rapid fault alerts with precise root‑cause identification.
Deduplication of repeated alerts.
Automatic recovery detection and channel re‑enablement.
Future Plans
Further refine the monitoring algorithm to achieve >99% alert accuracy.
Integrate automatic channel offline switching upon fault detection.
Implement automatic channel re‑online detection and activation after recovery.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
