How We Built an Automated Payment Channel Management System with Redis and Prometheus

To handle growing payment traffic and unreliable third‑party gateways, the team at Zhuanzhuan designed an automated payment‑channel management platform that uses a custom Redis‑based time‑series store, Prometheus monitoring, and a sliding‑window failure‑rate algorithm to detect, alert, and eventually auto‑switch faulty channels.

dbaplus Community

Background

As business volume increased, Zhuanzhuan integrated many payment channels, but third‑party stability varied, causing frequent failures that were only detected after alerts or user complaints. Relying solely on manual maintenance was insufficient for a core payment system, prompting the need for an automated payment‑channel management solution.

Design Goals

Monitor multiple channels and entities simultaneously.

Rapid fault detection and root‑cause localization.

Minimize false positives and false negatives.

Enable automatic channel failover.

Technology Selection

The team evaluated existing circuit‑breaker libraries (e.g., Hystrix) and found them unsuitable because they operate at the interface level and cannot handle channel‑ or merchant‑level degradation, nor allow custom traffic probing during failover.

For time-series storage, the team compared popular databases and narrowed the choice to Prometheus and a custom Redis-based solution. Prometheus is simple and reliable, but it trades exact counts for those qualities, and that loss of precision is unacceptable when the data drives highly sensitive channel switching. Redis is already familiar to Java backend developers, so it carries lower learning and maintenance costs.

Time‑Series Database Ranking

Architecture Design

Payment requests first pass through a routing layer that selects usable channels. After a channel is chosen, the gateway processes the order and reports the third‑party response via MQ to the monitoring system.

The monitoring system stores incoming data in Redis, then a calculation module aggregates failure rates per channel and triggers alerts based on configured rules. Redis data is periodically backed up to MySQL, and stale entries are cleaned to control size.

For visual inspection, aggregated metrics are also pushed to Prometheus and displayed on Grafana dashboards.
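As an illustration of what an aggregated metric could look like on its way to Prometheus, the sketch below renders a gauge in the Prometheus text exposition format. The metric name, label names, and helper functions are hypothetical, and the article does not say whether the team exposes metrics for scraping or sends them through a Pushgateway; either path accepts this text format.

```python
def escape_label(value):
    """Escape a label value per the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge family; `samples` is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{k}="{escape_label(v)}"' for k, v in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and labels, for illustration only.
text = render_gauge(
    "payment_channel_failure_rate",
    "Failure rate over the sliding window",
    [({"channel": "wechat", "merchant": "100000111"}, 0.33)],
)
print(text)
```

Keeping the channel and merchant as labels rather than baking them into the metric name lets a Grafana panel filter or group by either dimension.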

Channel Metrics Dashboard

Implementation Details

1. Data Structure

The Redis schema mimics time‑series concepts:

tags (set) : stores monitored dimensions, e.g., merchant IDs.

times (zset) : stores timestamps (seconds) per merchant, enabling range queries ordered by score.

fields (hash) : stores per‑second success/failure counts and error reasons.

Example storage layout (the item name 微信-打款-100000111 reads "WeChat-payout-100000111", ending in the merchant ID):

1. set
   key: routeAlarm:alarmitems
   value: 微信-打款-100000111
          微信-打款-100000112
          ...
2. zset
   key: routeAlarm:alarmitem:timeStore:微信-打款-100000111
   score/value: 1657164225, 1657164226, ...
3. hash
   key: routeAlarm:alarmitem:fieldStore:微信-打款-100000111:1657164225
   success: 10
   fail: 5
   balance_not_enough: 3
   thrid_error: 2
   ...
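
A minimal Python sketch of how one gateway response could be written into the three structures above. The function name record_result and the InMemoryStore stand-in are hypothetical. The stand-in implements only the three calls used (sadd, zadd, hincrby), with the same call shape as the redis-py client, so the example runs without a Redis server. The article does not show the team's actual write path, so treat this as one plausible reading of the layout.

```python
from collections import defaultdict

PREFIX = "routeAlarm"  # key prefix taken from the layout above


class InMemoryStore:
    """Stand-in for a Redis client (assumption: a real deployment would pass
    a redis.Redis instance, which offers these same three methods)."""

    def __init__(self):
        self.sets = defaultdict(set)
        self.zsets = defaultdict(dict)  # key -> {member: score}
        self.hashes = defaultdict(lambda: defaultdict(int))

    def sadd(self, name, *values):
        self.sets[name].update(values)

    def zadd(self, name, mapping):
        self.zsets[name].update(mapping)

    def hincrby(self, name, key, amount=1):
        self.hashes[name][key] += amount
        return self.hashes[name][key]


def record_result(store, item, ts, success, reason=None):
    """Record one third-party response for a monitored item at second `ts`."""
    # tags: register the monitored item
    store.sadd(f"{PREFIX}:alarmitems", item)
    # times: index the second so it can be range-queried by score later
    store.zadd(f"{PREFIX}:alarmitem:timeStore:{item}", {str(ts): ts})
    # fields: bump the per-second counters
    field_key = f"{PREFIX}:alarmitem:fieldStore:{item}:{ts}"
    store.hincrby(field_key, "success" if success else "fail", 1)
    if not success and reason:
        store.hincrby(field_key, reason, 1)  # per-reason failure counter
    return field_key


store = InMemoryStore()
item = "微信-打款-100000111"
for _ in range(10):
    record_result(store, item, 1657164225, success=True)
# Reason names are spelled exactly as in the layout above.
for reason in ["balance_not_enough"] * 3 + ["thrid_error"] * 2:
    key = record_result(store, item, 1657164225, success=False, reason=reason)
print(dict(store.hashes[key]))
```

Replaying ten successes and five failures for the same second reproduces the hash shown in the example layout.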

2. Core Algorithm

A hybrid of local counting and a sliding window is used. Each second records success and failure counts; the algorithm computes aggregate counts over a configurable window (e.g., 1 minute with a 10‑second sampling interval) to derive per‑channel failure rates.

Core Algorithm Diagram

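Monitoring frequency and window size both affect accuracy. If the window is too short, each evaluation sees too few samples and a handful of failures can trigger a false positive; if the window is too long or is evaluated too rarely, short failure spikes are diluted or missed, leading to false negatives.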
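
The sliding-window computation described in this section can be sketched as follows, using the 1-minute window and 10-second sampling interval from the example. The function names, the 0.5 alert threshold, and the minimum sample count are assumptions made for illustration; the article does not state the team's actual rule values.

```python
def window_failure_rate(per_second, now, window_seconds=60):
    """Aggregate per-second (success, fail) counts over the last window.

    `per_second` maps a Unix second to a (success, fail) tuple.
    Returns (failure_rate, total); the rate is None when the window is empty.
    """
    success = fail = 0
    for ts in range(now - window_seconds + 1, now + 1):
        s, f = per_second.get(ts, (0, 0))
        success += s
        fail += f
    total = success + fail
    return (fail / total if total else None), total


def scan(per_second, start, end, window_seconds=60, step=10,
         threshold=0.5, min_samples=5):
    """Evaluate the window every `step` seconds and collect alerting times."""
    alerts = []
    for now in range(start, end + 1, step):
        rate, total = window_failure_rate(per_second, now, window_seconds)
        if rate is not None and total >= min_samples and rate >= threshold:
            alerts.append((now, round(rate, 2)))
    return alerts


# Synthetic data: one success per second for two minutes, then 30 s of failures.
base = 1657164000
data = {base + i: (1, 0) for i in range(120)}
data.update({base + 120 + i: (0, 1) for i in range(30)})
alerts = scan(data, base + 60, base + 150)
print(alerts)  # [(1657164150, 0.51)]
```

In this synthetic run the failure rate climbs with each evaluation after the outage begins, and the alert fires only once failures make up at least half of the window. That delay is the trade-off between window size and detection speed.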

3. Handling Low‑Traffic Channels

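For channels with very few transactions, a single failure within the window does not raise an alert immediately. Instead, the window is enlarged (doubled repeatedly, up to ten times its original size) so that more samples are considered before an alert is raised, ensuring that rare but critical failures are not ignored.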
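
One way to read the window expansion is sketched below. The function name, the minimum sample count, the threshold, and the exact growth sequence (1x, 2x, 4x, 8x, then capped at 10x) are assumptions; the article only says that the window doubles up to ten-fold.

```python
def evaluate_low_traffic(per_second, now, base_window=60, min_samples=5,
                         max_factor=10, threshold=0.5):
    """Widen the look-back window until it holds enough samples to judge.

    `per_second` maps a Unix second to a (success, fail) tuple.
    Returns (should_alert, window_used, failure_rate).
    """
    factor = 1
    while True:
        window = base_window * factor
        success = fail = 0
        for ts, (s, f) in per_second.items():
            if now - window < ts <= now:
                success += s
                fail += f
        total = success + fail
        if total >= min_samples or factor >= max_factor:
            break
        factor = min(factor * 2, max_factor)  # double, but never beyond 10x
    rate = fail / total if total else 0.0
    return (fail > 0 and rate >= threshold), window, rate


now = 1000
# One recent failure, with older successes only visible in wider windows.
sparse = {995: (0, 1), 900: (1, 0), 820: (1, 0), 700: (1, 0), 650: (1, 0), 600: (1, 0)}
print(evaluate_low_traffic(sparse, now))         # no alert, decided on a 480 s window
print(evaluate_low_traffic({995: (0, 1)}, now))  # alert, decided on the 600 s cap
```

In the first call the lone failure is outweighed once the window grows to 480 seconds and takes in five earlier successes. In the second call no other traffic exists even at the ten-fold cap, so the failure is reported instead of being dropped.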

Final Effects

Rapid fault alerts with precise root‑cause identification.

Deduplication of repeated alerts.

Automatic recovery detection and channel re‑enablement.

Channel Exception Alert
Merged Duplicate Alerts
Channel Recovery
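
The alert, merge, and recovery behaviour listed above could be modelled with a small per-item state machine. This is a hypothetical sketch: the class name, the merge interval, and the message texts are invented for illustration, and the article does not describe how the team implemented deduplication.

```python
class AlertState:
    """Tracks one monitored item: open alert, merged repeats, and recovery."""

    def __init__(self, merge_interval=300):
        self.merge_interval = merge_interval  # seconds between notifications
        self.alerting = False
        self.last_sent = None
        self.merged = 0

    def observe(self, now, unhealthy):
        """Return the notification to send for this evaluation, or None."""
        if unhealthy:
            if not self.alerting or now - self.last_sent >= self.merge_interval:
                merged, self.merged = self.merged, 0
                self.alerting, self.last_sent = True, now
                return f"ALERT (merged {merged} repeats)" if merged else "ALERT"
            self.merged += 1  # suppress, but remember how many were merged
            return None
        if self.alerting:
            self.alerting, self.merged = False, 0
            return "RECOVERED"
        return None


state = AlertState(merge_interval=30)
timeline = [(0, True), (10, True), (20, True), (30, True), (40, False), (50, False)]
messages = [state.observe(t, bad) for t, bad in timeline]
print(messages)
# ['ALERT', None, None, 'ALERT (merged 2 repeats)', 'RECOVERED', None]
```

Repeated unhealthy evaluations inside the merge interval are counted instead of sent, and the first healthy evaluation after an open alert produces a single recovery message.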

Future Plans

Further refine the monitoring algorithm to achieve >99% alert accuracy.

Take a faulty channel offline automatically as soon as a fault is detected.

Detect recovery automatically and bring the channel back online.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
