Designing a High‑Availability Payment System: Flow, Optimization, and Fault Tolerance
This article details the end‑to‑end design of a payment system, covering transaction flow, horizontal and vertical pre‑optimizations, task scheduling, sharding strategies, data structures, high‑availability mechanisms such as channel isolation and Hystrix, and future planning for dynamic scaling and intelligent routing.
3.2 Business Process
A consumer transaction first enters the payment core, where it goes through initialization, risk assessment, channel routing, message assembly, and submission. For an asynchronous transaction, the payment core publishes a message to an MQ cluster. A job listener consumes the message and caches it, and a scheduled task periodically queries the transaction status. The resulting status can then be retrieved through an internal query service.
3.3 Horizontal Pre‑Optimization
Access Layer : Consolidate common entry points. All order‑related requests, regardless of payment channel, go through a unified order API identified by a serviceId.
Service Layer : Extract core business logic into shared services. The services are selected at runtime via serviceId, minimizing changes to the business code.
Cache Layer : Add a caching layer above the database to store transaction information. All subsequent read/write operations hit the cache first, reducing DB load and improving latency.
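The unified entry point keyed by serviceId can be pictured as a small dispatch table. The following is a minimal Python sketch under stated assumptions, not the system's actual code: the names `register` and `handle_order`, the serviceId values "pay" and "refund", and the shape of the order dictionary are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a serviceId to its handler.
# None of these names come from the article.
_HANDLERS: Dict[str, Callable[[dict], dict]] = {}


def register(service_id: str):
    """Decorator that binds a handler function to a serviceId."""
    def wrap(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        _HANDLERS[service_id] = fn
        return fn
    return wrap


@register("pay")
def pay(order: dict) -> dict:
    return {"order_no": order["order_no"], "action": "pay", "status": "PROCESSING"}


@register("refund")
def refund(order: dict) -> dict:
    return {"order_no": order["order_no"], "action": "refund", "status": "PROCESSING"}


def handle_order(service_id: str, order: dict) -> dict:
    """Unified order entry point: the service is chosen at runtime by serviceId."""
    handler = _HANDLERS.get(service_id)
    if handler is None:
        raise ValueError(f"unknown serviceId: {service_id}")
    return handler(order)
```

With this shape, adding a new service means registering one more handler; the entry point itself does not change.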
3.4 Vertical Pre‑Optimization
Core Transaction : Handles the main payment flow and refunds. This path directly impacts user perception because failures are immediately visible to the user.
Task Jobs : Synchronize transaction status asynchronously. The jobs are decoupled from the core transaction via MQ.
Query Service : Provides an internal API for querying transaction status.
3.5 Task Jobs
The internal query strategy uses two queues and a batch process:
Memory Queue : Implements delayed retries (e.g., 10 s, 5 s) or exponential back‑off (2ⁿ seconds). It is used for fast synchronization of single‑transaction status to improve user experience.
Cache Queue : Built on a Redis cluster together with the Elastic‑Job distributed scheduler. It batches delayed orders for status synchronization.
DB Batch : Also driven by Elastic‑Job, this path provides a manual‑intervention entry for cases where channel delays are excessive or abnormal.
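The memory queue with exponential back-off can be sketched as a heap ordered by due time. This is an illustrative Python sketch only: the class `MemoryDelayQueue`, the function `backoff_delay`, and the 64-second cap are assumptions, and the current time is passed in explicitly so the behavior is easy to follow.

```python
import heapq
import itertools
from typing import List, Optional, Tuple


def backoff_delay(attempt: int, cap: int = 64) -> int:
    """Seconds to wait before retry number `attempt` (0-based): 2**attempt, capped.
    The cap is an assumption; the article only mentions 2^n seconds."""
    return min(2 ** attempt, cap)


class MemoryDelayQueue:
    """In-process delay queue ordered by the time each order becomes due."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str, int]] = []
        self._seq = itertools.count()  # tie-breaker for entries with equal due times

    def schedule(self, order_no: str, attempt: int, now: float) -> None:
        """Queue a status query for `order_no` after the back-off for `attempt`."""
        due = now + backoff_delay(attempt)
        heapq.heappush(self._heap, (due, next(self._seq), order_no, attempt))

    def pop_due(self, now: float) -> Optional[Tuple[str, int]]:
        """Return (order_no, attempt) for the earliest due entry, or None if nothing is due."""
        if self._heap and self._heap[0][0] <= now:
            _, _, order_no, attempt = heapq.heappop(self._heap)
            return order_no, attempt
        return None
```

A worker would call `pop_due` in a loop, query the transaction status, and call `schedule` again with `attempt + 1` if the status is still pending.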
3.6 Sharding Strategy
Task sharding distributes a job across multiple machines, overcoming single‑machine capacity limits and reducing the impact of isolated task failures.
Elastic‑Job only assigns sharding items to job instances; developers must map each sharding item to the actual data.
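Data sharding assigns each order number to a shard by taking it modulo a fixed divisor and stores it in a Redis sorted set (zset), so that every sharding item maps to a well-defined subset of orders.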
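The mapping from sharding items to data might look like the sketch below. It is written in Python for illustration and makes several assumptions: a shard count of 4, numeric order numbers, one zset key per shard with the hypothetical name `pay:pending:<shard>`, and a simple round-robin assignment that only imitates what a scheduler does and is not Elastic-Job's own algorithm.

```python
from typing import Dict, List

SHARD_COUNT = 4  # assumed total number of sharding items; the article gives no value


def shard_of(order_no: int, shard_count: int = SHARD_COUNT) -> int:
    """Data sharding: order number modulo the shard count."""
    return order_no % shard_count


def zset_key(shard: int) -> str:
    """Hypothetical Redis key for the zset that holds one shard's order numbers."""
    return f"pay:pending:{shard}"


def assign_items(instances: List[str], shard_count: int = SHARD_COUNT) -> Dict[str, List[int]]:
    """Spread sharding items over job instances round-robin (illustration only)."""
    plan: Dict[str, List[int]] = {name: [] for name in instances}
    for item in range(shard_count):
        plan[instances[item % len(instances)]].append(item)
    return plan
```

Each job instance then processes only the keys returned by `zset_key` for the items it was assigned.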
3.7 Data Structure
Ordered Set (zset) : Stores order numbers according to the sharding logic (modulo operation) and the corresponding score range.
String : Serializes the full transaction details for quick retrieval.
Design Idea
MQ consumer (job node) receives a message and writes the transaction data into the cache.
Job node fetches a batch of order numbers from the cache based on its sharding item and score range.
The business loop processes each order number, retrieves the detailed transaction info from the cache, and executes the query logic.
Redis applies a TTL only to a whole key, not to individual zset members, so the business must handle stale entries itself, either with a dedicated cleanup task or by checking and removing expired entries during processing.
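The design idea above can be sketched end to end. The Python example below uses an in-memory stand-in (`FakeCache`) instead of a real Redis client so that it runs on its own; its methods mirror the Redis commands ZADD, ZRANGEBYSCORE, and ZREM. The key names, the shard count, the one-hour maximum age, and the use of the enqueue time as the score are all assumptions made for illustration.

```python
import bisect
import json
from typing import Callable, Dict, List, Tuple

SHARDS = 2        # assumed shard count
MAX_AGE = 3600.0  # assumed: entries older than this many seconds are treated as expired


class FakeCache:
    """In-memory stand-in for the Redis zset and string structures; not a Redis client."""

    def __init__(self) -> None:
        self.zsets: Dict[str, List[Tuple[float, str]]] = {}  # key -> sorted (score, member)
        self.strings: Dict[str, str] = {}

    def zadd(self, key: str, score: float, member: str) -> None:
        self.zrem(key, member)  # like ZADD, re-adding a member replaces its score
        bisect.insort(self.zsets[key], (score, member))

    def zrangebyscore(self, key: str, lo: float, hi: float, limit: int) -> List[str]:
        return [m for s, m in self.zsets.get(key, []) if lo <= s <= hi][:limit]

    def zrem(self, key: str, member: str) -> None:
        self.zsets[key] = [(s, m) for s, m in self.zsets.get(key, []) if m != member]


def on_message(cache: FakeCache, txn: dict, now: float) -> None:
    """MQ consumer step: index the order in its shard's zset and store the details."""
    order_no = txn["order_no"]
    cache.zadd(f"pending:{int(order_no) % SHARDS}", now, order_no)
    cache.strings[f"txn:{order_no}"] = json.dumps(txn)


def run_shard(cache: FakeCache, shard: int, now: float,
              query: Callable[[dict], str], batch: int = 100) -> List[str]:
    """Job step for one sharding item: drop expired entries, then query the rest."""
    key = f"pending:{shard}"
    for stale in cache.zrangebyscore(key, float("-inf"), now - MAX_AGE, batch):
        cache.zrem(key, stale)
        cache.strings.pop(f"txn:{stale}", None)
    done: List[str] = []
    for order_no in cache.zrangebyscore(key, now - MAX_AGE, now, batch):
        raw = cache.strings.get(f"txn:{order_no}")
        if raw is None:
            continue  # details missing; leave the entry for the cleanup path
        if query(json.loads(raw)) in ("SUCCESS", "FAILED"):  # terminal status
            cache.zrem(key, order_no)
            cache.strings.pop(f"txn:{order_no}", None)
            done.append(order_no)
    return done
```

Here expiration is handled inside the processing loop; a separate cleanup task would run the first loop of `run_shard` on its own schedule instead.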
4 High‑Availability Design
4.1 Channel Isolation
Under high concurrency, external channel stability heavily influences the system. Hystrix is used for fault tolerance with thread isolation as the chosen strategy.
4.2 Query Gateway
Query traffic can be up to six times the payment traffic. A dedicated query service isolates read‑heavy workloads from the core transaction path, protecting its performance.
4.3 Channel Merchant Cache
Static channel information (institution ID, merchant ID, keys, etc.) is cached in a distributed cache with a TTL. A manual‑update interface allows operators to adjust the data without redeploying.
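A TTL cache with a manual-update hook can be sketched as follows. This Python example is a single-process illustration rather than a distributed cache, and the class name `ChannelConfigCache`, the `loader` callback, and the 300-second default TTL are assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class ChannelConfigCache:
    """Caches static channel settings with a TTL and allows manual overrides."""

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader  # hypothetical: reads the settings from the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}  # channel -> (expires_at, value)

    def get(self, channel_id: str) -> dict:
        """Return cached settings, reloading them once the TTL has elapsed."""
        now = self._clock()
        entry = self._entries.get(channel_id)
        if entry is None or entry[0] <= now:
            value = self._loader(channel_id)
            self._entries[channel_id] = (now + self._ttl, value)
            return value
        return entry[1]

    def update(self, channel_id: str, value: dict) -> None:
        """Manual-update entry: operators replace the settings without a redeploy."""
        self._entries[channel_id] = (self._clock() + self._ttl, value)
```

The injectable `clock` is only there to make expiry easy to exercise; production code would rely on the cache server's own TTL.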
Fault‑tolerant design prevents small issues from cascading into a service avalanche.
Thread‑pool isolation: servlet containers (Tomcat, Jetty) serve requests from a shared pool of worker threads. If a downstream channel is slow and calls to it have no timeout, those calls hold worker threads until the pool is saturated; further requests are then queued or rejected, and the failure cascades to unrelated traffic.
Hystrix wraps each business request in a command that runs on its own thread pool; the pools are stored in a ConcurrentHashMap. Isolation alone is not enough: without proper timeout settings, slow calls still block indefinitely and saturate the command's own pool.
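The combination of per-channel thread pools, a timeout, and a fallback can be illustrated without Hystrix itself. The Python sketch below uses only the standard library and is not Hystrix's implementation: the class `IsolatedChannels`, the pool size, and the timeout value are assumptions. Unlike Hystrix, `ThreadPoolExecutor` queues excess work instead of rejecting it, and a timed-out call keeps its worker busy until it returns.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class IsolatedChannels:
    """One small thread pool per channel, with a timeout and a fallback on every call."""

    def __init__(self, pool_size: int = 4, timeout: float = 1.0) -> None:
        self._pools: Dict[str, ThreadPoolExecutor] = {}
        self._lock = threading.Lock()
        self._pool_size = pool_size
        self._timeout = timeout

    def _pool(self, channel: str) -> ThreadPoolExecutor:
        with self._lock:  # create each channel's pool exactly once
            if channel not in self._pools:
                self._pools[channel] = ThreadPoolExecutor(
                    max_workers=self._pool_size, thread_name_prefix=channel)
            return self._pools[channel]

    def call(self, channel: str, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run `fn` on the channel's own pool; use `fallback` on timeout or error."""
        future = self._pool(channel).submit(fn)
        try:
            return future.result(timeout=self._timeout)
        except FutureTimeout:
            future.cancel()  # only effective if the task has not started yet
            return fallback()
        except Exception:
            return fallback()
```

Because each channel has its own pool, a channel whose calls hang exhausts only its own workers, and callers of that channel get the fallback after the timeout instead of blocking.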
Hystrix Thread Monitoring
Real‑time dashboards display thread‑pool usage, helping engineers decide whether to scale resources.
Key metrics integrated into the internal monitoring platform:
Node latency monitoring to pinpoint bottlenecks.
Success‑rate monitoring with aggregated transaction counts and alerts for channel failures.
Response‑code ranking for root‑cause analysis and alerting on critical codes.
Daily email inspection reports for self‑service analysis.
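Success-rate monitoring and response-code ranking reduce to simple aggregation over channel response codes. The Python sketch below is illustrative only: the success code "0000", the 95% alert threshold, and the function names are assumptions, not values from the article.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def success_rate(codes: Sequence[str], ok_codes: Tuple[str, ...] = ("0000",)) -> float:
    """Share of responses whose code counts as success (1.0 when there is no traffic)."""
    if not codes:
        return 1.0
    return sum(1 for c in codes if c in ok_codes) / len(codes)


def top_codes(codes: Sequence[str], n: int = 3) -> List[Tuple[str, int]]:
    """Response codes ranked by frequency, most common first."""
    return Counter(codes).most_common(n)


def should_alert(rate: float, threshold: float = 0.95) -> bool:
    """Raise a channel alert when the success rate drops below the threshold."""
    return rate < threshold
```

In practice these functions would run over a sliding window of recent transactions per channel.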
5 Planning
Dynamic Sharding : Automate both data and task sharding to handle continuous growth and fully utilize machine resources.
Intelligent Routing : Replace manual channel switching with automated routing that redirects traffic to alternative channels when a channel becomes abnormal or unavailable.
Full‑Link Monitoring : Build an end‑to‑end trace that records each transaction’s lifecycle across machines and channels. Visualize the trace for operations, customer service, and other stakeholders.
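As a rough picture of what automated routing could look like, the Python sketch below picks the first channel in a preference list whose recent success rate is above a threshold. It is a hypothetical illustration of the planned feature, not a description of an existing component; the function name, the 0.9 threshold, and the channel names in the usage note are assumptions.

```python
from typing import Dict, List, Optional


def pick_channel(preferred: List[str], success: Dict[str, float],
                 min_rate: float = 0.9) -> Optional[str]:
    """Return the first preferred channel that currently looks healthy, else None.
    Channels with no recent data are treated as unhealthy (rate 0.0)."""
    for channel in preferred:
        if success.get(channel, 0.0) >= min_rate:
            return channel
    return None
```

For example, `pick_channel(["bank_a", "bank_b"], {"bank_a": 0.42, "bank_b": 0.99})` skips the degraded first choice and routes to `bank_b`.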
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
