Master Payment Gateway Design: Multi‑Channel Aggregation, Smart Routing, and End‑to‑End Merchant Onboarding
The article explains how to build an enterprise‑grade payment gateway that unifies over 50 providers, performs millisecond‑level smart routing, handles failover, dynamic fee calculation, automated merchant onboarding, sharded storage, and comprehensive monitoring to sustain millions of transactions per day.
Why a full‑featured payment gateway is needed
Developers often start with a simple payment interface that receives parameters, calls a third‑party API, and returns a result. In real business scenarios this quickly leads to single‑channel failures that block all orders, TPS spikes that cause system collapse, complex merchant fee structures that are hard to maintain, and reconciliation mismatches. The root cause is that only a "payment interface" has been built, not a complete "payment gateway system".
Core capabilities of an enterprise payment gateway
Unified access to multiple payment methods (cards, UPI, wallets, online banking)
Aggregation of more than 50 payment providers
Intelligent routing decisions based on cost, success rate, and latency
Automatic failover handling
Merchant onboarding and KYC workflow
Dynamic fee calculation
Reconciliation and settlement
Scale expectations
Monthly transaction volume: 1 billion
Peak TPS: >20,000
Number of merchants: 100,000
Number of payment providers: 50+
Storage requirements:
Monthly transaction data: 2 TB
7‑year retention: 168 TB
Consequences:
Single‑node solutions are infeasible
Distributed architecture with sharding is mandatory
All core logic must complete within milliseconds
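The scale figures above can be sanity-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the 30-day month and decimal terabytes (10^12 bytes) are assumptions, not figures from the article.

```python
# Back-of-the-envelope check of the stated scale figures.
# Assumptions (not from the article): a 30-day month, 1 TB = 10**12 bytes.
SECONDS_PER_MONTH = 30 * 24 * 3600

monthly_transactions = 1_000_000_000
peak_tps = 20_000
monthly_storage_tb = 2
retention_years = 7

average_tps = monthly_transactions / SECONDS_PER_MONTH
peak_to_average = peak_tps / average_tps
bytes_per_transaction = monthly_storage_tb * 10**12 / monthly_transactions
retention_tb = monthly_storage_tb * 12 * retention_years

print(f"average TPS:            {average_tps:.0f}")
print(f"peak / average:         {peak_to_average:.0f}x")
print(f"bytes per transaction:  {bytes_per_transaction:.0f}")
print(f"7-year retention (TB):  {retention_tb}")
```

Under these assumptions the average load is only a few hundred TPS, so the 20,000 TPS peak target implies provisioning for bursts roughly fifty times the average.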
Overall architecture
Core service breakdown
PaymentService
Unified entry point
Manages transaction lifecycle
Coordinates routing, fee calculation, and channel invocation
RoutingService
Selects the optimal payment channel
Real‑time decision making (<50 ms)
MerchantService
Merchant registration
KYC verification
API key management
ReconciliationService
Matches gateway records with provider data
SettlementService
Handles T+1 / T+2 fund settlement
Smart routing as the competitive edge
Routing scoring model
ProviderScore =
costWeight × costScore +
latencyWeight × latencyScore +
successRateWeight × successRateScore +
healthWeight × healthScore +
loadWeight × loadScore
Default weight example:
Cost: 30%
Success rate: 35%
Latency: 20%
Health: 10%
Load: 5%
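A minimal sketch of the scoring model with the default weights might look like the following. It assumes every component score is already normalized to the range 0 to 1 with higher meaning better (so a cheap or fast provider scores high); the `ProviderScores` type, the function names, and the sample providers are hypothetical.

```python
from dataclasses import dataclass

# Default weights from the article; they sum to 1.0.
DEFAULT_WEIGHTS = {
    "cost": 0.30,
    "success_rate": 0.35,
    "latency": 0.20,
    "health": 0.10,
    "load": 0.05,
}


@dataclass(frozen=True)
class ProviderScores:
    """Hypothetical container; each score is assumed normalized to 0..1, higher is better."""
    name: str
    cost: float
    success_rate: float
    latency: float
    health: float
    load: float


def provider_score(p: ProviderScores, weights: dict = DEFAULT_WEIGHTS) -> float:
    # Weighted sum of the five component scores.
    return sum(weight * getattr(p, field) for field, weight in weights.items())


def pick_provider(candidates: list, weights: dict = DEFAULT_WEIGHTS) -> ProviderScores:
    # The provider with the highest weighted score wins.
    return max(candidates, key=lambda p: provider_score(p, weights))


cheap = ProviderScores("cheap", cost=0.9, success_rate=0.80, latency=0.7, health=1.0, load=0.5)
reliable = ProviderScores("reliable", cost=0.6, success_rate=0.99, latency=0.9, health=1.0, load=0.8)
best = pick_provider([cheap, reliable])
print(best.name, round(provider_score(best), 4))
```

With these sample numbers the more reliable provider wins despite its higher cost, because success rate carries the largest weight.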
Routing flow
Failover and circuit breaker
Circuit breaker states
Closed – normal operation
Open – requests fail immediately without reaching the provider
Half‑Open – trial requests test whether the provider has recovered
Strategy:
5 consecutive failures → Open
After 60 s → Half‑Open
Successful recovery → Closed
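The state machine above can be sketched as a small class. This is an illustrative sketch rather than a production breaker: the class and method names are hypothetical, the clock is injectable so the 60 s timer can be tested, and the rule that a failed trial in Half‑Open reopens the circuit is an assumption taken from the usual circuit breaker pattern, since the article does not state it.

```python
import time


class CircuitBreaker:
    """Closed -> Open after N consecutive failures; Open -> Half-Open after a cooldown."""

    def __init__(self, failure_threshold=5, open_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self._state = "closed"

    @property
    def state(self):
        # Lazily move Open -> Half-Open once the cooldown has elapsed.
        if self._state == "open" and self._clock() - self._opened_at >= self.open_seconds:
            self._state = "half_open"
        return self._state

    def allow_request(self):
        # Open means fail fast; Closed and Half-Open let the call through.
        return self.state != "open"

    def record_success(self):
        # Any success (including a Half-Open trial) closes the circuit.
        self._failures = 0
        self._opened_at = None
        self._state = "closed"

    def record_failure(self):
        if self.state == "half_open":
            # Assumption: a failed trial reopens the circuit immediately.
            self._trip()
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self._state = "open"
        self._opened_at = self._clock()
        self._failures = 0
```

A router would keep one breaker per provider, skip providers whose `allow_request()` returns False, and fail over to the next-best candidate.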
Dynamic fee calculation
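As a minimal sketch, the four fee modes this section covers (percentage, fixed, tiered, hybrid) could be implemented as below. Several details are assumptions rather than statements from the article: the function names, rounding half-up to two decimals, applying a tier's rate to the whole amount rather than marginally, assigning a boundary amount such as 1000 to the lower tier, and rejecting amounts above the last tier.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

# (inclusive upper bound, rate) pairs using the article's example tiers.
EXAMPLE_TIERS = [
    (Decimal("1000"), Decimal("0.02")),
    (Decimal("10000"), Decimal("0.015")),
]


def _to_cents(value: Decimal) -> Decimal:
    # Assumption: fees are rounded half-up to two decimal places.
    return value.quantize(CENT, rounding=ROUND_HALF_UP)


def percentage_fee(amount: Decimal, rate: Decimal) -> Decimal:
    return _to_cents(amount * rate)


def fixed_fee(fixed: Decimal) -> Decimal:
    return _to_cents(fixed)


def tiered_fee(amount: Decimal, tiers=EXAMPLE_TIERS) -> Decimal:
    # Assumption: the matching tier's rate applies to the whole amount.
    for upper_bound, rate in tiers:
        if amount <= upper_bound:
            return _to_cents(amount * rate)
    raise ValueError(f"no tier defined for amount {amount}")


def hybrid_fee(amount: Decimal, fixed: Decimal, rate: Decimal) -> Decimal:
    return _to_cents(fixed + amount * rate)


print(percentage_fee(Decimal("200"), Decimal("0.02")))
print(tiered_fee(Decimal("5000")))
print(hybrid_fee(Decimal("100"), Decimal("0.30"), Decimal("0.029")))
```

`Decimal` is used instead of `float` so that monetary arithmetic stays exact until the final rounding step.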
Supported modes:
1. Percentage
Fee = amount × rate
2. Fixed amount
Fee = fixed
3. Tiered pricing
0‑1000: 2%
1000‑10000: 1.5%
4. Hybrid
Fee = fixed + (amount × rate)
Automated merchant onboarding
API key generation rule
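sk_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Implementation highlights:
32‑character random string after the sk_live_ prefix
Store only the SHA‑256 hash of the key, never the plaintext
Return the plaintext key to the merchant only once, at creation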
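A sketch of the key generation rule using only the Python standard library is shown below. The alphanumeric alphabet and the function names are assumptions; the article only specifies the prefix, the 32-character length, and SHA‑256 hashing.

```python
import hashlib
import hmac
import secrets
import string

KEY_PREFIX = "sk_live_"
ALPHABET = string.ascii_letters + string.digits  # assumed character set


def generate_api_key():
    """Return (plaintext_key, sha256_hex). Persist only the hash."""
    random_part = "".join(secrets.choice(ALPHABET) for _ in range(32))
    plaintext = KEY_PREFIX + random_part
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def verify_api_key(presented_key: str, stored_hash: str) -> bool:
    """Hash the presented key and compare it to the stored hash in constant time."""
    candidate = hashlib.sha256(presented_key.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)


plaintext_key, stored_hash = generate_api_key()
# plaintext_key is shown to the merchant once; only stored_hash goes to the database.
print(verify_api_key(plaintext_key, stored_hash))
```

The `secrets` module draws from the operating system's cryptographic random source, which is why it is used here rather than `random`.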
Scalable database design
Table schemas (simplified)
CREATE TABLE merchants (
    merchant_id   VARCHAR(50) PRIMARY KEY,
    business_name VARCHAR(255) NOT NULL,
    email         VARCHAR(255) UNIQUE,
    kyc_status    VARCHAR(20),
    status        VARCHAR(20),
    created_at    TIMESTAMP
);

CREATE TABLE transactions (
    transaction_id VARCHAR(50),
    merchant_id    VARCHAR(50),
    amount         DECIMAL(15,2),
    status         VARCHAR(20),
    provider_id    VARCHAR(50),
    created_at     TIMESTAMP,
    -- PostgreSQL-style partitioning: the primary key of a partitioned
    -- table must include the partition column (created_at).
    PRIMARY KEY (transaction_id, created_at)
) PARTITION BY RANGE (created_at);
Sharding strategy:
Hash‑based split by merchant_id
Time‑based monthly partitions
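How a request finds its shard and partition can be sketched as follows. The shard count of 16, the `transactions_YYYY_MM` naming scheme, and both function names are hypothetical; the article only states that data is hashed by merchant_id and partitioned monthly.

```python
import hashlib
from datetime import datetime

SHARD_COUNT = 16  # hypothetical; the article does not give a shard count


def shard_for(merchant_id: str, shard_count: int = SHARD_COUNT) -> int:
    # Use a stable digest: Python's built-in hash() for str is salted per
    # process, so it would route the same merchant differently after a restart.
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition_for(created_at: datetime) -> str:
    # One partition per calendar month, e.g. transactions_2024_03 (assumed naming).
    return f"transactions_{created_at:%Y_%m}"


print(shard_for("merchant_123"), partition_for(datetime(2024, 3, 15)))
```

Because all of a merchant's rows land on one shard, per-merchant queries stay single-shard, while the monthly partitions keep recent data small and make old data easy to archive.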
Reconciliation process
Matching rules:
Same transaction_id
Exact amount match
Timestamp difference < 5 minutes
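The three matching rules can be sketched as a simple in-memory reconciliation pass. The `Record` type, the result categories, and the assumption that both sides fit in memory are illustrative; a real run would likely stream provider settlement files in batches.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from decimal import Decimal

MAX_SKEW = timedelta(minutes=5)


@dataclass(frozen=True)
class Record:
    """Hypothetical minimal shape shared by gateway and provider records."""
    transaction_id: str
    amount: Decimal
    timestamp: datetime


def reconcile(gateway_records, provider_records):
    """Classify transaction IDs by comparing gateway records to provider records."""
    provider_by_id = {r.transaction_id: r for r in provider_records}
    result = {"matched": [], "mismatched": [], "missing_at_provider": []}
    for g in gateway_records:
        p = provider_by_id.pop(g.transaction_id, None)
        if p is None:
            result["missing_at_provider"].append(g.transaction_id)
        elif p.amount == g.amount and abs(p.timestamp - g.timestamp) < MAX_SKEW:
            result["matched"].append(g.transaction_id)
        else:
            result["mismatched"].append(g.transaction_id)
    # Whatever is left was reported by the provider but never seen by the gateway.
    result["missing_at_gateway"] = sorted(provider_by_id)
    return result
```

Anything outside the matched bucket would typically be queued for manual or automated investigation.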
Settlement workflow
T+1 – standard
T+2 – high‑risk merchants
T+0 – value‑added services
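A settlement date can be derived from the merchant's settlement class as in the sketch below. It assumes that T+N counts business days and that only weekends are skipped (no holiday calendar); the class names and the function name are hypothetical.

```python
from datetime import date, timedelta

# Settlement delay in days per merchant class (labels are hypothetical).
SETTLEMENT_DAYS = {"standard": 1, "high_risk": 2, "instant": 0}


def settlement_date(transaction_date: date, merchant_class: str) -> date:
    """Advance N business days, skipping Saturdays and Sundays (assumption)."""
    remaining = SETTLEMENT_DAYS[merchant_class]
    current = transaction_date
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


friday = date(2024, 3, 15)
print(settlement_date(friday, "standard"))
print(settlement_date(friday, "high_risk"))
```

For a Friday transaction, T+1 lands on the following Monday and T+2 on Tuesday under these assumptions.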
Idempotency design
Idempotency key format: merchant_id + order_id
Strategy:
Store keys in Redis for 24 hours
Duplicate requests return the original result
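The idempotency strategy can be sketched with an in-memory stand-in for Redis. The `IdempotencyStore` class, the `:` separator in the key, and `process_payment` are hypothetical; in a real deployment the same idea would typically use the Redis `SET` command with the `NX` and `EX 86400` options so that the check and the write are atomic across instances.

```python
import time

TTL_SECONDS = 24 * 3600  # keys live for 24 hours


class IdempotencyStore:
    """In-memory stand-in for Redis with per-key expiry (not safe across processes)."""

    def __init__(self, clock=time.monotonic):
        self._entries = {}  # key -> (expires_at, result)
        self._clock = clock

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, result = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: behave as if the key never existed
            return None
        return result

    def put(self, key, result, ttl=TTL_SECONDS):
        self._entries[key] = (self._clock() + ttl, result)


def idempotency_key(merchant_id: str, order_id: str) -> str:
    # The separator is an assumption; the article only says merchant_id + order_id.
    return f"{merchant_id}:{order_id}"


def process_payment(store, merchant_id, order_id, charge):
    """Run charge() once per key; duplicates get the stored original result."""
    key = idempotency_key(merchant_id, order_id)
    cached = store.get(key)
    if cached is not None:
        return cached
    result = charge()
    store.put(key, result)
    return result
```

Note that this sketch has a check-then-write race under concurrency, which is exactly what an atomic set-if-absent operation in the shared store is meant to close.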
System expansion strategies
Service layer
Stateless design
Horizontal scaling
Data layer
Sharding and partitioning
Read‑write separation
Cache layer
Local cache + Redis
Channel layer
Connection pooling
Multi‑instance load balancing
Monitoring and alerting
Key metrics:
Success rate
TPS
Latency (P95 / P99)
Channel health
Error rate
Alert thresholds:
Success rate < 99%
Latency > 500 ms
Error rate > 0.1%
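The alert thresholds can be expressed as a small rule check over one metrics window. The `WindowMetrics` type and alert names are hypothetical, and treating the 500 ms latency threshold as a P99 bound is an assumption, since the article does not say which percentile it applies to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    """Hypothetical aggregate over one monitoring window."""
    success_rate: float    # fraction, e.g. 0.995
    p99_latency_ms: float  # assumed percentile for the latency threshold
    error_rate: float      # fraction, e.g. 0.0005


def evaluate_alerts(m: WindowMetrics) -> list:
    """Return the names of all thresholds breached in this window."""
    fired = []
    if m.success_rate < 0.99:
        fired.append("success_rate_below_99_percent")
    if m.p99_latency_ms > 500:
        fired.append("latency_above_500_ms")
    if m.error_rate > 0.001:
        fired.append("error_rate_above_0.1_percent")
    return fired


print(evaluate_alerts(WindowMetrics(success_rate=0.985, p99_latency_ms=420, error_rate=0.002)))
```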
Cost considerations
Rough cost components:
Compute resources
Database
Cache
Network
Storage
Optimization directions:
Auto‑scaling
Spot instances
Multi‑region deployment
Future evolution
Machine‑learning‑driven routing decisions
Real‑time full‑volume reconciliation
Multi‑region disaster recovery
A/B testing of routing strategies
LuTiao Programming
LuTiao Programming is a friendly community offering free programming lessons. We inspire learners to explore new ideas and technologies and quickly acquire job-ready skills.
