Scaling a Financial Accounting System to 100k TPS with Cloud‑Native Microservices
This article examines how a ten‑year‑old financial accounting platform transformed from a monolithic design into a cloud‑native, micro‑service architecture that achieved massive scalability, high availability, and 24‑hour real‑time processing through distributed batch scheduling, elastic scaling, and intelligent fault‑tolerance.
Introduction
As the digital economy grows, the financial industry is under pressure to improve service efficiency. The "Ma Shang Consumption" accounting system faces two challenges: exponential business growth, and the need to complete end-of-day batch processing quickly.
Background
After ten years in production, the traditional monolithic accounting system can no longer handle billions of users and massive data volumes. The article identifies four core pain points: performance bottlenecks, highly complex business scenarios, explosive data growth, and insufficient high-availability guarantees.
Architecture Evolution
Stage 1.0 (2015‑2020): traditional monolith with waterfall development, single‑machine batch processing, limited scalability, and a daily processing capacity of about 1 million records.
Stage 2.0 (2020‑2025): cloud‑native architecture. Its core technical advantages are a distributed unit design (256 logical shards with multi‑datacenter active‑active deployment), an upgraded batch engine (a self‑developed DTS scheduler that supports dynamic DAGs), and full containerization with minute‑level elastic scaling.
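The 256 logical shards mentioned above can be illustrated with a minimal routing sketch. The article does not say how accounts are mapped to shards or how shards are placed in datacenters, so the CRC32 hash, the round-robin placement, and the names shard_for and datacenter_for are all assumptions made for illustration.

```python
import zlib

NUM_SHARDS = 256  # logical shard count stated in the article


def shard_for(account_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an account ID to a logical shard with a stable hash (assumed scheme)."""
    # zlib.crc32 is deterministic across processes, unlike the built-in hash()
    return zlib.crc32(account_id.encode("utf-8")) % num_shards


def datacenter_for(shard: int, datacenters: list[str]) -> str:
    """Pick a home datacenter for a shard round-robin (hypothetical policy)."""
    return datacenters[shard % len(datacenters)]


if __name__ == "__main__":
    shard = shard_for("ACC-0001")
    print(shard, datacenter_for(shard, ["idc-a", "idc-b"]))
```

Keeping the shards logical rather than physical means the shard count can stay fixed at 256 while the mapping from shard to machine or datacenter changes during scaling.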
Batch Process Concepts
Job: the smallest configurable batch unit, which may contain many sub‑tasks such as interest accrual and settlement.
Job Group: one or more jobs that can declare dependencies; jobs within the same group run in parallel or sequentially based on those dependencies.
Task: composed of one or more job groups; tasks are triggered by schedule or manually.
Task Chain: multiple tasks linked by dependency, e.g., DayEndBiz → DayEndMsg → GlsDayEnd.
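The four concepts above can be modeled as a small data structure plus a topological sort. This is a sketch, not the DTS scheduler's actual model: the class layout, the waves and chain_order helpers, and the sample job names are hypothetical. It uses graphlib.TopologicalSorter from the Python standard library (3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Job:
    name: str
    depends_on: set[str] = field(default_factory=set)  # jobs in the same group


@dataclass
class JobGroup:
    name: str
    jobs: list[Job]

    def waves(self) -> list[list[str]]:
        """Group jobs into waves; jobs in one wave have no mutual dependencies."""
        sorter = TopologicalSorter({j.name: j.depends_on for j in self.jobs})
        sorter.prepare()  # raises graphlib.CycleError on circular dependencies
        result = []
        while sorter.is_active():
            ready = sorted(sorter.get_ready())  # sorted only for stable output
            result.append(ready)
            sorter.done(*ready)
        return result


@dataclass
class Task:
    name: str
    groups: list[JobGroup]


def chain_order(chain: dict[str, set[str]]) -> list[str]:
    """Order a task chain; `chain` maps each task to its upstream tasks."""
    return list(TopologicalSorter(chain).static_order())


if __name__ == "__main__":
    group = JobGroup("day_end", [
        Job("accrue_interest"),
        Job("settle", {"accrue_interest"}),
        Job("report"),
    ])
    print(group.waves())
    print(chain_order({
        "DayEndBiz": set(),
        "DayEndMsg": {"DayEndBiz"},
        "GlsDayEnd": {"DayEndMsg"},
    }))
```

Each wave holds jobs that may run in parallel, and the waves themselves run in sequence, which matches the "parallel or sequential based on dependencies" behavior described for job groups.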
Batch Scheduling Flow
Forward scheduling uses a dual‑center IDC deployment. Applications are classified as global (strong consistency), hot‑standby (fast failover), multi‑active (conflict resolution), or unitized (sharding). DTS provides a standard SDK and a status‑notification protocol for real‑time job state reporting.
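The article does not describe the wire format of the status‑notification protocol, so the following is only a sketch of what real‑time job state reporting could look like. The state names, the allowed transitions, the JSON fields, and the JobReporter class are assumptions, not the DTS SDK's API.

```python
import json
import time
from enum import Enum
from typing import Callable


class JobState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Assumed legal transitions; FAILED -> RUNNING models a retry.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED},
    JobState.FAILED: {JobState.RUNNING},
    JobState.SUCCEEDED: set(),
}


class JobReporter:
    """Validates state transitions and pushes each change to the scheduler."""

    def __init__(self, job_id: str, send: Callable[[str], None]):
        self.job_id = job_id
        self.state = JobState.PENDING
        self._send = send  # transport is injected, e.g. an HTTP or MQ client

    def report(self, new_state: JobState, detail: str = "") -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
        self._send(json.dumps({
            "job_id": self.job_id,
            "state": new_state.value,
            "detail": detail,
            "ts": time.time(),
        }))


if __name__ == "__main__":
    reporter = JobReporter("interest-accrual-001", print)
    reporter.report(JobState.RUNNING)
    reporter.report(JobState.SUCCEEDED)
```

Rejecting illegal transitions on the client side keeps the scheduler's view of each job consistent even when notifications are retried or arrive late.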
Fault‑Tolerant Design
Automatic pod restart, rolling updates, dynamic scaling, and scheduler‑driven relocation ensure self‑healing. Health checks and log/metric review verify recovery, reducing failover time from hours to seconds.
Traffic Buffering (Flood‑Control Pool)
A distributed queue acts as a “data flood pool” before the batch pipeline, while layered buffering, dynamic throttling, and priority scheduling smooth peak loads and fill valleys.
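The flood‑pool idea can be sketched in a single process as a bounded priority buffer drained at a controlled rate. The real system uses a distributed queue, so this only illustrates the buffering, throttling, and priority logic. The FloodPool class, its parameters, and the "lower number is served first" convention are assumptions.

```python
import heapq
import itertools
from typing import Any, Optional


class FloodPool:
    """Bounded priority buffer that releases at most `rate_per_tick` items per drain."""

    def __init__(self, capacity: int, rate_per_tick: int):
        self.capacity = capacity
        self.rate_per_tick = rate_per_tick
        self._heap: list[tuple[int, int, Any]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def offer(self, item: Any, priority: int) -> bool:
        """Buffer an item; return False when full so the producer can back off."""
        if len(self._heap) >= self.capacity:
            return False
        heapq.heappush(self._heap, (priority, next(self._seq), item))
        return True

    def drain(self, rate: Optional[int] = None) -> list[Any]:
        """Release up to `rate` items (default: rate_per_tick), highest priority first."""
        limit = self.rate_per_tick if rate is None else rate
        out = []
        while self._heap and len(out) < limit:
            out.append(heapq.heappop(self._heap)[2])
        return out


if __name__ == "__main__":
    pool = FloodPool(capacity=3, rate_per_tick=2)
    pool.offer("low", priority=9)
    pool.offer("high", priority=1)
    pool.offer("mid", priority=5)
    print(pool.drain())        # downstream receives at most two items per tick
    print(pool.drain(rate=5))  # the rate can be raised when the pipeline has headroom
```

Passing a different rate to drain on each tick is one simple way to express dynamic throttling: the consumer lowers the rate during peaks and raises it during valleys, while the buffer absorbs the difference.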
7×24 Hour Accounting
The loan‑contract priority batch and dual‑balance model enable real‑time transaction posting, compressing latency from hours to under one minute.
Results
Throughput increased from 15k TPS to 80k TPS, batch latency improved by over 80%, annual server costs fell by ¥10 million, resource utilization rose by 300%, and failover time dropped from hours to seconds.
Future Outlook
Future plans focus on real‑time user experience, AI‑driven monitoring, intelligent batch processing, and cross‑cloud resource orchestration to build a service‑oriented, intelligent financial core.
