Evolution and Architecture of a Financial Risk Control System: From Monolith to Microservices and Commercialization
This article details the design, refactoring, performance optimization, reliability monitoring, and commercialization of a financial risk control system, covering rule abstraction, decision workflows, feature engineering, model integration, and the trade‑offs between latency and accuracy in large‑scale production environments.
01 Phase 1: Risk Service Evolution
The early risk system was a large monolithic application that bundled rule decisioning, workflow configuration, model computation, data integration, and feature processing. This led to low change efficiency, duplicated code, a high production failure rate, and difficulty supporting multiple business lines.
1. Early System Problems
Low change efficiency, long cycles
Fragmented, repetitive requirements
Severe code coupling
High production fault rate
Inability to meet multi‑business, multi‑scenario needs
2. Refactoring Abstraction
We abstracted the risk engine into six elements—rule, decision, workflow, model, feature, and data—allowing separation of rule logic from code via a domain‑specific language (DSL) and enabling visual configuration.
Rule Abstraction
Rules are expressed as feature‑operator‑threshold triples and compiled into executable DSL text that can be parsed by engines such as Drools, Groovy, or QlExpress.
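The feature‑operator‑threshold triple can be sketched as follows; the names (`Rule`, `to_dsl`) are illustrative, not the platform's actual API, and the "DSL" emitted here is a plain expression of the kind an engine like Groovy or QlExpress could evaluate.

```python
from dataclasses import dataclass

# Supported comparison operators for the triple's middle element.
OPS = {
    ">":  lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<":  lambda a, b: a < b,
    "==": lambda a, b: a == b,
}

@dataclass
class Rule:
    feature: str     # feature name, e.g. "login_count_1h"
    operator: str    # one of the keys in OPS
    threshold: float

    def to_dsl(self) -> str:
        # Compile the triple into executable DSL text.
        return f"{self.feature} {self.operator} {self.threshold}"

    def evaluate(self, features: dict) -> bool:
        # Direct in-process evaluation of the same triple.
        return OPS[self.operator](features[self.feature], self.threshold)

rule = Rule("login_count_1h", ">", 50)
print(rule.to_dsl())                          # login_count_1h > 50
print(rule.evaluate({"login_count_1h": 73}))  # True
```

Storing rules as data rather than code is what allows risk experts to edit them from a console and publish without a deployment.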
Combined Decision
Multiple rules can be grouped into rule sets, decision trees, matrices, or tables, providing conflict resolution and hierarchical decision flow.
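One common conflict-resolution strategy for a rule set is priority-based: every rule carries a decision and a priority, and among all hit rules the highest-priority decision wins. A minimal sketch, assuming this strategy (the platform's actual policy may differ):

```python
def decide(rule_set, features):
    """Evaluate all rules; resolve conflicts by highest priority."""
    hits = [(prio, decision)
            for prio, decision, pred in rule_set
            if pred(features)]
    if not hits:
        return "accept"          # no rule hit: default outcome
    return max(hits)[0:2][1]     # highest-priority decision wins

# Illustrative rule set: (priority, decision, predicate).
rule_set = [
    (10, "reject", lambda f: f["amount"] > 10_000),
    (5,  "review", lambda f: f["new_device"]),
    (1,  "accept", lambda f: True),
]

print(decide(rule_set, {"amount": 20_000, "new_device": True}))  # reject
print(decide(rule_set, {"amount": 100,    "new_device": True}))  # review
```

Decision trees and matrices are alternative combinators over the same rule primitives, trading flat priority for explicit hierarchy.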
Workflow Orchestration
Decision flows are composed of sequential and branch nodes (e.g., A/B testing), executed via pipeline or Rete algorithms; we adopt pipeline for simplicity and performance.
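The pipeline execution model can be sketched as a list of nodes, where a node is either a plain function (sequential) or a branch that routes the context down one of two sub-pipelines, as in an A/B split. This is an illustrative sketch, not the production orchestrator:

```python
def run_pipeline(nodes, ctx):
    """Execute nodes in order; a tuple node is (condition, if_true, if_false)."""
    for node in nodes:
        if isinstance(node, tuple):
            cond, if_true, if_false = node
            run_pipeline(if_true if cond(ctx) else if_false, ctx)
        else:
            node(ctx)  # sequential node mutates the shared context
    return ctx

# Illustrative nodes: feature loading, then an A/B split between two models.
def load_features(ctx): ctx["score"] = 0.42
def champion(ctx):      ctx["decision"] = "model_v1"
def challenger(ctx):    ctx["decision"] = "model_v2"

ctx = run_pipeline(
    [load_features,
     # Route ~10% of users (by user-id bucket) to the challenger model.
     (lambda c: c["user_id"] % 100 < 10, [challenger], [champion])],
    {"user_id": 123})
print(ctx["decision"])  # model_v1
```

A pipeline like this is trivially debuggable and fast for linear flows, which is why it was preferred over a Rete network here.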
Visualization
A visual console lets risk experts adjust rules and flows without developer intervention, storing configurations in relational databases and converting them to DSL at publish time.
Model Integration
Machine‑learning models (online prediction and offline training) complement rule‑based decisions, forming a closed‑loop model lifecycle.
Data Features
Features are derived from internal (first‑party) and external (third‑party) data sources, processed by a feature engine that resolves dependencies via DAG execution.
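DAG-based feature resolution can be sketched with a topological sort: each feature declares the features it depends on, and computing in topological order guarantees inputs exist before they are read. The feature definitions below are made up for illustration:

```python
from graphlib import TopologicalSorter

# Illustrative registry: name -> (compute function, dependency names).
FEATURES = {
    "raw_txn":    (lambda f: 120.0,                  []),
    "txn_log":    (lambda f: f["raw_txn"] * 2,       ["raw_txn"]),
    "risk_score": (lambda f: f["txn_log"] + 1,       ["txn_log", "raw_txn"]),
}

def compute_all():
    # Build the dependency graph and derive a valid evaluation order.
    graph = {name: deps for name, (_, deps) in FEATURES.items()}
    order = TopologicalSorter(graph).static_order()
    out = {}
    for name in order:
        fn, _ = FEATURES[name]
        out[name] = fn(out)   # dependencies are already in `out`
    return out

print(compute_all()["risk_score"])  # 241.0
```

In production the same idea extends naturally to parallel execution of independent branches, since the DAG makes data dependencies explicit.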
02 Phase 2: Performance and Reliability
Growth in decision volume and scenario diversity introduced challenges in latency, accuracy, and system stability.
1. New Issues
Growing decision-call volume and data dimensionality demand higher performance, and different scenarios trade off differently: some require second-level latency, while others can spend minutes of processing for greater precision.
2. Decision Timeliness vs Accuracy
Real‑time decisions prioritize speed and tolerate data-source failures by falling back to defaults; near‑real‑time decisions allow retries and longer latency in exchange for higher accuracy.
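The contrast can be sketched as two calling conventions around the same (here simulated) flaky upstream feature call; function names and the retry/backoff values are illustrative:

```python
import time

def flaky_fetch(fail_times):
    """Simulate an upstream feature call that times out `fail_times` times."""
    calls = {"n": 0}
    def fetch():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise TimeoutError("upstream feature timed out")
        return 0.9
    return fetch

def real_time(fetch, default=0.5):
    # Speed first: on failure, answer immediately with a default value.
    try:
        return fetch()
    except TimeoutError:
        return default

def near_real_time(fetch, retries=3):
    # Accuracy first: retry with backoff; latency budget is looser.
    for _ in range(retries):
        try:
            return fetch()
        except TimeoutError:
            time.sleep(0.01)
    raise TimeoutError("all retries exhausted")

print(real_time(flaky_fetch(1)))       # 0.5  (degraded default)
print(near_real_time(flaky_fetch(1)))  # 0.9  (accurate after retry)
```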
3. Feature Computation Strategies
We employ real‑time, pre‑computation, batch, and hybrid calculations to balance latency and correctness, using CDC + Kafka + Flink for efficient pre‑computation.
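The pre-computation idea is that a change stream (CDC into Kafka, aggregated by Flink in production) keeps per-key sliding-window aggregates warm, so the decision path does a key lookup instead of scanning raw events. A minimal in-memory Python sketch of the aggregate such a job maintains:

```python
from collections import defaultdict, deque

class SlidingCounter:
    """Per-user event count over a sliding time window (seconds)."""

    def __init__(self, window_secs):
        self.window = window_secs
        self.events = defaultdict(deque)

    def on_event(self, user, ts):
        # Called for each change event; evict entries outside the window.
        q = self.events[user]
        q.append(ts)
        while q and q[0] <= ts - self.window:
            q.popleft()

    def count(self, user):
        # Cheap lookup at decision time; no raw-event scan needed.
        return len(self.events[user])

c = SlidingCounter(window_secs=3600)
for ts in (0, 100, 4000):       # two early events, one an hour+ later
    c.on_event("u1", ts)
print(c.count("u1"))            # 1 (the first two fell out of the window)
```

The production version differs in the essentials of scale (keyed state, watermarks, fault tolerance), but the read path is the same: decisions consume a pre-computed value.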
4. Reliability Monitoring
A multi‑layer monitoring system (business, application, system, tracing) provides real‑time dashboards, alerts on rule hit rates, feature missing rates, and decision pass rates, and supports traffic replay for offline testing.
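The business-layer metrics named above can be sketched as rolling counters from which rule hit rate and feature missing rate are derived, with a naive threshold alert; class and method names are illustrative:

```python
class RiskMetrics:
    """Rolling counters for business-layer risk monitoring."""

    def __init__(self):
        self.decisions = 0
        self.rule_hits = 0
        self.feature_lookups = 0
        self.feature_missing = 0

    def record(self, rule_hit, features):
        # Called once per decision with its rule outcome and feature map.
        self.decisions += 1
        self.rule_hits += bool(rule_hit)
        self.feature_lookups += len(features)
        self.feature_missing += sum(v is None for v in features.values())

    def hit_rate(self):
        return self.rule_hits / max(self.decisions, 1)

    def missing_rate(self):
        return self.feature_missing / max(self.feature_lookups, 1)

    def alerts(self, max_missing=0.05):
        # A sudden spike in missing features often signals an upstream outage.
        return ["feature_missing_rate"] if self.missing_rate() > max_missing else []

m = RiskMetrics()
m.record(True,  {"amount": 1, "device_id": None})
m.record(False, {"amount": 2, "device_id": "d-7"})
print(m.hit_rate(), m.missing_rate(), m.alerts())
```

Dashboards and alert rules are then thresholds and trends over these rates, and traffic replay re-feeds recorded decisions through `record` offline.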
03 Phase 3: Commercialization
After two years of production refinement, the platform is packaged as a SaaS offering (“Magic Cube”), supporting localized deployment, extensibility, and delivering risk‑control capabilities to external customers.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.