Building a Scalable Business Monitoring System: Architecture, Modules & Lessons
This article presents a comprehensive case study of a business monitoring system, covering its background, architectural analysis, module design, time‑series database selection, visualization with Grafana, alerting strategies, decision‑making logic, and intelligent monitoring experiments, followed by key takeaways and lessons learned.
Background
A network diagram illustrates the state of the online business system, highlighting the need for a robust monitoring solution.
Outline
Business monitoring system architecture analysis
Design and optimization of monitoring modules
Attempts at intelligent monitoring
Business Monitoring System Architecture
There is no perfect architecture; every design is a result of trade‑offs.
Design Background
Monitoring coverage was incomplete and had to be filled in quickly
Frequent operational activities cause alarm fatigue
Lack of real‑time reference during online adjustments
Mainstream Architectures
Case Studies
Alibaba:
Mogujie:
Characteristics
Core keywords: massive scale, real‑time
Focus on big‑data processing; weak alarm analysis
Monitoring staff shortage and lack of experience
Question: Should big data be part of the monitoring system?
Current Monitoring Architecture at Qudian
Built on existing business monitoring development and resources
Modules split via queues for easy upgrades
Leverages excellent open‑source software
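One way to picture modules split via queues: each stage reads from an input queue and writes to an output queue, so a stage can be replaced without touching its neighbors. The sketch below uses Python's in-process queue.Queue as a stand-in for a real message queue. The stage names, the message shape, and the threshold are hypothetical and are not taken from Qudian's system.

```python
import queue

def sample_stage(out_q, readings):
    """Sampling module: push raw (metric, value) readings onto the queue."""
    for metric, value in readings:
        out_q.put({"metric": metric, "value": value})
    out_q.put(None)  # sentinel: no more data

def decide_stage(in_q, out_q, threshold):
    """Decision module: forward only readings that fall below the threshold."""
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)  # pass the sentinel downstream
            break
        if item["value"] < threshold:
            out_q.put({**item, "alert": True})

def notify_stage(in_q):
    """Notification module: collect alerts (a real one would send SMS or email)."""
    sent = []
    while True:
        item = in_q.get()
        if item is None:
            break
        sent.append(f"ALERT {item['metric']}={item['value']}")
    return sent

def run_pipeline(readings, threshold=100):
    """Run the three stages in order; only the queues connect them."""
    raw_q, alert_q = queue.Queue(), queue.Queue()
    sample_stage(raw_q, readings)
    decide_stage(raw_q, alert_q, threshold)
    return notify_stage(alert_q)
```

Because the stages share nothing but the queue contents, swapping decide_stage for a different strategy only requires keeping the message shape stable.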
Monitoring Module Design and Optimization
Each module can be replaced by a better solution at any time.
Sampling Module
Sources: SQL, API, ElasticSearch (real‑time logs), etc.
Execution: crontab schedule, Laravel queue tasks
Issues
A single slow collection job can stall the whole pipeline
Queries against very large tables can trigger a performance avalanche
Need additional monitoring for collection itself
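The issues above suggest giving every collector a deadline and recording its outcome, so that one slow source cannot block the rest and the collection step itself becomes observable. The following is a minimal sketch under that assumption. The function collect_all and its health report are hypothetical names, and the article's actual collectors run as crontab and Laravel queue tasks, not Python threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def collect_all(collectors, timeout_s=1.0):
    """Run each collector with a shared deadline; return (results, health).

    `collectors` maps a metric name to a zero-argument callable.
    `health` records the outcome per collector so that the collection
    step can be monitored like any other metric.
    """
    results, health = {}, {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(collectors)))
    started = time.monotonic()
    futures = {name: pool.submit(fn) for name, fn in collectors.items()}
    for name, fut in futures.items():
        remaining = max(0.0, timeout_s - (time.monotonic() - started))
        try:
            results[name] = fut.result(timeout=remaining)
            health[name] = "ok"
        except FutureTimeout:
            health[name] = "timeout"          # give up waiting, keep going
        except Exception as exc:
            health[name] = f"error: {exc}"    # collector raised
    pool.shutdown(wait=False)  # do not block on stragglers
    return results, health
```

Note that a timed-out collector keeps running in its thread; the sketch only stops waiting for it. Truly cancelling a slow SQL query needs a timeout on the database side as well.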
Storage & Compute Module
Time‑Series Database (TSDB)
A TSDB is specialized for managing time‑series data, offering high compression, fast queries, and suitability for IoT scenarios.
Key Features
Efficient time‑dimension queries
Convenient down‑sampling
Massive storage capacity
Automatic expiration handling
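To make down-sampling and expiration concrete, here is a small pure-Python sketch of both operations on (timestamp, value) pairs. In InfluxDB these correspond to GROUP BY time(...) queries and retention policies; the function names below are illustrative only and do not show how a TSDB implements them internally.

```python
from collections import defaultdict

def downsample(points, bucket_s):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_s].append(value)  # bucket start time
    return [(start, sum(v) / len(v)) for start, v in sorted(buckets.items())]

def expire(points, now, retention_s):
    """Drop points older than the retention window."""
    return [(ts, v) for ts, v in points if now - ts <= retention_s]
```

For example, four points at 0 s, 30 s, 60 s, and 90 s collapse into two one-minute averages with downsample(points, 60).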
TSDB Ranking
Qudian's Choice – InfluxDB
No external dependencies
Quick to adopt
Elegant RESTful API
Powerful SQL‑like query language
Horizontal scalability
Written in Go
InfluxQL Example
```sql
-- demo 1: general form
SELECT <stuff> FROM <measurement_name> WHERE <some_conditions>

-- demo 2
SELECT * FROM "foodships"

-- demo 3
SELECT * FROM "foodships" WHERE "planet"='Saturn'

-- demo 4
SELECT * FROM "foodships" WHERE "planet"='Saturn' AND time > '2015-04-16 12:00:01'

-- demo 5
SELECT * FROM "foodships" WHERE time > now() - 1h
```
Issues Encountered
Clustering is no longer open source (community projects are working on alternatives)
Single point of failure (addressed with InfluxDB Relay)
Why is InfluxDB more efficient than MySQL?
Solution for Single‑Point Issue
Write multiple copies of data to maintain high availability.
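The idea of writing multiple copies can be sketched as a fan-out writer that sends each point to every backend and reports which ones failed. This is a hypothetical illustration of the pattern, not the implementation of InfluxDB Relay. The backends here are plain callables that raise on failure, and the min_acks parameter is an assumption.

```python
def write_replicated(point, backends, min_acks=1):
    """Write one point to every backend; succeed if enough of them accept it.

    `backends` maps a name to a callable that raises on failure.
    Returns the names of the backends that failed, so the caller
    can buffer and retry those writes later.
    """
    acked, failed = [], []
    for name, write in backends.items():
        try:
            write(point)
            acked.append(name)
        except Exception:
            failed.append(name)
    if len(acked) < min_acks:
        raise RuntimeError(
            f"only {len(acked)} of {min_acks} required writes succeeded; failed: {failed}"
        )
    return failed
```

Reads can then be served from any healthy copy. The cost is that a backend which missed writes stays behind until the failed points are replayed.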
Visualization Module
Open‑Source Project Grafana
Beautiful UI
Comprehensive API support
Rich data‑source integration
Complete reporting features
Basic Concepts
Data source – provides time‑series data for Grafana
Organization – multiple orgs can share a single Grafana instance
User – can belong to one or more orgs with different permissions
Row – panel grouping in dashboards
Panel – basic display unit with its own query editor
Query editor – exposes data‑source capabilities
Dashboard – composition of panels for final visualization
Supported Data Sources
Graphite
ElasticSearch
CloudWatch
InfluxDB
OpenTSDB
KairosDB
Prometheus
Issues with Grafana
Default storage is SQLite, which makes Grafana itself a single point of failure
Graphs display incorrectly when data points arrive less often than the expected collection interval
Alert Notification Module
Design Features
Multiple notification channels (SMS, email, phone)
Flexible notification policies
Group‑based recipient management
Problems Encountered
SMS/email failures (critical metrics need multiple channels)
Duplicate alarms need rate limiting to reduce noise
Complicated onboarding/offboarding of personnel
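Two of the problems above, channel failures and duplicate alarms, can be handled in one small component: try channels in order until one succeeds, and suppress repeats of the same alert inside a cooldown window. The class below is a sketch under those assumptions. AlertNotifier, the alert key, and the 300-second cooldown are hypothetical choices, not details from the article.

```python
class AlertNotifier:
    """Suppress repeats of the same alert and fall back across channels."""

    def __init__(self, channels, cooldown_s=300):
        self.channels = channels        # ordered list of (name, send_fn)
        self.cooldown_s = cooldown_s
        self._last_sent = {}            # alert key -> time of last delivery

    def notify(self, key, message, now):
        """Return the channel used, or 'suppressed' / 'failed'."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            return "suppressed"         # same alert was delivered recently
        for name, send in self.channels:
            try:
                send(message)
            except Exception:
                continue                # this channel failed, try the next
            self._last_sent[key] = now
            return name
        return "failed"                 # every channel failed; not rate limited
```

Only a successful delivery starts the cooldown, so an alert that failed on every channel is retried on the next evaluation instead of being silently suppressed.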
Anomaly Decision Module
Challenges
Business monitoring issues are harder to define than system monitoring
Frequent promotional activities increase definition difficulty
Higher monitoring requirements in internet finance
Decision Strategies
1) Sample‑Based Comparison
Use a 7‑day sample and average it after removing the extremes
Small data volume leads to high randomness and false alarms
Adjust statistics period or strategy to mitigate
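The sample-based comparison can be sketched as a trimmed mean over the last seven daily values, followed by a relative-deviation check. The article does not say how many extremes are removed or what tolerance is used, so dropping one minimum and one maximum and the 30% tolerance are assumptions made for this sketch.

```python
def trimmed_mean(samples):
    """Average after dropping one minimum and one maximum value."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)

def is_anomalous(current, history, tolerance=0.3):
    """Flag `current` if it deviates from the baseline by more than `tolerance`."""
    baseline = trimmed_mean(history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > tolerance
```

With a history such as [100, 104, 98, 300, 101, 97, 20], the outliers 300 and 20 are discarded and the baseline is 100. When the metric itself is small, a few units of noise already exceed the tolerance, which is the false-alarm problem noted above.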
2) Forecast‑Based Comparison
Assumes normal curves have no sudden spikes
Uses a grey prediction model, which needs little data and gives high accuracy on smooth curves
Issues: inherent spikes, slow‑changing anomalies – mitigated by multi‑dimensional monitoring
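The article does not say which grey model is used. The sketch below assumes the common GM(1,1) form: accumulate the series, fit x0(k) = -a*z(k) + b by least squares on the background values z, and difference the fitted exponential to obtain forecasts. The function name and the guard conditions are choices made for this example.

```python
import math

def gm11_forecast(series, steps=1):
    """Fit a GM(1,1) grey model to `series` and forecast `steps` values ahead."""
    n = len(series)
    if n < 4:
        raise ValueError("GM(1,1) needs at least 4 points")
    if any(x <= 0 for x in series):
        raise ValueError("GM(1,1) assumes a positive series")
    # 1) Accumulated series: x1[k] = series[0] + ... + series[k]
    x1, total = [], 0.0
    for x in series:
        total += x
        x1.append(total)
    # 2) Background values: mean of consecutive accumulated values
    z = [(x1[k] + x1[k - 1]) / 2 for k in range(1, n)]
    y = series[1:]
    # 3) Ordinary least squares for y = slope * z + b, where slope = -a
    m = n - 1
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zv * yv for zv, yv in zip(z, y))
    slope = (m * szy - sz * sy) / (m * szz - sz * sz)
    a, b = -slope, (sy - slope * sz) / m
    if abs(a) < 1e-12:
        return [b] * steps  # flat series: the model degenerates to a constant
    # 4) Time response of the accumulated series (k is a 0-based index)
    def x1_hat(k):
        return (series[0] - b / a) * math.exp(-a * k) + b / a
    # 5) Difference back to the original scale
    return [x1_hat(k) - x1_hat(k - 1) for k in range(n, n + steps)]
```

An observed value can then be compared with the forecast using a relative tolerance. Because the fitted curve is a smooth exponential, a metric with legitimate spikes will be flagged, which matches the limitation listed above.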
Intelligent Monitoring Attempts
Establish relationships between metrics.
Rule engine
Neural network
Rule Engine
Purpose
Externalize rules for reuse and avoid code changes
Reasoning performed by engine, reducing complex logic code
Developers focus on business logic instead of rule implementation
Example
IF: login count increases, order volume rises, new‑user credit pass rate drops, credit application count rises
THEN: trigger user recall activity
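The example rule can be expressed as data so that adding or changing a rule needs no code change. In this sketch the metric keys, the "up"/"down" vocabulary, and the evaluate function are hypothetical. A production rule engine would offer richer conditions, but the externalization idea is the same.

```python
import operator

# Direction of change compared with the previous observation window
OPS = {"up": operator.gt, "down": operator.lt}

# Rules live in data (they could be loaded from a file or a database)
RULES = [
    {
        "name": "user_recall",
        "when": {
            "login_count": "up",
            "order_volume": "up",
            "new_user_credit_pass_rate": "down",
            "credit_application_count": "up",
        },
        "then": "trigger user recall activity",
    },
]

def evaluate(rules, previous, current):
    """Return the actions of every rule whose conditions all hold."""
    actions = []
    for rule in rules:
        if all(
            OPS[direction](current[metric], previous[metric])
            for metric, direction in rule["when"].items()
        ):
            actions.append(rule["then"])
    return actions
```

The engine only does the reasoning. Business teams can edit RULES without touching evaluate.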
Neural Network
A first experiment applied a neural network to handwritten digit recognition (MNIST):
Download MNIST data
Define model
Train model
Validate model
Actual application in Qudian’s monitoring system:
Summary, Experience, Lessons
Online issues are always the most urgent and important
Anomaly judgment is essentially a classification problem
Design must anticipate the limits of each decision method, since each one sees only part of the picture (the "blind men and the elephant" problem)
Business monitoring requires continuous operation, optimization, and joint maintenance with business teams
A complete anomaly‑handling process is essential
The Qudian Group monitoring system now covers registration, login, order, credit, risk control, loan issuance, repayment, etc., and will continue to optimize for completeness, accuracy, and timeliness, ensuring stable online services for users.
Qudian (formerly Qufenqi) Technology Team
Technology team focusing on architecture, service-oriented design, top-tier tools, automation platforms, end-to-end development solutions, talent cultivation, and engineer career growth.