
Building a Scalable Business Monitoring System: Architecture, Modules & Lessons

This article presents a comprehensive case study of a business monitoring system, covering its background, architectural analysis, module design, time‑series database selection, visualization with Grafana, alerting strategies, decision‑making logic, and intelligent monitoring experiments, followed by key takeaways and lessons learned.

Qudian (formerly Qufenqi) Technology Team

Background

A network diagram illustrates the state of the online business system, highlighting the need for a robust monitoring solution.

Outline

Business monitoring system architecture analysis

Design and optimization of monitoring modules

Attempts at intelligent monitoring

Business Monitoring System Architecture

There is no perfect architecture; every design is a result of trade‑offs.

Design Background

Incomplete monitoring items; need rapid implementation

Frequent operational activities cause alarm fatigue

Lack of real‑time reference during online adjustments

Mainstream Architectures

Case Studies

Alibaba:

Mogujie:

Characteristics

Core keywords: massive scale, real‑time

Focus on big‑data processing; weak alarm analysis

Monitoring staff shortage and lack of experience

Question: Should big data be part of the monitoring system?

Current Monitoring Architecture at Qudian

Built on existing business monitoring development and resources

Modules split via queues for easy upgrades

Leverages excellent open‑source software
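The queue-based split can be pictured with a small sketch. It uses Python's in-process `queue.Queue` purely for illustration; the stage names, the metric names, and the floor threshold are assumptions, and the source does not say which queue technology sits between the real modules.

```python
import queue

def sampler(out_q, readings):
    """Sampling stage: push raw readings onto its output queue."""
    for metric, value in readings:
        out_q.put({"metric": metric, "value": value})
    out_q.put(None)  # sentinel: no more samples

def decider(in_q, out_q, floors):
    """Decision stage: read samples, emit an alert when a value is below its floor."""
    while True:
        sample = in_q.get()
        if sample is None:
            out_q.put(None)  # pass the sentinel downstream
            break
        floor = floors.get(sample["metric"])
        if floor is not None and sample["value"] < floor:
            out_q.put(f"ALERT {sample['metric']}={sample['value']} (floor {floor})")

def drain(q):
    """Collect everything from a queue up to the sentinel."""
    items = []
    while True:
        item = q.get()
        if item is None:
            return items
        items.append(item)

samples_q, alerts_q = queue.Queue(), queue.Queue()
sampler(samples_q, [("orders_per_min", 120), ("orders_per_min", 35), ("logins_per_min", 900)])
decider(samples_q, alerts_q, {"orders_per_min": 50})
alerts = drain(alerts_q)
print(alerts)
```

Because each stage only knows about the queue it reads from and the queue it writes to, a stage can be swapped for a better implementation without touching its neighbors.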

Monitoring Module Design and Optimization

Each module can be replaced by a better solution at any time.

Sampling Module

Sources: SQL, API, ElasticSearch (real‑time logs), etc.

Execution: crontab schedule, Laravel queue tasks

Issues

Slow collection from one source can stall or fail the whole pipeline

Sampling queries against large tables can trigger a performance avalanche

Need additional monitoring for collection itself
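One possible way to contain the first and third issues is to run every collector under a shared deadline and to report failures as data. The sketch below is an assumption-laden illustration, not the team's implementation: the collector functions and the 0.1 s deadline are hypothetical, and the original system schedules its collectors with crontab and Laravel queue tasks rather than Python threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def collect_all(collectors, timeout_s):
    """Run each collector in its own thread and stop waiting after timeout_s."""
    results, failures = {}, {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(collectors)))
    started = time.monotonic()
    futures = {name: pool.submit(fn) for name, fn in collectors.items()}
    for name, future in futures.items():
        # All collectors share one deadline measured from the common start.
        remaining = max(0.0, timeout_s - (time.monotonic() - started))
        try:
            results[name] = future.result(timeout=remaining)
        except FutureTimeout:
            failures[name] = "timeout"
        except Exception as exc:
            failures[name] = f"error: {exc}"
    pool.shutdown(wait=False)  # do not block the pipeline on stragglers
    return results, failures

def orders_from_sql():   # hypothetical fast source
    return 42

def credit_from_api():   # hypothetical slow source
    time.sleep(0.5)
    return 7

def logs_from_es():      # hypothetical failing source
    raise RuntimeError("connection refused")

results, failures = collect_all(
    {"orders": orders_from_sql, "credit": credit_from_api, "logs": logs_from_es},
    timeout_s=0.1,
)
print(results, failures)
```

The `failures` mapping can itself be written out as a metric, which gives the collection layer its own monitoring. Note that a timed-out thread keeps running in the background; the sketch only stops waiting for it.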

Storage & Compute Module

Time‑Series Database (TSDB)

A time-series database (TSDB) is specialized for storing and querying timestamped data. It offers high compression and fast time-range queries, which also makes it a common choice for IoT scenarios.

Key Features

Efficient time‑dimension queries

Convenient down‑sampling

Massive storage capacity

Automatic expiration handling
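Two of these features, down-sampling and automatic expiration, are easy to illustrate outside any database. The following is a minimal sketch of the ideas in plain Python, assuming points are `(unix_seconds, value)` pairs; a real TSDB performs both steps internally and far more efficiently.

```python
from collections import defaultdict

def downsample_mean(points, bucket_s):
    """Group (unix_ts, value) points into fixed-width buckets and average each one."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_s].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]

def expire(points, now_ts, retention_s):
    """Keep only the points that are still inside the retention window."""
    cutoff = now_ts - retention_s
    return [(ts, v) for ts, v in points if ts >= cutoff]

raw = [(0, 10.0), (20, 20.0), (40, 30.0), (60, 40.0), (80, 60.0), (130, 5.0)]
per_minute = downsample_mean(raw, bucket_s=60)
kept = expire(per_minute, now_ts=180, retention_s=120)
print(per_minute)  # [(0, 20.0), (60, 50.0), (120, 5.0)]
print(kept)        # [(60, 50.0), (120, 5.0)]
```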

TSDB Ranking

Qudian's Choice – InfluxDB

No external dependencies

Quick to adopt

Elegant RESTful API

Powerful SQL‑like query language

Horizontal scalability

Written in Go

InfluxQL Example

```sql
-- demo 1: general shape of a query
SELECT <stuff> FROM <measurement_name> WHERE <some_conditions>

-- demo 2: every point in a measurement
SELECT * FROM "foodships"

-- demo 3: filter on a tag value
SELECT * FROM "foodships" WHERE "planet"='Saturn'

-- demo 4: filter on a tag value and an absolute time
SELECT * FROM "foodships" WHERE "planet"='Saturn' AND time > '2015-04-16 12:00:01'

-- demo 5: filter on a relative time (the last hour)
SELECT * FROM "foodships" WHERE time > now() - 1h
```

Issues Encountered

The clustering feature is no longer open source (community projects are working on alternatives)

Single point of failure (addressed with InfluxDB Relay)

Why is InfluxDB more efficient than MySQL?

Solution for Single‑Point Issue

Write multiple copies of data to maintain high availability.
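The multi-copy idea can be sketched as a small fan-out writer. This is written in the spirit of a relay, not as a description of how InfluxDB Relay works internally; the backend names, the `min_acks` parameter, and the writer callables are all hypothetical stand-ins for real database clients.

```python
def write_to_all(point, backends, min_acks=1):
    """Send one point to every backend; report success if min_acks accepted it."""
    acked, failed = [], []
    for name, write in backends.items():
        try:
            write(point)
            acked.append(name)
        except Exception:
            failed.append(name)  # a real relay would buffer and retry here
    return len(acked) >= min_acks, acked, failed

store_a = []  # stands in for a healthy database node

def write_b_down(point):
    """Stands in for a node that is currently unreachable."""
    raise ConnectionError("influx-b unreachable")

ok, acked, failed = write_to_all(
    {"measurement": "orders", "value": 120, "ts": 1},
    {"influx-a": store_a.append, "influx-b": write_b_down},
)
print(ok, acked, failed)
```

A write that lands on only some copies leaves the replicas out of sync, so failed writes have to be buffered and replayed once the node recovers.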

Visualization Module

Open‑Source Project Grafana

Beautiful UI

Comprehensive API support

Rich data‑source integration

Complete reporting features

Basic Concepts

Data source – provides time‑series data for Grafana

Organization – multiple orgs can share a single Grafana instance

User – can belong to one or more orgs with different permissions

Row – panel grouping in dashboards

Panel – basic display unit with its own query editor

Query editor – exposes data‑source capabilities

Dashboard – composition of panels for final visualization

Supported Data Sources

Graphite

ElasticSearch

CloudWatch

InfluxDB

OpenTSDB

KairosDB

Prometheus

Issues with Grafana

Default storage is SQLite, which is a single point of failure

Graphs show gaps when data points arrive later than the expected collection interval

Alert Notification Module

Design Features

Multiple notification channels (SMS, email, phone)

Flexible notification policies

Group‑based recipient management
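One way to read these three features together is as a lookup from severity to an ordered list of channels, applied to every member of a recipient group. The sketch below assumes exactly that; the group name, the policy table, and the channel functions are hypothetical, and real channels would call an SMS, email, or phone gateway.

```python
def notify(alert_text, severity, group, groups, channels, policy):
    """For each group member, try the channels for this severity until one works."""
    delivered, undelivered = {}, []
    for person in groups[group]:
        for channel_name in policy[severity]:
            try:
                channels[channel_name](person, alert_text)
            except Exception:
                continue  # this channel failed, try the next one
            delivered[person] = channel_name
            break
        else:
            undelivered.append(person)  # every channel failed for this person
    return delivered, undelivered

sent = []

def sms(person, text):
    raise ConnectionError("SMS gateway down")  # simulated outage

def phone(person, text):
    sent.append(("phone", person, text))

def email(person, text):
    sent.append(("email", person, text))

groups = {"payments-oncall": ["alice", "bob"]}
policy = {"critical": ["sms", "phone", "email"], "warning": ["email"]}
channels = {"sms": sms, "phone": phone, "email": email}

delivered, undelivered = notify(
    "order volume dropped", "critical", "payments-oncall", groups, channels, policy
)
print(delivered, undelivered)
```

Keeping recipients in `groups` rather than in each alert definition means a staffing change touches one table only.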

Problems Encountered

SMS/email failures (critical metrics need multiple channels)

Duplicate alarms need rate limiting to reduce noise

Complicated onboarding/offboarding of personnel
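Rate limiting duplicate alarms can be as simple as remembering when each alarm key was last sent. The following is a minimal sketch under that assumption; the class name, the key format, and the 300-second window are illustrative choices, not values from the original system.

```python
class AlarmThrottle:
    """Suppress repeats of the same alarm key inside a fixed time window."""

    def __init__(self, window_s):
        self.window_s = window_s
        self._last_sent = {}  # alarm key -> time of the last notification

    def should_send(self, alarm_key, now_s):
        last = self._last_sent.get(alarm_key)
        if last is not None and now_s - last < self.window_s:
            return False  # same alarm fired recently, stay quiet
        self._last_sent[alarm_key] = now_s
        return True

throttle = AlarmThrottle(window_s=300)
decisions = [
    throttle.should_send("orders_low", 0),     # first occurrence
    throttle.should_send("orders_low", 60),    # repeat inside the window
    throttle.should_send("logins_low", 60),    # different alarm, independent window
    throttle.should_send("orders_low", 300),   # window has elapsed
]
print(decisions)  # [True, False, True, True]
```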

Abnormal Decision Module

Challenges

Business monitoring issues are harder to define than system monitoring

Frequent promotional activities increase definition difficulty

Higher monitoring requirements in internet finance

Decision Strategies

1) Sample‑Based Comparison

Use a 7-day sample and take the average after removing the extreme values

When data volume is small, randomness is high and false alarms follow

Adjust statistics period or strategy to mitigate
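The sample-based strategy can be sketched as a trimmed mean plus a tolerance band. The choice to drop exactly one minimum and one maximum, the 30% tolerance, and the sample values are all assumptions made for illustration; the source does not state the exact trimming rule or threshold.

```python
def trimmed_mean(samples):
    """Average after dropping the single smallest and single largest value."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples to trim both extremes")
    core = sorted(samples)[1:-1]
    return sum(core) / len(core)

def is_abnormal(current, history, tolerance=0.3):
    """Flag the current value if it deviates from the baseline by more than tolerance."""
    baseline = trimmed_mean(history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > tolerance

# Same time slot on each of the previous 7 days (hypothetical values):
# 250 is a promotion day, 20 is a partial outage.
history = [100, 104, 98, 250, 101, 97, 20]
baseline = trimmed_mean(history)
print(baseline)                   # 100.0
print(is_abnormal(60, history))   # True: 40% below baseline
print(is_abnormal(110, history))  # False: 10% above baseline
```

With low-volume metrics the baseline itself is noisy, so a fixed tolerance like this will fire often; widening the time slot is one way to smooth it.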

2) Forecast‑Based Comparison

Assumes normal curves have no sudden spikes

Uses a grey prediction model, which needs only a small amount of data and gave high accuracy

Issues: inherent spikes, slow‑changing anomalies – mitigated by multi‑dimensional monitoring
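The source does not say which grey model is used. The sketch below assumes GM(1,1), the most common variant, and fits it with an ordinary least-squares line; the input series is hypothetical. It shows why little data is needed: a handful of recent points is enough to extrapolate the next one.

```python
import math

def gm11_forecast(series, steps=1):
    """Fit a GM(1,1) grey model to a positive series and forecast the next values."""
    n = len(series)
    if n < 4:
        raise ValueError("GM(1,1) is usually fitted on at least 4 points")
    # 1) Accumulated series: acc[k] = series[0] + ... + series[k]
    acc, total = [], 0.0
    for x in series:
        total += x
        acc.append(total)
    # 2) Background values: mean of neighboring accumulated points
    z = [0.5 * (acc[k] + acc[k - 1]) for k in range(1, n)]
    y = list(series[1:])
    # 3) Least squares for series[k] = -a * z[k] + b
    m = len(z)
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zv * yv for zv, yv in zip(z, y))
    denom = m * szz - sz * sz
    if denom == 0:
        raise ValueError("degenerate series, cannot fit")
    slope = (m * szy - sz * sy) / denom
    b = (sy - slope * sz) / m
    a = -slope

    def acc_hat(k):
        """Fitted accumulated value at index k."""
        if abs(a) < 1e-12:
            return series[0] + b * k  # limit case: no growth or decay
        return (series[0] - b / a) * math.exp(-a * k) + b / a

    # 4) Undo the accumulation to get forecasts of the original series
    return [acc_hat(k) - acc_hat(k - 1) for k in range(n, n + steps)]

recent = [100, 110, 121, 133.1, 146.41]  # steady 10% growth
forecast = gm11_forecast(recent)[0]
actual = 95.0                             # hypothetical observed value
gap = abs(actual - forecast) / forecast
print(round(forecast, 1), round(gap, 2))
```

An observed value far from the forecast is treated as a candidate anomaly. A metric with legitimate spikes, or one that drifts slowly, defeats this check on its own, which is why the text pairs it with multi-dimensional monitoring.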

Intelligent Monitoring Attempts

Establish relationships between metrics.

Rule engine

Neural network

Rule Engine

Purpose

Externalize rules for reuse and avoid code changes

Reasoning performed by engine, reducing complex logic code

Developers focus on business logic instead of rule implementation

Example

IF: login count increases, order volume rises, new‑user credit pass rate drops, credit application count rises

THEN: trigger user recall activity
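The example rule can be expressed as data and evaluated by a few lines of generic code, which is the point of externalizing rules. This is a minimal sketch, not a real rule engine: the rule format, the metric keys, and the `up`/`down` encoding are assumptions made for illustration.

```python
# Rules live in data (they could equally be loaded from a file or a database).
RULES = [
    {
        "name": "user_recall",
        "when": {
            "login_count": "up",
            "order_volume": "up",
            "new_user_credit_pass_rate": "down",
            "credit_application_count": "up",
        },
        "then": "trigger user recall activity",
    },
]

def trend(previous, current):
    """Classify the movement of one metric between two observations."""
    if current > previous:
        return "up"
    if current < previous:
        return "down"
    return "flat"

def evaluate(rules, trends):
    """Return the action of every rule whose conditions all match."""
    return [
        rule["then"]
        for rule in rules
        if all(trends.get(metric) == direction
               for metric, direction in rule["when"].items())
    ]

observed = {
    "login_count": trend(1000, 1200),
    "order_volume": trend(300, 340),
    "new_user_credit_pass_rate": trend(0.62, 0.48),
    "credit_application_count": trend(500, 650),
}
actions = evaluate(RULES, observed)
print(actions)  # ['trigger user recall activity']
```

Adding or changing a rule means editing `RULES`, not `evaluate`.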

Neural Network

Starting point: handwritten digit recognition on MNIST.

Download MNIST data

Define model

Train model

Validate model

Actual application in Qudian’s monitoring system:

Summary, Experience, Lessons

Online issues are always the most urgent and important

Abnormal judgment is essentially a classification problem

Design must anticipate limitations of decision methods (e.g., “blind men feeling an elephant”)

Business monitoring requires continuous operation, optimization, and joint maintenance with business teams

A complete abnormal handling process is essential

The Qudian Group monitoring system now covers registration, login, order, credit, risk control, loan issuance, repayment, etc., and will continue to optimize for completeness, accuracy, and timeliness, ensuring stable online services for users.