How Jiangsu Mobile Built a Billion‑Call Real‑Time Monitoring Platform with Prometheus
Facing the explosion of 5G traffic and billions of daily call records, Jiangsu Mobile’s IT operations team adopted Prometheus as the core time‑series database, designing a high‑availability, low‑latency monitoring platform that captures, stores, visualizes and predicts performance metrics across their massive billing system.
Background
The rapid growth of traffic services and the arrival of 5G have pushed the business support systems to billions of billing records per day. Traditional monitoring tools cannot deliver the real-time accuracy and throughput this volume requires, which makes monitoring a bottleneck for operations.
Time‑Series Database Selection
Prometheus, Graphite, InfluxDB and OpenTSDB were compared. Prometheus was selected because a single instance can ingest millions of samples per second, offers sub-second query latency, compresses each 16-byte sample (an 8-byte timestamp plus an 8-byte value) to an average of 1.37 bytes, and keeps disk I/O load below 1 % during real-time queries.
Performance‑Management Platform Architecture
The monitoring stack is built around Prometheus and consists of the following components:
Two independent Prometheus clusters, one in each data‑center site, providing redundancy and load distribution.
System, application and Java log data are collected via pull scrapes. Performance and business metrics are first pushed to a Pushgateway, which acts as a temporary buffer, and are then scraped by Prometheus.
Recent data are stored in the native Prometheus TSDB for fast alerting; a copy is forwarded to a remote long‑term store (InfluxDB) for historical analysis.
Load balancers route visualization and alert queries to either the Prometheus cluster or the remote store, ensuring high availability.
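The push path in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the team's code: it renders a sample in the Prometheus text exposition format and builds the Pushgateway grouping URL of the form /metrics/job/<job>/instance/<instance>. The metric name, label names and host name are hypothetical, and the sketch assumes job and instance values contain no slashes.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def _escape(value):
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def exposition_line(name, labels, value):
    """Render one sample as a single line of the text exposition format."""
    if not labels:
        return f"{name} {value}\n"
    body = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
    return f"{name}{{{body}}} {value}\n"


def pushgateway_url(base, job, instance):
    """Build the grouping-key URL for one job/instance pair."""
    job_part = quote(job, safe="")
    instance_part = quote(instance, safe="")
    return f"{base.rstrip('/')}/metrics/job/{job_part}/instance/{instance_part}"


def push(url, payload, timeout=5.0):
    """POST the payload to the Pushgateway (needs a reachable gateway)."""
    request = Request(url, data=payload.encode("utf-8"), method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical metric and host names, for illustration only.
    payload = exposition_line(
        "boss_call_records_processed", {"app": "billing", "site": "a"}, 1250
    )
    url = pushgateway_url("http://pushgateway.example:9091", "billing", "node-1")
    print(url)
    print(payload, end="")
```

The push function is defined but not called, because it needs a live gateway. Prometheus then scrapes the gateway like any other target.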
High‑Availability and Storage Enhancements
Leader election for HA: Each Prometheus node tries to acquire a distributed lock at startup. The node that obtains the lock becomes the active leader; if the leader fails, another node acquires the lock and takes over, which removes the single point of failure.
Hybrid storage: Short‑term metrics stay in Prometheus for real‑time alerts, while InfluxDB receives a replicated stream for long‑term retention and downstream data‑mining.
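Pushgateway de-duplication: A stock Pushgateway keeps pushed metrics until they are explicitly deleted, so every scrape would return the same values again. In this platform the gateway deletes pushed data once Prometheus has scraped it, so each pushed metric is ingested only once.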
Custom visualization: Grafana’s native plugins were insufficient for multi‑dimensional dashboards, so a bespoke visualization tool was developed to display system, application and business metrics in a unified view.
Timezone correction: Prometheus works in UTC by default. The team modified the source code that reads the system clock so that timestamps use Beijing time (UTC+8).
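The lock-based leader election described in the list above can be illustrated with a small sketch. The article does not say which lock service was used, so the code replaces it with an in-memory lease that has a time-to-live; a real deployment would use a distributed coordinator. The LeaseLock and Node classes are hypothetical names chosen for this illustration.

```python
import time


class LeaseLock:
    """In-memory stand-in for a distributed lock with a time-to-live (assumption)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id):
        """Grant or renew the lease if it is free, expired, or already ours."""
        now = self.clock()
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + self.ttl
            return True
        return False


class Node:
    """A monitoring node that is active only while it holds the lease."""

    def __init__(self, node_id, lock):
        self.node_id = node_id
        self.lock = lock
        self.is_leader = False

    def heartbeat(self):
        """Call periodically: the leader renews, standbys try to take over."""
        self.is_leader = self.lock.try_acquire(self.node_id)
        return self.is_leader


if __name__ == "__main__":
    lock = LeaseLock(ttl_seconds=10)
    primary, standby = Node("prom-a", lock), Node("prom-b", lock)
    print(primary.heartbeat(), standby.heartbeat())
```

If the leader stops sending heartbeats, its lease expires after the time-to-live and the next heartbeat from a standby node succeeds, which is the takeover behaviour the article describes.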
Metric Collection Scope
Performance metrics: CPU, memory, I/O, latency, request rates, etc.
Business metrics: Call volume, processing throughput, service invocation counts, response times, and other domain‑specific indicators.
Real‑Time Dashboard
Aggregated metrics are presented as a unified health view for the BOSS system, covering application performance, business volume, service call counts and response times. Users can drill down by dimension, application or process to obtain instant operational insight, achieving a “one‑picture” monitoring experience.
Trend Prediction and Anomaly Detection
The massive time‑series dataset enables several analytical scenarios:
Performance prediction: Real‑time monitoring combined with historical comparison automatically estimates maximum processing speed and predicts the time required to handle pending call records.
Business trend forecasting: Daily, weekly and monthly aggregations (average, weighted average, moving average, weighted moving average, percentile statistics) are applied to forecast future call‑processing trends and resource utilization for capacity planning.
Anomaly detection: Algorithms evaluate period-over-period changes, deviation from the mean measured in standard deviations, local fluctuations and cyclical patterns to flag abnormal business behavior promptly.
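The article names these techniques without giving formulas. The sketch below shows one plausible reading of three of them: a backlog estimate from the maximum processing rate, a weighted moving average, and a mean and standard deviation check. The function names and the threshold of 3 standard deviations are assumptions made for illustration, not details from the platform.

```python
from statistics import mean, pstdev


def backlog_eta_seconds(pending_records, max_records_per_second):
    """Estimate the time needed to clear pending records at the peak observed rate."""
    if max_records_per_second <= 0:
        raise ValueError("max_records_per_second must be positive")
    return pending_records / max_records_per_second


def weighted_moving_average(values, weights):
    """Weighted average of the last len(weights) values; weights[-1] applies to the newest."""
    if not weights or len(values) < len(weights):
        raise ValueError("need at least one weight and len(values) >= len(weights)")
    window = values[-len(weights):]
    return sum(v * w for v, w in zip(window, weights)) / sum(weights)


def period_over_period_change(current, previous):
    """Relative change against the same point in the previous period."""
    return (current - previous) / previous


def is_anomalous(history, latest, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


if __name__ == "__main__":
    print(backlog_eta_seconds(1_000_000, 50_000))           # 20.0 seconds
    print(weighted_moving_average([10, 20, 30], [1, 2, 3]))
    print(is_anomalous([100, 102, 98, 101, 99], 130))       # True
```

Cyclical patterns and local fluctuations would need a seasonal baseline, for example comparing each value with the same hour of the previous week, which this sketch does not attempt.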
Performance and Capacity
The platform currently ingests up to 100 k metrics per second, supporting real-time monitoring of a system that processes billions of call records. Continuous analysis of this data supports capacity planning, performance analysis and fault localization, as well as proactive mitigation.
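As a rough sanity check, the ingest rate can be combined with the compression figure quoted in the database comparison (16 bytes raw, 1.37 bytes on average after compression). The calculation below assumes that "100 k metrics per second" means 100,000 samples per second sustained for a full day, and it ignores indexes and other overhead, so it is an estimate rather than a measured value.

```python
SECONDS_PER_DAY = 86_400


def daily_bytes(samples_per_second, bytes_per_sample):
    """Bytes written per day at a constant ingest rate."""
    return samples_per_second * bytes_per_sample * SECONDS_PER_DAY


if __name__ == "__main__":
    rate = 100_000  # samples per second (assumed reading of the article's figure)
    raw = daily_bytes(rate, 16)
    compressed = daily_bytes(rate, 1.37)
    print(f"raw:        {raw / 1e9:.2f} GB/day")         # 138.24 GB/day
    print(f"compressed: {compressed / 1e9:.2f} GB/day")  # 11.84 GB/day
```

Under these assumptions the compressed stream is about 12 GB per day instead of about 138 GB.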
dbaplus Community
