Design and Implementation of a Hundred‑Billion‑Scale Real‑Time Monitoring System

This article details a hundred‑billion‑scale real‑time monitoring system. It outlines a layered architecture from collection to alerting, compares the Oceanus + Elastic Stack and Zabbix + Prometheus + Grafana solutions, and shows how targeted optimizations in stream processing and Elasticsearch achieve scalability, low latency, and significant cost savings.

Tencent Cloud Developer

This article explains the overall architecture design and technical implementation of a real‑time monitoring system capable of handling hundreds of billions of data points.

Why build a monitoring system? In the post‑mobile‑Internet era, user experience depends on stable services. Large internet companies (e.g., ride‑hailing platforms) require minute‑level detection and resolution of incidents. Distributed business systems consist of many components (databases, caches, message queues), any of which can fail, making comprehensive monitoring essential.

Monitoring system requirements include data collection, aggregation, analysis, storage, alerting, dashboard display, and high availability. The system should quickly locate anomalies and reduce mean time to repair.

The overall design abstracts the data flow into the following stages: collect → aggregate → process → store → analyze → display → alert. The architecture consists of these layers:

Data collection layer (agents, RPC tracing, HTTP push)

Data aggregation layer (message queues)

Data processing layer (cleaning, transformation, basic aggregation)

Data analysis layer (correlation, anomaly detection, fault diagnosis)

Data storage layer (log, metric, time‑series storage for dashboards)

Alerting layer (rule‑based notifications via phone, email, WeChat, SMS)

Dashboard layer (monitoring and alert panels)
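As an illustration of the alerting layer listed above, here is a minimal sketch of rule-based evaluation with per-rule notification channels. It is not the article's implementation: the `AlertRule` class, the `evaluate` function, the metric names, and the thresholds are all hypothetical, and actual delivery by phone, email, WeChat, or SMS is reduced to a printed line.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when a metric rises above a threshold."""
    name: str
    metric: str
    threshold: float
    channels: Tuple[str, ...]  # e.g. ("sms", "email")


def evaluate(rules: List[AlertRule], metrics: Dict[str, float]) -> List[Tuple[str, str, str]]:
    """Return (rule name, channel, message) for every breached rule."""
    notifications = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            # The metric was not reported in this window, so skip the rule.
            continue
        if value > rule.threshold:
            message = f"{rule.name}: {rule.metric}={value} exceeds {rule.threshold}"
            for channel in rule.channels:
                notifications.append((rule.name, channel, message))
    return notifications


# Hypothetical rules and one window of aggregated metrics.
RULES = [
    AlertRule("high_error_rate", "error_rate", 0.05, ("sms", "email")),
    AlertRule("slow_p99", "p99_latency_ms", 800.0, ("wechat",)),
]
window_metrics = {"error_rate": 0.08, "p99_latency_ms": 420.0}

alerts = evaluate(RULES, window_metrics)
for _, channel, message in alerts:
    print(f"[{channel}] {message}")
```

Keeping the rules as plain data separate from the evaluation loop means thresholds and channels can be changed without touching the processing code.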

Technology selection considers two solution families:

Oceanus + Elastic Stack – Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) provides log collection, indexing, and visualization. Oceanus (a Flink‑based stream computing service) handles real‑time data cleaning, transformation, and aggregation, feeding results back to Elasticsearch for alerting and to Grafana for dashboards.

Zabbix + Prometheus + Grafana – Zabbix offers distributed monitoring and customizable alerts; Prometheus supplies lightweight time‑series metric collection; Grafana provides unified visualization. This stack is simple to deploy but struggles with ultra‑large data volumes.

Each solution’s advantages and drawbacks are discussed (e.g., Beats’ low overhead vs. Logstash’s resource consumption; Prometheus’ metric‑only focus vs. Zabbix’s steep learning curve).
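The cleaning, transformation, and aggregation role that the first solution assigns to the stream computing layer can be sketched in plain Python. This is an in-memory illustration only, not Oceanus or Flink code: the event fields (`ts`, `service`, `latency_ms`, `error`), the 60-second window, and the output document shape are assumptions made for the example. A real streaming job would also need to handle event time and late data, which this sketch ignores.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size


def clean(event):
    """Drop malformed events and normalise the fields of valid ones."""
    if "ts" not in event or "service" not in event:
        return None
    return {
        "ts": int(event["ts"]),
        "service": event["service"].strip().lower(),
        "latency_ms": float(event.get("latency_ms", 0.0)),
        "error": bool(event.get("error", False)),
    }


def aggregate(events):
    """Group cleaned events into per-service tumbling windows."""
    buckets = defaultdict(lambda: {"count": 0, "errors": 0, "latency_sum": 0.0})
    for raw in events:
        event = clean(raw)
        if event is None:
            continue
        # Align the timestamp to the start of its window.
        window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
        bucket = buckets[(window_start, event["service"])]
        bucket["count"] += 1
        bucket["errors"] += int(event["error"])
        bucket["latency_sum"] += event["latency_ms"]

    # Emit one metric document per (window, service), ready for a storage layer.
    docs = []
    for (window_start, service), b in sorted(buckets.items()):
        docs.append({
            "window_start": window_start,
            "service": service,
            "count": b["count"],
            "error_rate": b["errors"] / b["count"],
            "avg_latency_ms": b["latency_sum"] / b["count"],
        })
    return docs


events = [
    {"ts": 120, "service": "Order ", "latency_ms": 100, "error": False},
    {"ts": 150, "service": "order", "latency_ms": 300, "error": True},
    {"ts": 185, "service": "order", "latency_ms": 50},
    {"service": "order"},  # malformed: no timestamp, dropped by clean()
]
docs = aggregate(events)
for doc in docs:
    print(doc)
```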

Oceanus optimizations include:

SQL performance improvements for large‑scale jobs.

Automatic handling of data skew via Local‑Global aggregation and mini‑batch processing.

UDF result caching to avoid repeated execution.

Bucket‑based join for dimension tables to reduce memory pressure.

Intelligent job diagnostics and visual alerts covering failures, snapshots, and resource anomalies.

Automatic job scaling based on CPU, memory, and back‑pressure metrics.
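The Local‑Global aggregation and mini‑batch idea from the list above can be shown with a small counting example. This is a conceptual sketch rather than Oceanus internals: the function names, the partition contents, and the batch size of 4 are hypothetical. Each partition pre-aggregates its records in mini-batches, so a skewed "hot" key reaches the global stage as a few partial counts instead of one record per event.

```python
from collections import Counter


def local_aggregate(partition, batch_size):
    """Pre-aggregate one partition in mini-batches, yielding partial counts."""
    batch = Counter()
    for n, key in enumerate(partition, start=1):
        batch[key] += 1
        if n % batch_size == 0:
            # Flush the mini-batch as one partial result per distinct key.
            yield dict(batch)
            batch = Counter()
    if batch:
        yield dict(batch)


def global_aggregate(partials):
    """Merge the partial counts from all partitions into final totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts from a mapping
    return total


# Two hypothetical partitions with a heavily skewed key "hot".
partitions = [
    ["hot"] * 6 + ["a"],
    ["hot"] * 5 + ["b", "b"],
]

partials = [
    partial
    for partition in partitions
    for partial in local_aggregate(partition, batch_size=4)
]
totals = global_aggregate(partials)

raw_records = sum(len(p) for p in partitions)
shuffled_records = sum(len(p) for p in partials)
print(f"{shuffled_records} partial records sent downstream instead of {raw_records} raw records")
print(dict(totals))
```

A larger batch size reduces downstream traffic further but delays results, which is the usual trade-off when tuning mini-batch processing for latency-sensitive monitoring.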

Elasticsearch Service optimizations cover:

Storage model redesign: time‑based tiered merge strategy improves query performance and reduces write latency.

Cost reduction through hot‑warm‑cold data tiering (HDD for warm data, COS for cold archives), achieving ~10× storage cost savings.

Memory optimization: moving large FST structures off‑heap and employing multi‑level caching to lower GC overhead.
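The hot‑warm‑cold tiering described above amounts to routing each time-based index to a storage tier according to its age. The sketch below assumes daily indices and uses hypothetical thresholds of 7 and 30 days. The index names and the SSD medium for hot data are also assumptions; only HDD for warm data and COS for cold archives come from the article.

```python
from datetime import date

# Hypothetical age thresholds, in days.
WARM_AFTER_DAYS = 7
COLD_AFTER_DAYS = 30

# Storage medium per tier. HDD and COS follow the article; SSD is assumed.
TIER_MEDIA = {"hot": "SSD (assumed)", "warm": "HDD", "cold": "COS"}


def tier_for(index_date, today):
    """Pick a storage tier for a daily index based on its age in days."""
    age_days = (today - index_date).days
    if age_days < WARM_AFTER_DAYS:
        return "hot"
    if age_days < COLD_AFTER_DAYS:
        return "warm"
    return "cold"


today = date(2024, 3, 31)
indices = {
    "logs-2024.03.30": date(2024, 3, 30),
    "logs-2024.03.15": date(2024, 3, 15),
    "logs-2024.01.01": date(2024, 1, 1),
}

placement = {name: tier_for(day, today) for name, day in indices.items()}
for name, tier in placement.items():
    print(f"{name}: {tier} tier on {TIER_MEDIA[tier]}")
```

Because monitoring queries mostly target recent data, only a small share of indices needs the fastest storage, which is where the cost savings come from.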

The article concludes that combining Oceanus with Elastic Stack delivers a scalable, real‑time monitoring platform that meets the performance and cost requirements of hundred‑billion‑scale data scenarios.

Author: Long Yichen, Senior Engineer, Tencent CSIG
