How Weibo’s Hubble Platform Uses AI for Real‑Time Monitoring and Trend Forecasting

The article details Weibo Advertising's Hubble monitoring system, describing its three‑layer architecture, metric taxonomy, AI‑driven trend prediction with LSTM models, dynamic alert thresholds, and performance testing using GoReplay, illustrating how large‑scale data and machine learning enable proactive operations.

dbaplus Community

Background Introduction

The Weibo advertising system integrates traffic distribution, delivery, settlement, CTR estimation, and CRM. Its rapid growth has made monitoring a major challenge: thousands of modules compute and communicate continuously, and all of them need to be observed.

Overall Architecture

The monitoring stack is built on two platforms: D+ (a commercial big-data foundation that handles data collection, storage, and computation and exposes API interfaces) and Hubble (an intelligent panoramic monitoring and insight platform that runs on top of it). The architecture consists of three layers:

Data Collection Layer: Ingests system logs, system metrics, business logs, and business metrics in real time, using tools such as Flume, Scribe, Filebeat, and Metricbeat, plus a custom lightweight w-agent client managed via ZooKeeper.

Data Analysis Layer: Performs ETL, preprocessing, and aggregation. Data is persisted to HDFS, models trained offline are also stored in HDFS, and an alert trigger module applies alerting rules.

Visualization Layer: Stores processed data in Druid, Elasticsearch, MySQL, and ClickHouse, and provides dashboards, API access, and alarm management.

Core Function Analysis

1. Panoramic Monitoring

Basic monitoring aggregates time‑series data at a granularity of one second for key metrics, using D+ for real‑time data services. The system visualizes both machine‑level health (healthy, sub‑healthy, unhealthy) and service‑level status, displaying alerts and topology maps.
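
The one-second aggregation and three-level health status described above could be sketched roughly as follows. This is a minimal illustration, not Hubble's implementation: the function names, the use of CPU utilization as the health signal, and the 70%/90% cut-offs are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate_per_second(samples):
    """Average (timestamp_seconds, value) samples into one-second buckets.

    Timestamps are assumed to be non-negative floats in seconds.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts)].append(value)  # int() truncates to the whole second
    return {sec: sum(vals) / len(vals) for sec, vals in sorted(buckets.items())}


def health_status(cpu_pct, warn=70.0, crit=90.0):
    """Map a utilization percentage to a health level.

    The warn/crit cut-offs are hypothetical, not values from the article.
    """
    if cpu_pct >= crit:
        return "unhealthy"
    if cpu_pct >= warn:
        return "sub-healthy"
    return "healthy"


if __name__ == "__main__":
    raw = [(0.1, 10.0), (0.9, 20.0), (1.2, 30.0)]
    for second, avg in aggregate_per_second(raw).items():
        print(second, avg, health_status(avg))
```

In a real deployment the aggregation would presumably run inside the streaming layer rather than in a batch function like this one.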

2. Trend Prediction

Weibo Advertising uses an LSTM-based machine-learning model to forecast system metric trends; the article reports that it outperforms traditional statistical methods such as ARIMA. The pipeline has two stages: offline training on eight days of historical data (window length 3, with 66.7% of the samples used for training) and online inference on real-time data fed through Kafka. Predictions are stored in Druid for dashboard display.
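
To make the window length and training split concrete, here is a small sketch of how a metric series might be turned into supervised samples. The helper names and the toy series are assumptions for illustration; the article does not show the actual preprocessing code.

```python
def make_windows(series, window=3):
    """Turn a series into (inputs, targets): each input holds `window`
    consecutive points and the target is the point that follows them."""
    inputs, targets = [], []
    for i in range(len(series) - window):
        inputs.append(series[i:i + window])
        targets.append(series[i + window])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=2 / 3):
    """Split samples in time order (no shuffling), so the test set is
    always later than the training set."""
    cut = round(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


if __name__ == "__main__":
    toy_series = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # synthetic values, not real metrics
    x, y = make_windows(toy_series, window=3)
    (x_train, y_train), (x_test, y_test) = chronological_split(x, y)
    print(x_train[0], "->", y_train[0])
    print(len(x_train), "training samples,", len(x_test), "test samples")
```

Splitting chronologically rather than randomly is the usual choice for forecasting, since a random split would let the model see points from the future of its test samples.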

LSTM model details: a standard LSTM cell (input, forget, and output gates with sigmoid activations, and tanh for the cell activation), dropout of 0.2, MSE loss, the RMSprop optimizer, 50 epochs, batch size 1, and a final dense layer with linear activation. The trained model is saved as an .h5 file in HDFS.
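
For readers unfamiliar with the gate structure, the following sketch computes one step of a scalar LSTM cell using only the standard library. It illustrates the textbook LSTM equations, not the trained Weibo model: the parameter dictionary and its key names are hypothetical, and a production model would use a deep-learning framework with vector-valued states.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p is a dict of hypothetical weights: w* act on the input x,
    u* act on the previous hidden state, b* are biases.
    """
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # new cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)     # new hidden state, bounded in (-1, 1)
    return h, c


if __name__ == "__main__":
    keys = ["wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg"]
    params = {k: 0.5 for k in keys}  # arbitrary illustrative weights
    h, c = 0.0, 0.0
    for x in [0.1, 0.2, 0.3]:        # one window of length 3
        h, c = lstm_step(x, h, c, params)
    print(h, c)
```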

3. Dynamic Thresholds

Static alert thresholds set from experience cause false positives and missed alerts. Building on the trend predictions, Hubble instead sets dynamic thresholds (e.g., ±10% of the forecast curve) that follow each metric's expected behavior over time.
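
A ±10% band around the forecast can be checked with a few lines of code. The sketch below is an assumption-laden illustration: the function name is invented, and how Hubble actually computes or tunes the band is not described in the article.

```python
def find_alerts(observed, forecast, tolerance=0.10):
    """Return the indices where the observed value leaves the band
    forecast ± tolerance * |forecast|."""
    alerts = []
    for idx, (actual, predicted) in enumerate(zip(observed, forecast)):
        margin = abs(predicted) * tolerance  # abs() keeps the band valid for negative forecasts
        if actual < predicted - margin or actual > predicted + margin:
            alerts.append(idx)
    return alerts


if __name__ == "__main__":
    forecast = [100, 100, 100, 100]  # synthetic forecast curve
    observed = [100, 115, 95, 80]    # synthetic observations
    print(find_alerts(observed, forecast))
```

Because the band scales with the forecast, the same tolerance produces a wide band at peak traffic and a narrow one in quiet periods, which is what a single static threshold cannot do.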

4. Service Governance

To ensure high availability under traffic spikes, the system emphasizes proactive risk detection, capacity planning, and automated mitigation. Real-world traffic surges (e.g., viral events) demand rapid scaling, so the system's capacity to absorb them is evaluated ahead of time by replaying online traffic.

Performance Evaluation and Traffic Replay

GoReplay, an open‑source Go‑based tool, captures live HTTP traffic and replays it against test environments. Configuration examples demonstrate capturing from port 80, forwarding to target servers, injecting custom headers, and collecting statistics.

./goreplay \
  --input-raw :80 \
  --exit-after 60s \
  --output-http "192.168.1.1:80" \
  --output-http "192.168.1.2:8080|10%" \
  --http-set-header 'User-Agent: Gor' \
  --output-http-timeout 100ms \
  --stats \
  --output-http-stats

Replay enables full‑link tracing with unique Trace IDs, allowing measurement of QPS, latency distribution, and error rates across services during simulated peak loads.
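
The metrics mentioned above (QPS, latency distribution, error rate) can be derived from per-request replay records along these lines. This is a sketch under stated assumptions: the record format (latency in milliseconds plus HTTP status code), the nearest-rank percentile method, and the choice to count 5xx responses as errors are all illustrative, not taken from the article or from GoReplay's output format.

```python
import math


def summarize_replay(records, duration_s):
    """Summarize replay results.

    records: list of (latency_ms, status_code) tuples (hypothetical format).
    duration_s: length of the replay window in seconds.
    """
    if not records or duration_s <= 0:
        raise ValueError("need at least one record and a positive duration")

    latencies = sorted(lat for lat, _ in records)
    errors = sum(1 for _, status in records if status >= 500)  # assumption: 5xx = error

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(latencies)))
        return latencies[rank - 1]

    return {
        "qps": len(records) / duration_s,
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
        "error_rate": errors / len(records),
    }


if __name__ == "__main__":
    sample = [(10, 200), (20, 200), (30, 500), (40, 200)]  # synthetic records
    print(summarize_replay(sample, duration_s=2))
```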

Conclusion

Effective monitoring starts with comprehensive metric coverage, then refines accuracy and abstracts essential signals. Weibo’s advertising infrastructure has pioneered AI‑enhanced monitoring, achieving accurate trend forecasts and dynamic alerting, and plans to extend automation for degradation handling and recovery.
