How Weibo’s Hubble Platform Uses AI for Real‑Time Monitoring and Trend Forecasting
The article details Weibo Advertising's Hubble monitoring system, describing its three‑layer architecture, metric taxonomy, AI‑driven trend prediction with LSTM models, dynamic alert thresholds, and performance testing using GoReplay, illustrating how large‑scale data and machine learning enable proactive operations.
Background Introduction
The Weibo advertising system integrates traffic distribution, delivery, settlement, CTR estimation, and CRM. Its rapid growth, with thousands of modules computing and communicating continuously, has made monitoring a major challenge.
Overall Architecture
The monitoring system relies on two platforms: D+, a commercial big-data foundation that handles data collection, storage, and computation and exposes API interfaces, and Hubble, an intelligent panoramic monitoring and insight platform built on top of it. The architecture consists of three layers:
Data Collection Layer : Real‑time ingestion of system logs, metrics, business logs, and business metrics using tools such as Flume, Scribe, Filebeat, Metricbeat, and a custom lightweight w‑agent client managed via ZooKeeper.
Data Analysis Layer : ETL, preprocessing, and aggregation. Processed data is persisted to HDFS, which also holds the models produced by offline training, and an alert trigger module applies alerting rules to the results.
Visualization Layer : Stores processed data in Druid, Elasticsearch, MySQL, and ClickHouse, and provides dashboards, API access, and alarm management.
Core Function Analysis
1. Panoramic Monitoring
Basic monitoring aggregates time‑series data at a granularity of one second for key metrics, using D+ for real‑time data services. The system visualizes both machine‑level health (healthy, sub‑healthy, unhealthy) and service‑level status, displaying alerts and topology maps.
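As a rough illustration of one-second aggregation, the sketch below groups raw samples into per-second buckets. It is not the D+ implementation; the function name `aggregate_per_second`, the `(timestamp, value)` sample shape, and the choice of count, sum, and average as the aggregates are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate_per_second(samples):
    """Group (timestamp_seconds, value) samples into one-second buckets.

    Hypothetical helper: returns {second: {"count", "sum", "avg"}},
    ordered by second.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts)].append(value)  # truncate to the whole second
    return {
        sec: {"count": len(vals), "sum": sum(vals), "avg": sum(vals) / len(vals)}
        for sec, vals in sorted(buckets.items())
    }


# Made-up samples: two in second 100, one in second 101.
samples = [(100.1, 20.0), (100.7, 40.0), (101.2, 30.0)]
per_second = aggregate_per_second(samples)
```

In a production pipeline this grouping would normally happen in the stream-processing or storage layer rather than in application code; the sketch only shows the shape of the result.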
2. Trend Prediction
Weibo advertising adopts machine‑learning‑based forecasting (LSTM) to predict system metric trends, outperforming traditional statistical methods such as ARIMA. The pipeline includes offline model training on eight days of historical data (window length = 3, 66.7% training split) and online inference using Kafka‑fed real‑time data, with results stored in Druid for dashboard display.
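The data preparation described here can be sketched as follows, assuming that "window length = 3" means three consecutive observations are used to predict the next one and that the 66.7% split is chronological. Both readings are assumptions, as are the helper names `make_windows` and `chronological_split` and the sample series.

```python
def make_windows(series, window=3):
    """Turn a 1-D series into (inputs, targets) pairs.

    Each input holds `window` consecutive values; the target is the
    value that immediately follows them.
    """
    inputs, targets = [], []
    for i in range(len(series) - window):
        inputs.append(series[i:i + window])
        targets.append(series[i + window])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.667):
    """Split without shuffling so the test set lies after the training set."""
    cut = round(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


# Made-up metric values standing in for the historical data.
series = [10, 12, 11, 13, 15, 14, 16, 18, 17]
X, y = make_windows(series, window=3)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
```

Keeping the split chronological avoids leaking future values into training, which matters for any time-series forecaster regardless of the model used.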
LSTM model details: each LSTM cell has input, forget, and output gates that use sigmoid activations, with tanh as the cell activation. The network applies dropout of 0.2 and ends in a dense layer with linear activation. It is trained with MSE loss and the RMSprop optimizer for 50 epochs at batch size 1. The trained model is saved as an .h5 file in HDFS.
3. Dynamic Thresholds
Static, experience‑based alert thresholds cause false positives or missed alerts. By combining trend predictions, Hubble sets dynamic thresholds (e.g., ±10% of the forecast curve) that adapt to metric volatility.
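A minimal sketch of the ±10% band, assuming the forecast and the observed values are aligned point by point and that the metric is non-negative. The function names `dynamic_band` and `find_anomalies` and the sample numbers are illustrative, not part of Hubble.

```python
def dynamic_band(forecast, tolerance=0.10):
    """Return (lower, upper) bounds at +/- tolerance around each forecast.

    Assumes non-negative forecast values; a negative value would invert
    the bounds.
    """
    return [(f * (1 - tolerance), f * (1 + tolerance)) for f in forecast]


def find_anomalies(actual, forecast, tolerance=0.10):
    """Return the indices where the observed value leaves the band."""
    band = dynamic_band(forecast, tolerance)
    return [
        i
        for i, (value, (lower, upper)) in enumerate(zip(actual, band))
        if not lower <= value <= upper
    ]


# Made-up data: the second point is too low, the third too high.
forecast = [100.0, 200.0, 400.0]
actual = [105.0, 150.0, 450.0]
alerts = find_anomalies(actual, forecast)
```

Because the band scales with the forecast, the same 10% tolerance allows a wider absolute range at peak traffic than at night, which is what a fixed threshold cannot do.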
4. Service Governance
To ensure high availability under traffic spikes, the system emphasizes proactive risk detection, capacity planning, and automated mitigation. Real-world traffic surges (e.g., viral events) demand rapid scaling, and the system's ability to absorb them is evaluated by replaying online traffic.
Performance Evaluation and Traffic Replay
GoReplay, an open-source tool written in Go, captures live HTTP traffic and replays it against test environments. The example below captures traffic on port 80 for 60 seconds, forwards all of it to one target server and 10% of it to a second, sets a custom User-Agent header, applies a 100 ms timeout to replayed requests, and collects statistics.
./goreplay \
  --input-raw :80 \
  --exit-after 60s \
  --output-http "192.168.1.1:80" \
  --output-http "192.168.1.2:8080|10%" \
  --http-set-header 'User-Agent: Gor' \
  --output-http-timeout 100ms \
  --stats \
  --output-http-stats
Replay enables full-link tracing with unique Trace IDs, so that QPS, latency distribution, and error rates can be measured across services during simulated peak loads.
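The measurements mentioned above can be illustrated with a small sketch that summarizes replay results. It assumes each replayed request has already been reduced to a `(latency_ms, http_status)` pair, counts 5xx responses as errors, and uses nearest-rank percentiles; the record format and the `summarize` helper are hypothetical and are not GoReplay output.

```python
import math


def summarize(records, duration_s):
    """Summarize replay results.

    records: non-empty list of (latency_ms, http_status) pairs.
    duration_s: length of the replay window in seconds.
    """
    latencies = sorted(latency for latency, _ in records)
    errors = sum(1 for _, status in records if status >= 500)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% at or below it.
        rank = max(1, math.ceil(p / 100 * len(latencies)))
        return latencies[rank - 1]

    return {
        "qps": len(records) / duration_s,
        "error_rate": errors / len(records),
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
    }


# Made-up sample: four requests over two seconds, one server error.
records = [(12, 200), (15, 200), (20, 200), (250, 502)]
summary = summarize(records, duration_s=2)
```

Reporting percentiles alongside the error rate keeps a slow tail visible, which an average latency would hide.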
Conclusion
Effective monitoring starts with comprehensive metric coverage, then refines accuracy and abstracts essential signals. Weibo’s advertising infrastructure has pioneered AI‑enhanced monitoring, achieving accurate trend forecasts and dynamic alerting, and plans to extend automation for degradation handling and recovery.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
