How Baidu Scales Real‑Time Event Monitoring for Billions of Log Events
This article explains Baidu's log platform architecture, the UBC event‑tracking protocol, monitoring requirements, and the low‑cost, high‑accuracy solutions—including dimension mapping, watermark handling, data trimming, and time‑window aggregation—that enable real‑time, customizable monitoring of petabyte‑scale log streams.
Introduction
Baidu's log platform serves as a one‑stop solution for data tracking across most of its key products, handling billions of page‑view (PV) events daily. Real‑time monitoring of these logs is essential for handling traffic spikes, evolving tracking logic, and billing, yet achieving precise, stable, and user‑friendly monitoring at such scale is challenging.
Concepts & Requirements
UBC (User Behavior Collection) is the primary protocol, with three log types: UBC client logs, UBC server logs, and UBC H5 logs. Each log carries a UBC ID to distinguish user actions. Two main log categories exist:
Event tracking: records a single action (e.g., a click) and is aggregated by PV.
Stream tracking: records a continuous action (e.g., watching a video) and includes a duration field; both PV and total duration are aggregated.
Parameters are divided into:
Common parameters (system type, device ID, app name/version, etc.) automatically collected by the SDK.
Business parameters defined per UBC ID to provide finer‑grained segmentation (e.g., a from field to indicate the source page of a click).
Monitoring requirements include minute‑level latency, separate aggregation rules for event vs. stream logs, and flexible filtering on both common and business parameters.
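The aggregation rules above can be illustrated with a minimal Python sketch. The `UbcRecord` class, its field names, and the `aggregate` helper are hypothetical and only serve to show how event and stream records might be counted differently while filtering on common and business parameters; they are not Baidu's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class UbcRecord:
    # Hypothetical record shape; field names are assumptions for illustration.
    ubc_id: str
    kind: str                      # "event" or "stream"
    common: dict                   # common parameters collected by the SDK
    business: dict = field(default_factory=dict)  # per-UBC-ID parameters
    duration_ms: int = 0           # only meaningful for stream records


def matches(record, common_filter=None, business_filter=None):
    """Return True if the record satisfies every requested filter value."""
    common_filter = common_filter or {}
    business_filter = business_filter or {}
    return (
        all(record.common.get(k) == v for k, v in common_filter.items())
        and all(record.business.get(k) == v for k, v in business_filter.items())
    )


def aggregate(records, common_filter=None, business_filter=None):
    """Aggregate per UBC ID: PV for every record, plus duration for streams."""
    stats = {}
    for r in records:
        if not matches(r, common_filter, business_filter):
            continue
        entry = stats.setdefault(r.ubc_id, {"pv": 0})
        entry["pv"] += 1
        if r.kind == "stream":
            entry["total_duration_ms"] = (
                entry.get("total_duration_ms", 0) + r.duration_ms
            )
    return stats


records = [
    UbcRecord("101", "event", {"os": "android"}, {"from": "feed"}),
    UbcRecord("101", "event", {"os": "ios"}, {"from": "feed"}),
    UbcRecord("202", "stream", {"os": "android"}, {}, 3000),
    UbcRecord("202", "stream", {"os": "android"}, {}, 1500),
]
android_stats = aggregate(records, common_filter={"os": "android"})
```

Here `android_stats` holds a PV of 1 for the event ID `101` and a PV of 2 with a total duration of 4500 ms for the stream ID `202`.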
Overall Solution Architecture
The existing pipeline collects client SDK logs, stores raw logs, forwards them to a message queue, and processes them with streaming jobs. To meet monitoring goals, Baidu introduced a layered design:
Log Management Platform: maintains monitoring metadata (tracking types, extraction rules) and provides a UI for query and visualization.
Streaming Processing Tasks: periodically sync metadata, transform raw UBC logs into dimension-enriched monitoring records, and discard redundant fields.
Monitoring Message Queue: decouples the monitoring tasks from upstream streaming jobs.
Monitoring Processing Tasks: aggregate per-dimension metrics into time-windowed statistics (PV, total duration) and write results to Elasticsearch and a distributed file system.
Key Challenges & Solutions
4.1 Avoiding Dimension Explosion
Business parameters can vary widely, leading to a combinatorial explosion of monitoring dimensions. Baidu limits custom filter dimensions to six and supports both one‑to‑one and many‑to‑one mapping logic, allowing unlimited business parameters without unbounded column growth.
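A rough sketch of how a fixed set of six slots could absorb arbitrary business parameters is shown below. The slot names `dim1` to `dim6`, the `build_mapper` helper, and the choice to join several parameters with `|` for the many-to-one case are all assumptions made for illustration; the source does not describe the actual mapping format.

```python
MAX_CUSTOM_DIMS = 6  # the six-slot limit described in the text


def build_mapper(rules):
    """Build a mapping function from per-slot rules.

    `rules` is a list with one tuple per slot. A one-element tuple is a
    one-to-one mapping; a longer tuple is a many-to-one mapping in which
    several business parameters are combined into a single slot value.
    """
    if len(rules) > MAX_CUSTOM_DIMS:
        raise ValueError(f"at most {MAX_CUSTOM_DIMS} custom dimensions allowed")

    def map_record(business):
        dims = {}
        for slot, params in enumerate(rules, start=1):
            # Missing parameters become empty strings so the key is stable.
            values = [str(business.get(p, "")) for p in params]
            dims[f"dim{slot}"] = "|".join(values)
        return dims

    return map_record


# Hypothetical rules for one UBC ID: slot 1 is one-to-one, slot 2 many-to-one.
mapper = build_mapper([("from",), ("page", "tab")])
dims = mapper({"from": "feed", "page": "home", "tab": "video", "extra": "x"})
```

In this sketch the unmapped `extra` parameter is simply dropped, so the number of stored columns stays fixed no matter how many business parameters a UBC ID defines.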
4.2 Preventing Data Skew with Watermarks
Traditional online-service monitoring plots metrics against wall-clock time, so congestion or downtime in an upstream service shows up as gaps in the data. Baidu switches the X-axis to log-report time and introduces a watermark: the highest timestamp up to which data has been fully processed. Data before the watermark is immutable, while data after it may still change. This eliminates the skew caused by service-side anomalies.
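The watermark idea can be sketched as follows. This is a simplified illustration, not Baidu's implementation: it assumes the input is split into named partitions, that each partition reports timestamps roughly in order, and that the overall watermark is the minimum of the per-partition progress. The `WatermarkTracker` class and its method names are hypothetical.

```python
class WatermarkTracker:
    """Track how far each input partition has progressed in log-report time."""

    def __init__(self, partitions):
        # None means the partition has not reported any data yet.
        self._progress = {p: None for p in partitions}

    def advance(self, partition, report_ts):
        """Record the latest log-report timestamp seen on a partition."""
        current = self._progress[partition]
        if current is None or report_ts > current:
            self._progress[partition] = report_ts

    @property
    def watermark(self):
        """Timestamp up to which all partitions have been processed."""
        if any(ts is None for ts in self._progress.values()):
            return None
        return min(self._progress.values())

    def is_final(self, window_end_ts):
        """A window [start, end) is immutable once end <= watermark."""
        wm = self.watermark
        return wm is not None and window_end_ts <= wm


tracker = WatermarkTracker(["p0", "p1"])
tracker.advance("p0", 600)
tracker.advance("p1", 300)   # p1 lags behind, so it bounds the watermark
```

With this state, a window ending at timestamp 300 is treated as final, while a window ending at 600 may still receive data from the lagging partition and is shown as provisional.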
4.3 Reducing Monitoring Cost
Raw logs average 10 KB per record; after dimension mapping and trimming, each record shrinks to about 0.2 KB, a 98% reduction. Baidu then aggregates the trimmed records into 5-minute windows, storing only the count (PV) and the sum (duration) for each dimension combination. This reduces the data volume from hundreds of millions of rows to under 100K rows per window (≈99.98% compression).
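The trimming and windowing steps can be sketched in a few lines of Python. The field names in `KEPT_FIELDS`, the `report_ts` and `duration_ms` keys, and the helper functions are assumptions chosen for illustration; only the 5-minute window and the count/sum outputs come from the description above.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute windows, as described in the text
KEPT_FIELDS = ("ubc_id", "os", "app_version")  # hypothetical dimensions


def trim(raw):
    """Keep only the monitoring dimensions, report time, and duration."""
    return {
        "key": tuple(raw.get(f) for f in KEPT_FIELDS),
        "report_ts": raw["report_ts"],            # seconds since epoch
        "duration_ms": raw.get("duration_ms", 0),
    }


def window_start(ts):
    """Align a timestamp to the start of its 5-minute window."""
    return ts - ts % WINDOW_SECONDS


def aggregate_windows(raw_logs):
    """Collapse raw logs into one (pv, duration) row per window and key."""
    buckets = defaultdict(lambda: {"pv": 0, "duration_ms": 0})
    for raw in raw_logs:
        rec = trim(raw)
        bucket = buckets[(window_start(rec["report_ts"]), rec["key"])]
        bucket["pv"] += 1
        bucket["duration_ms"] += rec["duration_ms"]
    return dict(buckets)
```

However many raw records arrive, the output holds one row per window and dimension combination. As a consistency check on the quoted figures, 1 - 0.2/10 = 0.98 for trimming, and a 99.98% reduction to 100K rows corresponds to roughly 500 million input rows.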
Conclusion & Outlook
By combining dimension mapping, watermark‑based time handling, data trimming, and time‑window aggregation, Baidu achieves a low‑latency, highly customizable, and cost‑effective monitoring solution for its massive log platform. The approach ensures stable, reliable observation of user behavior while keeping storage and computation costs manageable, and it sets a foundation for future enhancements in observability and reliability.
