Design and Implementation of a Scalable Real-Time Log Monitoring Platform at Baidu
This article introduces Baidu's log platform that handles billions of daily events, explains UBC logging concepts and monitoring requirements, and details a low‑cost, high‑accuracy architecture using real‑time streaming, dimension mapping, watermarking, and time‑window aggregation to achieve reliable, scalable event monitoring.
The Baidu log platform serves as a one‑stop solution for data collection across major products, processing billions of daily user‑behavior events (UBC logs) and supporting both event‑type and stream‑type data.
UBC (User Behavior Collection) logs fall into three types: UBC client logs, UBC server logs, and UBC H5 logs. Each log carries a unique UBC ID, a set of public parameters (device, OS, app version), and optional business‑specific parameters.
Monitoring requirements include minute‑level latency, accurate PV counting for event logs, PV + duration aggregation for stream logs, and flexible filtering by both public and business parameters.
The proposed solution abandons pure online‑service monitoring in favor of a pipeline driven by streaming tasks: logs are ingested, stored, and then processed by Flink‑style streaming jobs. These jobs apply dimension mapping to prevent dimension explosion: custom dimensions are limited to six per event, and both 1‑to‑1 and many‑to‑1 mappings are supported.
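The dimension mapping step can be sketched as follows. This is a minimal illustration, not the platform's actual code: the mapping table, the field names, and the reading of "many‑to‑1" as several raw parameters feeding one monitoring dimension are all assumptions made for the example. Only the limit of six custom dimensions comes from the article.

```python
MAX_CUSTOM_DIMENSIONS = 6  # limit stated in the article

# Hypothetical mapping config: monitoring dimension -> candidate raw parameters.
# One candidate is a 1-to-1 mapping; several candidates is a many-to-1 mapping.
DIMENSION_MAPPING = {
    "page": ["page"],                     # 1-to-1
    "source": ["from", "source", "src"],  # many-to-1
}


def validate_mapping(mapping):
    """Reject configs that would let the dimension count grow unchecked."""
    if len(mapping) > MAX_CUSTOM_DIMENSIONS:
        raise ValueError(
            f"{len(mapping)} custom dimensions exceed the limit of "
            f"{MAX_CUSTOM_DIMENSIONS}"
        )


def map_dimensions(log, mapping):
    """Project a raw log onto the configured dimensions only."""
    validate_mapping(mapping)
    dims = {}
    for dim, fields in mapping.items():
        # Take the first candidate field present in the log, else "unknown".
        value = next((log[f] for f in fields if f in log), "unknown")
        dims[dim] = str(value)
    return dims


raw = {"page": "home", "src": "push", "extra": 1}
print(map_dimensions(raw, DIMENSION_MAPPING))
```

Because only the mapped dimensions survive, the number of distinct dimension combinations is bounded by the configuration rather than by whatever parameters a business line happens to report.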
Watermarking is introduced so that the monitoring time axis (the X‑axis) is based on the log‑report timestamp. Data before the watermark is considered stable, while data after it may still change; this eliminates the data shift that server congestion would otherwise cause.
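One common way to realize a watermark is the bounded out‑of‑orderness approach: the watermark trails the largest report timestamp seen so far by a fixed lateness allowance. The sketch below assumes that approach; the article does not say how the platform computes its watermark, and the class name and the 60‑second allowance are hypothetical.

```python
class WatermarkTracker:
    """Tracks a watermark over log-report timestamps (seconds)."""

    def __init__(self, allowed_lateness_s=60):
        # Assumed lateness allowance; not a value from the article.
        self.allowed_lateness_s = allowed_lateness_s
        self.max_report_ts = None

    def observe(self, report_ts):
        """Record the report timestamp of an incoming log."""
        if self.max_report_ts is None or report_ts > self.max_report_ts:
            self.max_report_ts = report_ts

    @property
    def watermark(self):
        """Largest timestamp before which data is treated as complete."""
        if self.max_report_ts is None:
            return None
        return self.max_report_ts - self.allowed_lateness_s

    def is_stable(self, window_end_ts):
        """A window is stable once the watermark has passed its end."""
        wm = self.watermark
        return wm is not None and window_end_ts <= wm


tracker = WatermarkTracker(allowed_lateness_s=60)
for ts in (1000, 1200, 1100):  # logs may arrive out of order
    tracker.observe(ts)
print(tracker.watermark, tracker.is_stable(1100), tracker.is_stable(1200))
```

A chart built this way can mark points after the watermark as provisional, so a delayed batch fills in its original time bucket instead of appearing as a later spike.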
To reduce storage and compute costs, the pipeline performs data trimming (dropping raw fields after mapping) and time‑window aggregation (e.g., 5‑minute windows) that output only count (PV) and sum (duration) per dimension combination, compressing billions of raw rows to a few hundred thousand aggregated records.
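The trimming and windowed aggregation described above can be illustrated with a small in‑memory sketch. The record layout (`ts`, `ubc_id`, `duration`) and the function names are assumptions for the example; a production job would run this logic inside the streaming engine rather than over a Python list.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # the 5-minute window used as an example in the article


def window_start(ts, size=WINDOW_SECONDS):
    """Align a timestamp (integer seconds) to the start of its window."""
    return ts - ts % size


def aggregate(records, dim_names):
    """Collapse records into one row per (window, UBC ID, dimensions)."""
    agg = defaultdict(lambda: {"pv": 0, "duration": 0.0})
    for rec in records:
        dims = tuple(rec.get(d, "unknown") for d in dim_names)
        key = (window_start(rec["ts"]), rec["ubc_id"]) + dims
        agg[key]["pv"] += 1  # count -> PV
        agg[key]["duration"] += rec.get("duration", 0.0)  # sum -> duration
    return dict(agg)


records = [
    {"ts": 10, "ubc_id": "1001", "page": "home", "duration": 2.0},
    {"ts": 200, "ubc_id": "1001", "page": "home", "duration": 3.0},
    {"ts": 310, "ubc_id": "1001", "page": "home"},  # event log, no duration
]
for key, value in aggregate(records, ["page"]).items():
    print(key, value)
```

The output size depends on the number of distinct dimension combinations per window, not on the raw event count, which is where the compression comes from.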
The architecture consists of four main components: a Log Management Platform for metadata, a streaming processing task for dimension extraction and trimming, a monitoring message queue to decouple processing, and a monitoring aggregation task that writes results to Elasticsearch and backs them up on distributed storage.
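How the four components hand data to each other can be sketched with standard‑library stand‑ins. Everything concrete here is an assumption made for illustration: a dict plays the role of the Log Management Platform metadata, `queue.Queue` stands in for the monitoring message queue, and a plain list stands in for the Elasticsearch sink.

```python
import queue

WINDOW_SECONDS = 300


def processing_task(raw_logs, metadata, mq):
    """Extract configured dimensions, trim everything else, publish to the queue."""
    for log in raw_logs:
        dim_names = metadata.get(log["ubc_id"])
        if dim_names is None:
            continue  # UBC ID not registered for monitoring in this sketch
        dims = tuple(log.get(d, "unknown") for d in dim_names)
        mq.put((log["ubc_id"], log["ts"], dims))  # trimmed message


def aggregation_task(mq, sink):
    """Drain the queue, count PV per window and dimensions, write to the sink."""
    counts = {}
    while True:
        try:
            ubc_id, ts, dims = mq.get_nowait()
        except queue.Empty:
            break
        key = (ubc_id, ts - ts % WINDOW_SECONDS, dims)
        counts[key] = counts.get(key, 0) + 1
    for (ubc_id, start, dims), pv in counts.items():
        sink.append(
            {"ubc_id": ubc_id, "window_start": start, "dims": dims, "pv": pv}
        )


metadata = {"1001": ["page"]}  # stand-in for platform metadata
raw_logs = [
    {"ubc_id": "1001", "ts": 5, "page": "home", "extra": "x"},
    {"ubc_id": "1001", "ts": 50, "page": "home"},
    {"ubc_id": "9999", "ts": 7},
]
mq, sink = queue.Queue(), []
processing_task(raw_logs, metadata, mq)
aggregation_task(mq, sink)
print(sink)
```

Keeping the queue between the two tasks means a traffic spike lengthens the backlog rather than overloading the aggregation stage directly.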
Overall, the design achieves high‑accuracy, low‑latency monitoring with controllable dimension size, robust handling of traffic spikes, and a 99.98% reduction in data volume, providing a reliable foundation for Baidu's product analytics.
Top Architect