How Baidu’s Log Platform Cuts Billions in Cost with Full‑Lifecycle Event Governance
This article details Baidu's log platform point‑governance practice, explaining why uncontrolled event logging inflates storage and compute costs, and describing a three‑stage solution—manual, semi‑automatic platform, and full‑lifecycle standardization—that uses anomaly detection, automated workflows, and IM bots to achieve massive PV reduction and annual cost savings.
In Baidu’s ecosystem, a "point" (打点) is a piece of embedded statistical code that records user actions such as clicks and swipes. The resulting logs feed reporting, A/B testing, and personalization. Billions of point logs are produced every day, consuming hundreds of petabytes of storage and incurring high compute costs.
Problem Analysis
As the business iterates, point logs keep growing in both volume and record length, which destabilizes point services and drives up storage and compute demand. The key challenges are locating useless points, detecting abnormal points, trimming unneeded fields, and keeping services stable during feature rollouts and high‑traffic events.
Solution Overview
The governance approach is divided into three phases:
Manual Governance: Direct communication between the log platform team and product owners to understand point usage, analyze PV spikes, and apply customized mitigation strategies.
Semi‑Automatic Platform Governance: A platform that automates the workflow, providing a DAG‑based process and four governance modes (demand surge, anomaly fix, activity traffic, point optimization), and integrating IM bots for one‑click group creation and templated notifications.
Full‑Lifecycle Standardized Governance: A standardized architecture that continuously handles point retirement, anomaly repair, redundancy reduction, and feature‑based classification, enabling repeatable, efficient interventions.
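The article does not describe how the DAG‑based process is implemented. As a minimal sketch only, a governance workflow can be modeled as a dependency graph and ordered with Python's standard `graphlib` module. The step names and the `WORKFLOW` mapping below are hypothetical illustrations, not Baidu's actual steps.

```python
from graphlib import TopologicalSorter

# Hypothetical governance steps (names are assumptions, not from the article).
# Each key maps to the set of steps that must finish before it can start.
WORKFLOW = {
    "detect_issue": set(),
    "lookup_owner": {"detect_issue"},
    "collect_pv_stats": {"detect_issue"},
    "create_im_group": {"lookup_owner"},
    "notify_owners": {"create_im_group", "collect_pv_stats"},
    "apply_fix": {"notify_owners"},
    "verify_pv": {"apply_fix"},
}


def execution_order(workflow):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    for step in execution_order(WORKFLOW):
        print(step)
```

Steps with no dependency between them (here `lookup_owner` and `collect_pv_stats`) may appear in either order, which is what lets a scheduler run them in parallel.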
Phase Details
1.1 Manual Governance
This phase focuses on understanding the diverse purposes points serve (e.g., activity, experiment, demand) and delivering flexible, user‑centric governance measures. It works well at first, but it becomes hard to scale as more points and business lines are added.
1.2 Semi‑Automatic Platform Governance
This phase implements scheduled tasks that collect the daily PV of each point, apply anomaly detection algorithms, and flag abnormal points. The platform provides three status pages (to‑govern, in‑govern, completed) and records state transitions in a MySQL database for real‑time analytics and visualization. An IM bot automates group creation, sends templated alerts, and @‑mentions the relevant owners to speed up the workflow.
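The article does not name the anomaly detection algorithms in use. The sketch below assumes one common choice, a modified z‑score based on the median and the median absolute deviation (MAD) of a trailing window, to illustrate how a point's latest daily PV could be flagged. The function name, window size, and threshold are all assumptions.

```python
from statistics import median


def flag_abnormal_points(daily_pv, window=7, threshold=3.5):
    """Flag points whose latest daily PV deviates from a trailing baseline.

    daily_pv maps a point id to a list of daily PV counts, oldest first.
    Returns {point_id: modified z-score} for the flagged points.
    """
    flagged = {}
    for point_id, series in daily_pv.items():
        if len(series) < window + 1:
            continue  # not enough history to form a baseline
        baseline = series[-(window + 1):-1]  # the `window` days before the latest
        latest = series[-1]
        med = median(baseline)
        mad = median([abs(x - med) for x in baseline])
        # Assumed fallback: avoid dividing by zero when the baseline is flat.
        scale = mad if mad > 0 else max(1.0, 0.01 * med)
        score = 0.6745 * (latest - med) / scale
        if abs(score) > threshold:
            flagged[point_id] = round(score, 2)
    return flagged


if __name__ == "__main__":
    sample = {
        "click_feed": [100, 102, 98, 101, 99, 100, 103, 500],
        "swipe_home": [50, 52, 49, 51, 50, 48, 51, 50],
    }
    print(flag_abnormal_points(sample))
```

A median‑based baseline is used here because a single earlier spike distorts it less than a mean would. In practice, seasonality (weekends, campaigns) would likely need separate handling.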
1.3 Full‑Lifecycle Standardized Governance
This phase classifies points by their characteristics (single‑demand, composite, activity, cascade, experiment, framework) and applies a tailored strategy to each class. It continuously identifies useless points, meaning points with no traffic or with traffic that nothing downstream uses, and removes them after a review process. It also merges redundant points, samples high‑volume logs, and fixes abnormal points through a defined detection‑confirmation‑repair pipeline.
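The two criteria for a useless point (no traffic, or traffic without downstream usage) can be sketched as a simple filter that produces candidates for the review step. The `PointStats` fields, the 30‑day window, and the reason labels are assumptions for illustration. The article does not say how traffic or downstream usage is measured.

```python
from dataclasses import dataclass


@dataclass
class PointStats:
    point_id: str
    pv_last_30d: int         # total PV in an assumed 30-day observation window
    downstream_readers: int  # assumed count of jobs or reports reading the point


def retirement_candidates(stats):
    """Return (point_id, reason) pairs for points that look useless.

    The output is meant as input to a human review, not for direct deletion.
    """
    candidates = []
    for s in stats:
        if s.pv_last_30d == 0:
            candidates.append((s.point_id, "no_traffic"))
        elif s.downstream_readers == 0:
            candidates.append((s.point_id, "no_downstream_usage"))
    return candidates


if __name__ == "__main__":
    stats = [
        PointStats("old_banner_click", 0, 2),
        PointStats("debug_scroll", 120_000, 0),
        PointStats("feed_click", 9_500_000, 14),
    ]
    for point_id, reason in retirement_candidates(stats):
        print(point_id, reason)
```

Keeping the reason alongside each candidate lets reviewers treat the two cases differently: a point with no traffic may already be dead code, while a point with traffic but no readers is still costing storage and compute.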
Key Outcomes
The governance project has identified more than ten potential risk points, cut the logs reported per user per day by hundreds of lines, and saved millions of dollars a year in compute and storage by governing billions of PVs annually. It also improves point quality, supports business growth, and lays the groundwork for future event‑based PV governance.