Ensuring Data Accuracy and Reliability in Baidu Log Platform: Architecture, Challenges, and Solutions
This article introduces the current state of Baidu's log platform, explains its lifecycle from data collection to downstream applications, analyzes the challenges of achieving near‑zero duplication and loss, and presents architectural optimizations and best‑practice recommendations to improve data stability and accuracy across the system.
1. Overview
1.1 Platform Positioning
The log platform provides a one‑stop service for point‑data, managing the full lifecycle of log data. With minimal development effort, it enables fast collection, transmission, management, and query analysis of logs, supporting product operation analysis, R&D performance analysis, and operations management for both client‑side and server‑side applications.
1.2 Integration Status
The platform now covers most key products within the company, including Baidu APP, Mini‑Programs, and Matrix APPs. Integration benefits include:
Coverage: Almost all internal APPs, Mini‑Programs, incubated APPs, and externally acquired APPs are integrated.
Service Scale: Billions of log entries per day, peak QPS in the millions, and service stability of 99.9995%.
1.3 Terminology
Client: Software that users run directly on phones or PCs, e.g., Baidu APP, Mini‑Program.
Server: Services that respond to client requests, typically deployed on cloud servers.
Log Platform: Refers to the client‑side log platform, covering the full lifecycle of log data, including SDK, server, and management components.
Point‑Data SDK: Handles collection, packaging, and reporting of logs. Variants include APP SDK, H5 SDK, generic SDK, performance SDK, Mini‑Program SDK, etc.
Point‑Data Server: Core log reception service of the platform.
Feature/Model Service: Forwards points that require strategy or model computation to the downstream Strategy Recommendation Platform.
1.4 Service Overview Diagram
The log service consists of a foundation layer, management platform, business data applications, and product support. In June 2021, Baidu released a client log reporting specification.
Foundation Layer: Supports APP‑SDK, JS‑SDK, performance SDK, generic SDK, enabling rapid integration of various point‑data needs. Relies on big‑data infrastructure to distribute data to downstream applications.
Platform Layer: Manages metadata, controls the entire point‑data lifecycle, and ensures real‑time and offline forwarding with flow control and monitoring to achieve 99.995% stability.
Business Capability: Logs are forwarded to data centers, performance platforms, strategy platforms, growth platforms, etc., supporting product decision analysis, client quality monitoring, and growth strategies.
Business Support: Covers key APPs and new incubated Matrix APPs, providing generic components horizontally.
2. Core Goals of the Log Platform
The platform carries all APP log points and sits at the front line of data production. While ensuring comprehensive coverage and fast, flexible integration, the most critical challenge is data accuracy. Accuracy is split into two aspects:
Non‑duplication: Prevent duplicate data caused by retries or architectural fault recovery.
Non‑loss: Prevent data loss caused by system failures or code bugs.
Achieving near‑100% non‑duplication and non‑loss presents many difficulties.
2.1 Log Platform Architecture
Log data flows from the client through the platform to real‑time or offline downstream services via several stages.
Data can be consumed in different ways:
Real‑time:
Near‑real‑time stream (message queue): High timeliness (minutes), strict accuracy; typical for R&D platforms and trace platforms.
Pure real‑time stream (RPC proxy): Second‑level timeliness, tolerates some loss; typical for recommendation systems.
Offline: Full log tables with day‑level or hour‑level timeliness, requiring strict accuracy.
Other: Requires a balance of timeliness and accuracy.
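The trade‑off between timeliness and loss tolerance for each consumption mode can be sketched as a small dispatch table. This is an illustrative Python sketch; the mode names, fields, and default behavior are assumptions for clarity, not the platform's actual configuration.

```python
# Hypothetical consumption-mode routing table; names are illustrative,
# not the platform's real configuration.
ROUTES = {
    "near_real_time": {"transport": "message_queue", "latency": "minutes",       "loss_tolerance": "none"},
    "pure_real_time": {"transport": "rpc_proxy",     "latency": "seconds",       "loss_tolerance": "limited"},
    "offline":        {"transport": "batch_table",   "latency": "hours_or_days", "loss_tolerance": "none"},
}

def route(point: dict) -> dict:
    """Pick a downstream route based on the point's declared consumption mode.

    Unknown or missing modes fall back to the strict offline path, since
    offline tables require full accuracy anyway.
    """
    mode = point.get("mode", "offline")
    return ROUTES.get(mode, ROUTES["offline"])

print(route({"id": 1, "mode": "pure_real_time"})["transport"])  # rpc_proxy
```

The point of the table is that loss tolerance and latency are declared per stream, so the platform can apply strict guarantees only where they are actually required.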
2.2 Problems Faced
From the architecture, several issues arise:
Monolithic Service: The point‑data server handles all processing logic, leading to heavy coupling.
Multiple functions: ingestion, persistence, business logic, and various forwarding paths (RPC, MQ, PB storage).
Many fan‑out streams: more than 10 downstream fan‑out flows.
Direct MQ Integration: Directly sending to message queues risks message loss and cannot meet non‑duplication/non‑loss requirements.
Lack of Business Tiering: Core and non‑core services are tightly coupled. Iterative changes in one affect the other.
3. Achieving Non‑Duplication and Non‑Loss
3.1 Theory of Data Non‑Loss
Data loss can occur at three layers:
Client side: Environmental issues (white screen, crashes, non‑persistent processes) cause loss.
Ingestion layer: Server failures (restarts, crashes) cause loss.
Computation layer: Stream processing must guarantee strict non‑duplication and non‑loss.
3.1.2 Architecture Optimization Directions
Ingestion Layer:
Persist data first, then perform business processing.
Reduce logical complexity.
Downstream Forwarding:
Real‑time streams: Strict non‑loss.
High‑timeliness streams: Allow limited loss while guaranteeing timeliness.
Resource isolation: Physically separate deployments for different businesses.
Priority differentiation: Separate data based on business urgency.
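Priority differentiation can be illustrated with tiered queues. This is a minimal sketch under the assumption of a simple high/low priority field on each point; in the real system, tiers would map to physically isolated deployments rather than in‑process queues.

```python
import queue

# Hypothetical priority tiers; real deployments would map these to
# physically separated clusters, not in-process queues.
QUEUES = {"core": queue.Queue(), "non_core": queue.Queue()}

def dispatch(point: dict) -> None:
    """Route a point to its tier's queue so core and non-core traffic
    never share a forwarding path (resource isolation)."""
    tier = "core" if point.get("priority") == "high" else "non_core"
    QUEUES[tier].put(point)
```

With this separation, a surge or failure in non‑core traffic cannot back‑pressure the core stream.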
3.2 Architecture Decomposition
Based on the current state and the two goals above (non‑duplication and non‑loss), we decompose and reconstruct the architecture.
3.2.1 Point‑Data Server Decomposition (Improving Ingestion‑Layer Loss)
Key improvements include:
Prioritize persistence to reduce loss caused by server faults.
Break the monolithic service into lightweight components.
Design flexible, easy‑to‑use streaming computation architecture.
3.2.1.1 Log Prioritized Persistence
Data is persisted before any business processing. Real‑time streams aim for minute‑level latency while minimizing loss.
Persistence: Store data before business handling to ensure it is not lost.
Real‑time stream: Use disk‑plus‑Minos forwarding to MQ, achieving minute‑level delay with minimal loss.
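The persist‑first principle means a point is durably written before any business logic or forwarding runs. Below is a minimal Python sketch of this idea, assuming a local append‑only file stands in for the platform's disk‑plus‑Minos path and an in‑process queue stands in for the MQ hand‑off.

```python
import json
import os
import queue

forward_queue = queue.Queue()  # stands in for the MQ forwarding path

def ingest(point: dict, log_path: str = "points.log") -> None:
    """Persist the point durably, then hand it off for forwarding.

    The fsync before enqueueing is what lets the point survive a
    server crash or restart at the ingestion layer.
    """
    line = json.dumps(point) + "\n"
    with open(log_path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())   # data is on disk before we acknowledge it
    forward_queue.put(point)    # only now does business processing see it
```

The ordering is the essential part: if the process dies after the fsync but before the hand‑off, the point can be replayed from disk instead of being lost.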
3.2.1.2 Monolithic Service Decomposition & Function Down‑Shift
To reduce risk from frequent feature iterations, we split the online service:
Real‑time Business: Data flows from ingestion → fan‑out → business layer → downstream.
High‑Timeliness Business: Dedicated RPC service for strategy recommendation, achieving >99.95% SLA.
Other Business: Monitoring, VIP, gray‑release services with relaxed timeliness and loss requirements, isolated into separate services.
Technology Choice: StreamCompute architecture ensures end‑to‑end non‑duplication and non‑loss.
(Resulting architecture diagram omitted.)
3.2.1.3 Stream Computing Considerations
To guarantee strict data stability, a streaming computation framework is required:
Point‑data server forwards real‑time streams to a message queue, then fans out to the stream framework.
Fan‑out flow splits points based on traffic size, outputting to different queues for flexible downstream subscription.
Business flow: Independent jobs for each business need, ensuring resource isolation.
Input: Combine fan‑out data for computation.
Output: Send processed data to business queues for consumption.
Business filter: Generate a unique identifier (e.g., an MD5 fingerprint) for each point and perform global deduplication.
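The business filter's global deduplication can be sketched as follows. The fields used to build the fingerprint (device_id, event, ts) are illustrative assumptions, and the in‑process set stands in for what would be a shared, distributed store in production.

```python
import hashlib

seen = set()  # in production: a shared/distributed store, not process memory

def fingerprint(point: dict) -> str:
    """Build a stable unique identifier for a point.

    The key fields here are assumed for illustration; any combination
    that uniquely identifies a point works.
    """
    key = f"{point['device_id']}|{point['event']}|{point['ts']}"
    return hashlib.md5(key.encode()).hexdigest()

def accept(point: dict) -> bool:
    """Return True the first time a point is seen, False for duplicates."""
    fp = fingerprint(point)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```

Because the fingerprint is derived from the point's content, retries and fault‑recovery replays produce the same identifier and are filtered out, which is what turns at‑least‑once delivery into effectively exactly‑once output.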
3.2.2 Point‑Data SDK Reporting Optimization (Client‑Side Loss Reduction)
Client environments cause data loss, especially under high concurrency. The SDK now stores points locally and asynchronously uploads them. Optimizations include:
Increase reporting opportunities: periodic scheduled tasks, triggering alongside key business points, and threshold‑based batch sending.
Adjust batch size: Determine optimal number of points per message to maximize timely delivery.
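The SDK‑side buffering described above (threshold plus periodic flush) can be sketched as below. The batch_size and interval values are illustrative tuning knobs, not the SDK's real defaults, and the send callback stands in for the actual upload path.

```python
import time

class Reporter:
    """Buffers points locally and flushes on a size threshold or timer.

    A sketch of the SDK's store-locally, upload-asynchronously behavior;
    parameter defaults are illustrative, not the SDK's real values.
    """

    def __init__(self, send, batch_size=20, interval_s=30.0):
        self.send = send              # upload callback, e.g. an HTTP POST
        self.batch_size = batch_size  # tuning knob: points per message
        self.interval_s = interval_s  # periodic-flush interval
        self.buffer = []
        self.last_flush = time.monotonic()

    def track(self, point: dict) -> None:
        """Record a point; flush when the threshold or timer is hit."""
        self.buffer.append(point)
        if (len(self.buffer) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.interval_s):
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered and reset the timer."""
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()
```

Tuning batch_size is the trade‑off the text describes: larger batches reduce request overhead, but points sit in the buffer longer and are more exposed to crashes before upload.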
These client‑side improvements have increased log delivery (convergence) rates on both client platforms by over 2%.
4. Outlook
We have described the efforts made to ensure log data accuracy. Future work will continue to address risk points such as:
Disk failures causing data loss: Further strengthen persistence mechanisms based on the company's data durability capabilities.
We hope the log platform will keep evolving to provide reliable point‑data for business use.
Article translated by High‑Availability Architecture.