Ensuring No Duplicate and No Loss in Baidu Log Middle Platform: Architecture, Challenges, and Solutions
This article explains the design, implementation, and future plans of Baidu's log middle platform. It covers the platform's lifecycle management and service architecture, its data reliability goals of eliminating duplication and loss, and the technical measures taken across SDKs, servers, and streaming pipelines to achieve near‑100% data integrity.
1 Overview
The Baidu log middle platform provides a one‑stop service for the entire lifecycle of logging data, enabling quick collection, transmission, management, and analysis for product operation, performance monitoring, and operational management scenarios.
1.1 Platform Positioning
The platform offers end‑to‑end log data management, allowing developers to integrate logging with minimal effort and supporting downstream analytics, performance tracking, and operational insights.
1.2 Integration Status
Coverage: Almost all internal apps, mini‑programs, and incubated products are integrated.
Scale: Billions of log entries per day, peak QPS in the millions, and service availability of 99.9995%.
1.3 Terminology
Client: Software running on user devices (e.g., Baidu APP, mini‑programs).
Server: Backend services handling client requests.
Log Middle Platform: End‑to‑end logging solution including SDKs, servers, and management consoles.
Logging SDK: Collects, packages, and reports logs from various client environments.
Logging Server: Core log ingestion service.
Feature/Model Service: Entry point for downstream recommendation systems.
1.4 Service Panorama
The platform consists of a foundation layer, management platform, business data applications, and product support. In June 2021, Baidu released a client log reporting specification.
2 Core Goals
The platform must guarantee data accuracy, which breaks down into two requirements: no duplication and no loss. Achieving near‑100% compliance requires addressing challenges at every stage of the data pipeline.
2.1 Architecture
Log data flows from client production through ingestion, persistence, streaming, and finally to downstream real‑time or offline consumers.
2.2 Problems
Monolithic logging server with tightly coupled functions and many fan‑out streams.
Direct message‑queue integration risks data loss and cannot meet strict no‑duplicate/no‑loss requirements.
Lack of business tier separation leads to mutual impact between core and non‑core services.
3 Implementation of No Duplicate and No Loss
3.1 Theory of No Data Loss
Data loss can occur at the client (environment issues), ingestion layer (server failures), and computation layer (stream processing). Ensuring end‑to‑end reliability requires persistent storage before business processing and careful stream design.
3.1.1 Logging Server Optimizations
Prioritize persistence to reduce loss caused by server failures.
Decompose the monolithic service into lightweight components.
Design flexible streaming pipelines that support both strict no‑loss real‑time streams and high‑throughput streams that tolerate small, bounded loss.
3.1.1.1 Persistent First
Persist logs at the ingestion layer before any business logic.
Use disk‑plus‑Minos forwarding to achieve minute‑level latency while minimizing loss.
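The persist‑first principle can be sketched as follows. This is a minimal illustration, not Baidu's actual implementation: the class name, spool‑file format, and response shape are all assumptions, and a real ingestion layer would forward the spooled data asynchronously (e.g. via Minos) rather than re‑reading it in process.

```python
import json
import os


class PersistFirstIngester:
    """Sketch of persist-first ingestion: every batch is appended to a
    local spool file and fsync'd to disk BEFORE the client is
    acknowledged, so a server crash after the ack cannot lose data."""

    def __init__(self, spool_path):
        self.spool_path = spool_path
        self.fh = open(spool_path, "a", encoding="utf-8")

    def ingest(self, batch):
        # Step 1: persist before any business logic runs.
        for record in batch:
            self.fh.write(json.dumps(record) + "\n")
        self.fh.flush()
        os.fsync(self.fh.fileno())  # durable on disk before the ack
        # Step 2: only now acknowledge receipt to the client.
        return {"status": "ok", "count": len(batch)}

    def drain(self):
        # Stand-in for the asynchronous forwarder, which in the real
        # system tails the spool and ships records downstream with
        # minute-level latency.
        with open(self.spool_path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

The key ordering is persist, then acknowledge; business processing and fan‑out happen strictly after durability is established.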
3.1.1.2 Service Decomposition & Function Offloading
Separate real‑time, high‑throughput, and other business streams into distinct services, isolate resources, and apply appropriate QoS policies.
3.1.2 Stream Processing Design
Logging server forwards real‑time streams to dedicated message queues.
Flow splitting directs low‑QPS points to individual queues and aggregates higher‑QPS points.
Business flows can deploy isolated jobs for custom processing.
Global deduplication is performed using unique identifiers (e.g., MD5) at the business filter stage.
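The MD5‑based deduplication step can be sketched as below. The identity fields (`device_id`, `event`, `ts`) are illustrative assumptions, and the in‑memory set stands in for what a production filter would keep in shared state (e.g. Redis or a stream engine's keyed state).

```python
import hashlib
import json


def record_key(record):
    """Derive a stable unique identifier (MD5 over identity fields)
    for deduplication. Field names here are illustrative."""
    identity = (record["device_id"], record["event"], record["ts"])
    return hashlib.md5(json.dumps(identity).encode("utf-8")).hexdigest()


class DedupFilter:
    """Sketch of the business-side dedup filter: drop any record whose
    identifier has been seen before, pass the rest through."""

    def __init__(self):
        self.seen = set()

    def accept(self, record):
        key = record_key(record)
        if key in self.seen:
            return False  # duplicate: drop
        self.seen.add(key)
        return True
```

Because the key is derived deterministically from the record's identity fields, retried or replayed records map to the same identifier and are filtered out regardless of which upstream path delivered them.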
3.2 SDK Reporting Optimizations
To mitigate client‑side loss, logs are cached locally and sent asynchronously. Optimizations include additional reporting triggers (timer‑based, threshold‑based, and business‑event‑based) and tuned batch sizes for efficient transmission.
These improvements increased the overall data convergence rate by over 2%.
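The trigger design described above can be sketched as a buffered reporter. This is an assumption‑laden illustration: the class name, the threshold and timer values, and the `send` transport callback are all hypothetical, not the SDK's actual API.

```python
import time


class LogReporter:
    """Sketch of client-side buffered reporting: logs are cached
    locally and flushed when any trigger fires -- a batch-size
    threshold, an elapsed-time timer, or an explicit business event
    (e.g. the app moving to the background)."""

    def __init__(self, send, max_batch=50, max_age_s=30.0):
        self.send = send              # transport callback (assumed)
        self.buffer = []
        self.max_batch = max_batch    # illustrative threshold
        self.max_age_s = max_age_s    # illustrative timer interval
        self.last_flush = time.monotonic()

    def log(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch:  # threshold trigger
            self.flush()

    def tick(self):
        # Timer trigger: call periodically from the app's run loop.
        if self.buffer and time.monotonic() - self.last_flush >= self.max_age_s:
            self.flush()

    def on_business_event(self):
        # Business-event trigger: flush immediately.
        self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
        self.last_flush = time.monotonic()
```

Batching amortizes transmission cost, while the timer and business‑event triggers bound how long a record can sit in the local cache before reaching the server.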
4 Outlook
Future work will focus on eliminating disk‑failure‑induced loss, further strengthening persistence mechanisms, and continuously enhancing the platform to provide reliable, accurate logging data for business decision‑making.