
Ensuring No Duplicate and No Loss in Baidu Log Middle Platform: Architecture, Challenges, and Solutions

This article explains the design, implementation, and future plans of Baidu's log middle platform, detailing its lifecycle management, service architecture, data reliability goals of eliminating duplication and loss, and the technical measures taken across SDKs, servers, and streaming pipelines to achieve near‑100% data integrity.

Top Architect

1 Overview

The Baidu log middle platform provides a one‑stop service for the entire lifecycle of logging data, enabling quick collection, transmission, management, and analysis for product operation, performance monitoring, and operational management scenarios.

1.1 Platform Positioning

The platform offers end‑to‑end log data management, allowing developers to integrate logging with minimal effort and supporting downstream analytics, performance tracking, and operational insights.

1.2 Integration Status

Coverage: Almost all internal apps, mini‑programs, and incubated products are integrated.

Scale: Billions of log entries per day, peak QPS in the millions, service stability 99.9995%.

1.3 Terminology

Client: Software running on user devices (e.g., Baidu APP, mini‑programs).

Server: Backend services handling client requests.

Log Middle Platform: End‑to‑end logging solution including SDKs, servers, and management consoles.

Logging SDK: Collects, packages, and reports logs from various client environments.

Logging Server: Core log ingestion service.

Feature/Model Service: Entry point for downstream recommendation systems.

1.4 Service Panorama

The platform consists of a foundation layer, management platform, business data applications, and product support. In June 2021, Baidu released a client log reporting specification.

2 Core Goals

The platform must guarantee data accuracy, which is broken down into two requirements: no duplication ("no repeat") and no loss ("no drop"). Achieving near‑100% compliance requires addressing multiple challenges across the data pipeline.

2.1 Architecture

Log data flows from client production through ingestion, persistence, streaming, and finally to downstream real‑time or offline consumers.

2.2 Problems

Monolithic logging server with tightly coupled functions and many fan‑out streams.

Direct message‑queue integration risks data loss and cannot meet strict no‑duplicate/no‑loss requirements.

Lack of business tier separation leads to mutual impact between core and non‑core services.

3 Implementation of No Duplicate and No Loss

3.1 Theory of No Data Loss

Data loss can occur at the client (environment issues), ingestion layer (server failures), and computation layer (stream processing). Ensuring end‑to‑end reliability requires persistent storage before business processing and careful stream design.

3.1.1 Logging Server Optimizations

Prioritize persistence to reduce loss caused by server failures.

Decompose the monolithic service into lightweight components.

Design flexible streaming pipelines that support both strict no‑loss real‑time streams and high‑throughput, slightly tolerant streams.

3.1.1.1 Persistent First

Persist logs at the ingestion layer before any business logic.

Use local disk persistence plus Minos forwarding (Baidu's internal data transport) to achieve minute‑level latency while minimizing loss.

3.1.1.2 Service Decomposition & Function Offloading

Separate real‑time, high‑throughput, and other business streams into distinct services, isolate resources, and apply appropriate QoS policies.
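One way to picture this separation is a routing table that maps each stream class to its own isolated service endpoint and QoS policy, so a slow non-core consumer cannot back up the strict no-loss real-time path. All names, endpoints, and policy fields below are illustrative assumptions, not Baidu's configuration.

```python
# Hypothetical routing table: each stream class gets a dedicated service
# and its own QoS policy (ack semantics, retry budget). Values are
# illustrative only.
QOS_POLICIES = {
    "realtime":        {"endpoint": "logsrv-rt",   "ack": "after_fsync",   "retry": 3},
    "high_throughput": {"endpoint": "logsrv-bulk", "ack": "after_enqueue", "retry": 1},
    "non_core":        {"endpoint": "logsrv-misc", "ack": "best_effort",   "retry": 0},
}

def route(log_point: dict) -> dict:
    """Pick the isolated service and policy for a log point by class,
    falling back to the non-core (best-effort) tier."""
    cls = log_point.get("class", "non_core")
    return QOS_POLICIES.get(cls, QOS_POLICIES["non_core"])
```

The design point is that the policy travels with the class: tightening retry or ack behavior for the real-time tier never changes resource usage in the bulk tier.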

3.1.2 Stream Processing Design

Logging server forwards real‑time streams to dedicated message queues.

Flow splitting directs low‑QPS points to individual queues and aggregates higher‑QPS points.

Business flows can deploy isolated jobs for custom processing.

Global deduplication is performed at the business filter stage using a unique identifier per entry (e.g., an MD5 digest of its contents).
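The dedup step in the last bullet can be sketched as a filter that digests each entry's unique identifier and drops anything it has already seen. This is a simplified in-process version under assumed names; a production filter would back the seen-set with a shared store (e.g., Redis with a TTL) so all job instances deduplicate against the same state.

```python
import hashlib

class DedupFilter:
    """Sketch of business-stage global dedup: an MD5 digest of each
    entry's unique identifier is checked against a seen-set before the
    entry is emitted downstream. In-memory here for illustration only."""

    def __init__(self):
        self._seen = set()

    def accept(self, entry_id: str) -> bool:
        digest = hashlib.md5(entry_id.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False       # duplicate: drop
        self._seen.add(digest)
        return True            # first sighting: pass downstream
```

Because upstream retries are what guarantee no-loss, duplicates are expected by design; this filter is the piece that restores the no-duplicate half of the contract.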

3.2 SDK Reporting Optimizations

To mitigate client‑side loss, logs are cached locally and sent asynchronously. Optimizations include adding reporting triggers (timers, threshold‑based, and on‑business events) and adjusting batch sizes for efficient transmission.

These improvements increased the overall data convergence rate (the share of produced logs that successfully arrive downstream) by more than 2%.
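The cache-then-report pattern described above can be sketched as a buffer with the three triggers the article lists: a count threshold, a timer interval, and an explicit business-event flush. The class, thresholds, and synchronous send below are illustrative assumptions; a real SDK would persist the buffer and upload asynchronously.

```python
import time

class LogReporter:
    """Sketch of client-side cache-then-report: entries are buffered
    locally and flushed when any trigger fires -- count threshold,
    timer interval, or an explicit business-event flush. Thresholds
    are illustrative, not Baidu's actual values."""

    def __init__(self, send, max_batch=50, max_interval_s=10.0):
        self._send = send                  # upload callback (async in a real SDK)
        self._buf = []
        self._max_batch = max_batch
        self._max_interval = max_interval_s
        self._last_flush = time.monotonic()

    def log(self, entry):
        self._buf.append(entry)
        if len(self._buf) >= self._max_batch:                      # threshold trigger
            self.flush()
        elif time.monotonic() - self._last_flush >= self._max_interval:
            self.flush()                                           # timer trigger

    def flush(self):                                               # business-event trigger
        if self._buf:
            self._send(self._buf[:])
            self._buf.clear()
        self._last_flush = time.monotonic()
```

Batching this way trades a bounded delay (at most one timer interval) for fewer, larger requests, which is what makes reporting cheap enough to run continuously on user devices.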

4 Outlook

Future work will focus on eliminating disk‑failure‑induced loss, further strengthening persistence mechanisms, and continuously enhancing the platform to provide reliable, accurate logging data for business decision‑making.


Tags: backend architecture, big data, stream processing, data reliability, log platform, logging SDK
Written by Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
