
Precise Watermark Design and Implementation in Baidu's Unified Streaming-Batch Data Warehouse

The article details Baidu's precise watermark design for its unified streaming‑batch data warehouse, describing how a centralized watermark server and client ensure end‑to‑end data completeness, align real‑time and batch windows with 99.9‑99.99% precision, and support accurate anti‑fraud calculations within the broader big‑data ecosystem.

Baidu Geek Talk

This article explores the application of precise watermark technology in Baidu's unified streaming-batch data warehouse. The author addresses critical technical challenges in real-time computing systems, including ensuring end-to-end data completeness (no data loss or duplication), aligning real-time and batch data windows (99.9%-99.99% alignment), supporting precise window calculations for real-time anti-fraud strategies, and integrating with Baidu's big data ecosystem.

The article begins by introducing the business background: the need for unified streaming-batch data warehouses to improve data timeliness while maintaining reliability comparable to offline systems. The authors explain fundamental concepts including Event Time (when user actions actually occur) and Processing Time (when the system processes data), and how Watermark addresses the challenge of determining when window calculations can be triggered in unbounded data streams.
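To make the watermark-triggering idea concrete, here is a minimal, self-contained sketch (not Baidu's implementation; class and method names are illustrative) of an event-time tumbling window that buffers events and fires only once the watermark guarantees no earlier events remain:

```python
from collections import defaultdict

class EventTimeWindower:
    """Buffers events into fixed event-time windows and fires a window
    only when the watermark guarantees no earlier events can arrive."""

    def __init__(self, window_size=60):
        self.window_size = window_size    # window length in seconds
        self.windows = defaultdict(list)  # window start -> buffered events

    def on_event(self, event_time, value):
        # Assign the event to its tumbling window by event time, not arrival time.
        start = (event_time // self.window_size) * self.window_size
        self.windows[start].append(value)

    def on_watermark(self, watermark):
        # Fire (and drop) every window whose end the watermark has passed.
        return [(start, self.windows.pop(start))
                for start in sorted(self.windows)
                if start + self.window_size <= watermark]

w = EventTimeWindower()
w.on_event(10, "a")   # window [0, 60)
w.on_event(65, "b")   # window [60, 120)
w.on_watermark(59)    # returns []: window [0, 60) may still receive events
w.on_watermark(60)    # returns [(0, ['a'])]: the watermark passed the window end
```

The key property is that late-but-not-too-late events are still counted, because a window is held open until the watermark, not the wall clock, moves past its end.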

Key technical content includes: (1) the definition and characteristics of the watermark, a monotonically increasing timestamp marking the oldest work not yet processed; (2) current implementations in systems such as Apache Flink, which use Periodic Watermarks and Punctuated Watermarks; (3) the design of a centralized watermark management system with a Watermark Server (maintaining a global watermark information table over the topology of Source, Operator, and Sinker nodes) and a Watermark Client (reporting and requesting watermark information across streaming operators).
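As a sketch of the centralized design described above (the class name and reporting protocol are assumptions, not the paper's code), the Watermark Server can be modeled as a registry that tracks the latest watermark reported by each node in the topology and exposes their minimum as the global low watermark:

```python
class WatermarkServer:
    """Centralized registry: each topology node (source, operator, sink)
    reports its own low watermark; the global low watermark is the
    minimum over all registered nodes."""

    def __init__(self, topology):
        # topology: iterable of node ids, e.g. ["source", "op-1", "sink"]
        self.watermarks = {node: None for node in topology}

    def report(self, node, watermark):
        # Watermarks are monotonically increasing, so ignore stale reports.
        current = self.watermarks[node]
        if current is None or watermark > current:
            self.watermarks[node] = watermark

    def global_low_watermark(self):
        # Undefined until every node in the topology has reported once.
        if any(wm is None for wm in self.watermarks.values()):
            return None
        return min(self.watermarks.values())

server = WatermarkServer(["source", "op-1", "sink"])
server.report("source", 105)
server.report("op-1", 100)
server.report("sink", 98)
server.global_low_watermark()  # returns 98, the slowest node's progress
```

Taking the minimum is what makes the global watermark a safe trigger: no node in the pipeline can still be holding data older than it.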

The precise watermark implementation rests on three preconditions: logs are produced in time-sorted order on the server side, each log package carries hostname and timestamp information, and logs are published point-to-point to message queues so that ordering is preserved within a single partition. The calculation uses a configurable precision (e.g., 99.9% or 99.99%) to balance data completeness against timeliness, computing the source-side output low watermark from the mapping of each log server to its log progress.
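The server-to-log-progress mapping can be sketched as follows (the function name and the exact quantile rule are assumptions; the article only states that precision trades completeness for timeliness): with precision p, the watermark is allowed to skip the slowest (1 - p) fraction of log servers rather than waiting for every straggler.

```python
import math

def precise_low_watermark(server_progress, precision=0.999):
    """Compute the source-side output low watermark from a mapping of
    log server -> latest log timestamp shipped by that server.

    With precision p, up to (1 - p) of the servers may lag behind the
    watermark, trading a small amount of completeness for timeliness.
    """
    progresses = sorted(server_progress.values())
    # Number of slowest servers the watermark is allowed to ignore.
    skip = math.floor(len(progresses) * (1 - precision))
    return progresses[skip]

# 1000 log servers, one of which is a straggler stuck at timestamp 0.
progress = {f"host-{i}": 1000 + i for i in range(1000)}
progress["host-0"] = 0
precise_low_watermark(progress, precision=1.0)    # returns 0: wait for everyone
precise_low_watermark(progress, precision=0.999)  # returns 1001: skips the straggler
```

This illustrates why 100% precision is impractical at scale: a single slow host would pin the watermark, and with it every downstream window, indefinitely.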

For watermark propagation between systems, the approach involves point-to-point log publishing at the upstream, embedding the global watermark in each log record, parsing the watermark information at the downstream source, and triggering window calculations based on the Watermark Server's global low watermark. Practical results show a data diff between the real-time and offline systems of less than 0.1% under normal conditions, and roughly 0.11%-0.12% even with source delays of up to 0.1%.
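The embed-and-parse steps above can be sketched as a pair of helpers (the record format and the `_watermark` field name are illustrative assumptions, not the article's wire format): the upstream attaches the current global low watermark to every outgoing record, and the downstream source splits it back out so its own Watermark Client can advance.

```python
import json

def embed_watermark(record, global_watermark):
    """Upstream: attach the current global low watermark to an outgoing
    log record before publishing it to the message queue."""
    record = dict(record)  # avoid mutating the caller's record
    record["_watermark"] = global_watermark
    return json.dumps(record)

def parse_record(raw):
    """Downstream source: separate the payload from the embedded
    watermark; the watermark feeds the local Watermark Client, which
    reports it so windows can fire on the global low watermark."""
    record = json.loads(raw)
    watermark = record.pop("_watermark")
    return record, watermark

raw = embed_watermark({"user": "u1", "ts": 1700000000}, 1699999990)
payload, wm = parse_record(raw)  # payload without the field, plus 1699999990
```

Piggybacking the watermark on every record keeps the two systems aligned without a separate control channel, at the cost of a few extra bytes per record.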

Big Data · stream processing · Watermark · Apache Flink · Data Warehouse · real-time computing · Baidu · event time · Low Watermark · Window Computation
Written by

Baidu Geek Talk
