
Advertising Data Lake Architecture and Real-time Optimizations

By replacing the costly Lambda architecture with a unified data lake built on Iceberg and Flink CDC, the advertising team achieved minute-level latency, strong consistency, and lower storage costs, cutting end-to-end processing times from hours to a few minutes across budgeting, warehousing, OLAP, and ETL workloads.

iQIYI Technical Product Team

Advertising data, covering performance (effect) ads, brand ads, and ADX logs, generates a large volume of log data across the request and delivery chains. After processing, these logs feed algorithm model training, operational analysis, and delivery decisions. The business demands high timeliness, accuracy, and query performance, yet the current Lambda architecture, which combines separate offline and real-time pipelines, incurs high cost and data inconsistency.

To address these challenges, the advertising data team collaborated with iQIYI's big‑data team to investigate frontier data‑lake technologies. Data lakes provide massive storage, near‑real‑time latency, and interactive query efficiency, making them a good fit for advertising scenarios. The team conducted a series of experiments across different business needs.

The existing Lambda architecture consists of:

Real‑time pipeline: Spark Streaming consumes Kafka streams and writes to Kudu, deployed on a dedicated OLAP cluster, retaining only the most recent 7 days.

Offline pipeline: The latest 90 days of Hive data from a shared cluster are synchronized to independent OLAP Hive tables.

Query layer: Impala automatically merges results from Kudu and Hive based on data progress.

This design suffers from several drawbacks: multiple frameworks increase development and operational cost; offline sync introduces significant latency; the real‑time path heavily depends on Kudu, risking end‑to‑end consistency; and separate OLAP clusters cause data redundancy and extra storage expense.
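The Kudu/Hive merge performed by the query layer can be sketched as follows. This is a conceptual simulation, not Impala's actual internals; the function and field names are invented for illustration. Rows newer than the offline sync progress come from Kudu, older rows from Hive, so no day is read twice:

```python
def merged_query(kudu_rows, hive_rows, sync_day):
    """Sketch of the Lambda query layer's merge: the offline sync
    progress (sync_day) is the split point between the two stores."""
    historical = [r for r in hive_rows if r["day"] <= sync_day]  # offline path
    recent = [r for r in kudu_rows if r["day"] > sync_day]       # real-time path
    return historical + recent
```

The fragility is visible even in this toy: if the offline sync lags or the progress marker is wrong, days are dropped or double-counted, which is exactly the consistency risk the data lake removes by serving both paths from one table.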

Data‑lake advantages that mitigate these issues include:

Near‑real‑time writes (minute‑level latency) due to frequent commit intervals.

Unified storage for streaming and batch, eliminating the need for two heterogeneous storage systems.

Strong consistency with Exactly‑Once semantics for updates.

Lower cost by reusing existing HDFS‑scale storage.

Advertising Data Lake Applications

Real‑time Business Data Retrieval: Budget information resides in MySQL and is pulled via Sqoop, resulting in >1 hour latency. By employing Flink CDC to capture MySQL binlog and writing updates to Iceberg v2 (which supports UPDATE), the budget pipeline achieves near‑real‑time visibility. Initial issues with excessive small files were solved by adding bucket partitioning and configuring a hash‑based distributed write mode, which shuffles data by partition before committing.
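Why the hash-distributed write mode curbs small files can be shown with a toy model (the writer/bucket counts and helper names below are invented for illustration, not the Flink/Iceberg implementation). Without a shuffle, every parallel writer that receives rows for a bucket opens its own file per commit; with a hash shuffle, each bucket is owned by exactly one writer:

```python
NUM_BUCKETS = 4   # bucket partition count on the Iceberg table
NUM_WRITERS = 8   # parallel Flink writer tasks

def bucket_of(key: int) -> int:
    """Assign a row key to a bucket partition."""
    return hash(key) % NUM_BUCKETS

def files_without_shuffle(keys):
    # Rows reach writers round-robin; every (writer, bucket) pair
    # that occurs produces its own data file per commit.
    files = set()
    for i, key in enumerate(keys):
        writer = i % NUM_WRITERS
        files.add((writer, bucket_of(key)))
    return len(files)

def files_with_shuffle(keys):
    # Hash-distribute first: all rows of a bucket land on one writer,
    # so each commit writes at most one file per bucket.
    return len({bucket_of(key) for key in keys})
```

With frequent (minute-level) commits, the un-shuffled path multiplies its file count on every commit, which is what made compaction necessary before the distribution mode was configured.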

Real‑time Data Warehouse: Incremental reads from Iceberg form the core of a real‑time warehouse, and feasibility was validated on large‑scale inventory data. Intermediate results were previously stored in Kafka, whose short retention hampers debugging. Iceberg's efficient storage preserves intermediate data for traceability and re‑processing, keeping latency under 5 minutes. Small‑file problems are addressed with bucket partitioning, snapshot limits, and checkpoint management.
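The incremental-read pattern that chains these warehouse layers together can be sketched as below. This is a simplified stand-in for Iceberg's snapshot-based incremental scan (the data structures are invented; a real reader works from table metadata, not an in-memory list):

```python
def incremental_read(snapshots, from_id, to_id):
    """Return only rows appended between two snapshots
    (exclusive of from_id, inclusive of to_id), mimicking an
    incremental scan consumed by the downstream streaming job."""
    rows = []
    for snap_id, batch in snapshots:
        if from_id < snap_id <= to_id:
            rows.extend(batch)
    return rows
```

Because every layer's input is a durable table rather than a Kafka topic, a downstream job can be re-run from any earlier snapshot pair for debugging or backfill, which is the traceability benefit described above.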

Real‑time OLAP Analysis: The original setup writes real‑time data to Kudu and offline data to an OLAP Hive cluster, queried by Impala. To relieve Kudu's load and improve freshness, both streams are now written to Iceberg. Using the Qilin hourly report as an example, Flink CDC syncs MySQL dimension tables to Redis, while traffic logs join the Redis dimensions asynchronously, producing ODS tables that flow into Kafka and finally into Iceberg with shuffle and bucket partitioning. This reduces end‑to‑end latency from 2‑3 hours to 3‑4 minutes.
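The dimension-join step can be illustrated with a minimal sketch, using a plain dict in place of Redis and invented field names (`advertiser_id`, `advertiser_name` are illustrative, not the actual schema). The key design point is that a cache miss degrades to a default value rather than blocking the stream:

```python
# Stands in for Redis; kept fresh by the Flink CDC sync of MySQL tables.
dim_table = {101: "Brand-A", 102: "Brand-B"}

def enrich(log, dims, default="UNKNOWN"):
    """Join one traffic log with its advertiser dimension row.
    A miss (e.g. a freshly created advertiser whose CDC update has
    not landed yet) falls back to a default instead of stalling."""
    out = dict(log)
    out["advertiser_name"] = dims.get(log["advertiser_id"], default)
    return out
```

In the real pipeline this lookup is asynchronous so that Redis round-trips overlap with processing; the fallback behavior shown here is what keeps per-record latency bounded.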

Real‑time ETL Data Landing: Tracking logs are split into billing and traffic streams. The traffic stream is stored in HBase (tens of TB) for long‑term retention; the billing stream is joined with the traffic data in Flink, handling out‑of‑order events with up to three retry attempts, which adds ~10 minutes of delay. The joined result is written to Iceberg, with a small‑file merging strategy applied. The pipeline now achieves a >99% join success rate and reduces latency to a few minutes, enabling unified stream‑batch computation.
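The retry-based handling of out-of-order events can be sketched as a small simulation (a dict stands in for the HBase traffic store, and the queue/retry mechanics are invented for illustration; a Flink job would use state and timers instead). A billing event whose traffic record has not arrived yet is re-queued up to three times before being declared failed:

```python
from collections import deque

MAX_RETRIES = 3

def join_with_retries(billing_events, traffic_arrivals):
    """Join billing events against an incrementally filled traffic
    store, re-queuing a billing event when its traffic record is late."""
    traffic = {}                                   # stands in for HBase
    arrivals = iter(traffic_arrivals)
    pending = deque((e, 0) for e in billing_events)
    joined, failed = [], []
    while pending:
        # A new traffic record may land between retries (late data).
        rec = next(arrivals, None)
        if rec is not None:
            traffic[rec["id"]] = rec
        event, tries = pending.popleft()
        if event["id"] in traffic:
            joined.append({**event, **traffic[event["id"]]})
        elif tries < MAX_RETRIES:
            pending.append((event, tries + 1))     # wait for late traffic
        else:
            failed.append(event)                   # give up after 3 tries
    return joined, failed
```

Each retry round corresponds to the delay budget mentioned above: the pipeline trades roughly ten minutes of waiting for late traffic records against the >99% join success rate.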

Future Outlook

The data lake will drive a unified stream‑batch transformation for advertising data, eliminating duplicated offline and real‑time code paths and reducing development and maintenance costs. Progress monitoring will shift from coarse minute‑level checks to watermark‑based data‑progress detection, addressing small‑file and metadata latency issues. Further exploration includes federated queries and Flink Table Store to broaden the data lake's applicability across more scenarios.
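The watermark-based progress detection mentioned above reduces to a simple invariant, sketched here with invented names: a table's data progress is the minimum of the per-partition maximum event times, because only up to that timestamp is every partition guaranteed complete.

```python
def table_watermark(partition_max_event_times):
    """Data progress = min over partitions of the max event time seen:
    all events at or before this timestamp have been fully ingested."""
    return min(partition_max_event_times.values())
```

A downstream consumer that triggers on this watermark instead of wall-clock minute boundaries never reads a partially written interval, which is what removes the coarse minute-level checks.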
