
Cost‑Effective Real‑Time Data Warehouse 2.0: Migrating from Kafka to Iceberg

iQIYI transformed its real‑time data warehouse by replacing a costly Kafka‑based Lambda stack with a unified stream‑batch Iceberg lake, cutting storage expenses by 90%, halving compute costs, extending data retention, and delivering minute‑level freshness for 90% of use cases while preserving second‑level processing where needed.

iQIYI Technical Product Team

In iQIYI's pan‑entertainment ecosystem, data is the core engine driving business growth. Real‑time data requirements have permeated the entire chain—from video playback and membership operations to ad recommendation—requiring sub‑minute feedback (e.g., user click events must be reflected in recommendation models within one minute). As the business scale grew, the second‑level real‑time warehouse built on Kafka faced significant challenges, especially high storage and compute costs.

Existing Architecture Cost Issues

The early real‑time warehouse used a Lambda architecture with Kafka + Flink for second‑level processing, but its cost was more than ten times that of an Iceberg‑based storage solution. Specific problems included:

Storage limitation: Kafka retains data only for a few hours with multiple replicas, making it unsuitable for long‑term storage and historical analysis.

Resource waste: Parallel real‑time and offline pipelines duplicate computation, especially during data cleaning and processing.

Over‑designed latency: Most scenarios only need minute‑level delay, making second‑level processing excessive.

Technical Transformation for Cost Reduction and Efficiency

To address these issues, a unified stream‑batch architecture was introduced, centering on an Iceberg data lake to achieve minute‑level real‑time warehousing. The transformation brought:

Storage cost reduced by 90%: Iceberg's compressed columnar storage on HDFS costs roughly one-tenth as much as Kafka's multi-replica log storage.

Tiered latency: Seconds‑level scenarios still use Kafka; minute‑level scenarios migrate to Iceberg, allocating resources on demand.

Enhanced data governance: Large, monolithic streams are split into thematic micro‑streams, which can be recombined to satisfy diverse business needs.

This shift cut daily PB‑scale processing costs by 60% and upgraded data freshness from hour‑ or day‑level to minute‑level near‑real‑time, dramatically improving business response speed and decision efficiency.
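The split-and-recombine idea behind the thematic micro-streams can be sketched in plain Python. The theme names and routing rules below are illustrative, not iQIYI's actual configuration:

```python
# Sketch: split one monolithic event stream into thematic micro-streams,
# then recombine selected themes for a downstream consumer.
# Event types and theme names are hypothetical.
from collections import defaultdict

ROUTING = {
    "play_start": "playback",
    "play_heartbeat": "playback",
    "ad_impression": "ads",
    "member_renew": "membership",
}

def split_stream(events):
    """Route each raw event to its thematic micro-stream."""
    streams = defaultdict(list)
    for ev in events:
        theme = ROUTING.get(ev["type"], "misc")
        streams[theme].append(ev)
    return streams

def recombine(streams, themes):
    """Rebuild a purpose-specific stream from selected themes."""
    return [ev for t in themes for ev in streams.get(t, [])]

events = [
    {"type": "play_start", "uid": 1},
    {"type": "ad_impression", "uid": 1},
    {"type": "member_renew", "uid": 2},
]
streams = split_stream(events)
ads_and_playback = recombine(streams, ["playback", "ads"])
```

A consumer that only needs playback and ad data subscribes to those two themes instead of the full firehose, which is what allows resources to be allocated on demand.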

Kafka vs Iceberg Comparison

| Feature | Kafka | Iceberg |
| --- | --- | --- |
| Latency | Second-level | Minute-level (configurable to 1 min) |
| Storage cost | High (multiple replicas) | Low (HDFS + compression) |
| Data governance | Weak (message queue only) | Strong (schema management, version control) |
| Compute engine support | Flink/Spark Streaming | Flink/Spark/Trino |
| Applicable scenarios | Second-level real-time (e.g., fraud detection) | Minute-level analytics, historical backtrack |
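The tiering in the table can be reduced to a simple routing rule: workloads with sub-minute SLAs stay on Kafka, everything else moves to Iceberg. A minimal sketch, with a hypothetical workload catalogue:

```python
# Sketch: pick the storage tier for each workload by its latency SLA.
# Workload names and SLA values are illustrative.
WORKLOADS = [
    ("fraud_detection", 1),          # seconds-level SLA
    ("recommendation_feedback", 30),
    ("playback_dashboard", 120),
    ("historical_backtrack", 3600),
]

def choose_sink(sla_seconds):
    """Second-level SLAs stay on Kafka; minute-level and above go to Iceberg."""
    return "kafka" if sla_seconds < 60 else "iceberg"

plan = {name: choose_sink(sla) for name, sla in WORKLOADS}
```

Under this rule, only the genuinely second-level workloads keep paying Kafka's storage premium, which is how the article's "90% of use cases at minute-level freshness" figure translates into savings.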

Real‑Time Warehouse 2.0 Architecture

The DWD layer tables were migrated to Iceberg, and data processing was refactored into Flink jobs. Migration steps included:

1. Switch core data (the playback business) to the new pipeline.

2. Abstract offline parsing logic into a unified Pingback SDK, enabling consistent deployment for both real-time and offline paths.

3. Run dual pipelines for two months to compare and monitor data consistency.

4. After validation, perform a seamless cut-over to the new architecture.
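The dual-run comparison in step 3 amounts to checking that both pipelines emit the same records per partition. One way to sketch it, with hypothetical field names and an order-insensitive digest:

```python
# Sketch: compare record counts and a per-partition checksum between the
# old (Kafka) and new (Iceberg) pipeline outputs during the dual run.
# Row fields are hypothetical.
import hashlib

def digest(rows):
    """Order-insensitive digest of one partition's output."""
    h = 0
    for row in rows:
        line = f"{row['uid']}|{row['metric']}|{row['value']}"
        # XOR of per-row hashes makes the digest independent of row order.
        h ^= int(hashlib.sha256(line.encode()).hexdigest(), 16)
    return h

def consistent(old_rows, new_rows):
    """Both pipelines agree on this partition iff counts and digests match."""
    return len(old_rows) == len(new_rows) and digest(old_rows) == digest(new_rows)

old = [{"uid": 1, "metric": "play_sec", "value": 42}]
new = [{"uid": 1, "metric": "play_sec", "value": 42}]
ok = consistent(old, new)
```

In practice such a check would run per partition per hour, so a discrepancy can be localized before the cut-over rather than discovered afterwards.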

Post‑migration benefits:

95% storage resource savings and 50% compute resource savings while maintaining data safety.

Writing output to Iceberg simplifies SQL development, since the tables are queried the same way as Hive tables.

Data retention extended from a few hours to a month or longer.

Cost Optimization Results

Storage cost reduced by 90% (Iceberg costs 1/10 of Kafka).

Compute resources reused: the same data serves both Flink streaming and Spark batch, boosting cluster utilization by ~40%.

Long‑term data retention eliminates repeated ETL caused by Kafka data expiration.

Overall effect: millions of yuan saved annually.
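A back-of-the-envelope check of the ratios claimed above; the absolute baseline figure is arbitrary, only the ratios come from the article:

```python
# Sketch: sanity-check the claimed savings ratios.
KAFKA_STORAGE_COST = 100.0                        # arbitrary baseline units
ICEBERG_STORAGE_COST = KAFKA_STORAGE_COST / 10    # "1/10 of Kafka"

storage_saving = 1 - ICEBERG_STORAGE_COST / KAFKA_STORAGE_COST
compute_saving = 0.5                              # "50% compute resource savings"

print(f"storage saving: {storage_saving:.0%}")    # storage saving: 90%
print(f"compute saving: {compute_saving:.0%}")    # compute saving: 50%
```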

Timeliness Improvements

Previously, highly time-sensitive requirements (e.g., heartbeat-based playback timing) were constrained by cost and suffered minute-level delays. The upgraded architecture introduces a tiered timeliness design:

Second‑level: Kafka continues to support recommendation, user‑generated content, and ad scenarios.

Minute‑level: Iceberg supports 1‑5 minute latency, covering 90% of real‑time metrics.

Heartbeat timing: Flink + Iceberg incremental processing enables per‑second writes, achieving a ten‑fold speedup in playback duration feedback.
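The heartbeat case can be sketched as an incremental fold over micro-batches of heartbeat events, standing in for the Flink + Iceberg incremental job. The event shape and heartbeat interval below are assumptions:

```python
# Sketch: incrementally accumulate playback duration from per-second
# heartbeat events, one micro-batch at a time.
# Event shape and interval are hypothetical.
HEARTBEAT_INTERVAL_SEC = 1  # one heartbeat per second of playback

def update_durations(durations, heartbeats):
    """Fold a micro-batch of heartbeats into running per-user durations."""
    for hb in heartbeats:
        durations[hb["uid"]] = durations.get(hb["uid"], 0) + HEARTBEAT_INTERVAL_SEC
    return durations

durations = {}
for batch in ([{"uid": 1}, {"uid": 2}], [{"uid": 1}]):
    durations = update_durations(durations, batch)
```

Because each micro-batch only touches the users it contains, the running totals are available as soon as the batch is written, rather than after an end-of-session batch job.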

Business Value Release

Switching playback duration reporting from batch to heartbeat flow enables faster user behavior feedback, supporting real‑time updates of user tags, feature vectors, and recommendation models, and broadening operational flexibility.

Future Plans

Real‑time Warehouse 2.0 will continue to promote minute‑level data sources downstream, further optimize architecture, expand data sources, and enrich analysis scenarios to provide stronger data support for business growth.

Tags: Flink, Streaming, Kafka, Cost Optimization, Real-Time Data Warehouse, Spark, Iceberg