
Real-Time Data Warehouse at iQIYI Video Production Using Spark and ClickHouse

To handle video production data that arrives at thousands of QPS, reaches petabyte scale, is updated frequently, and requires joins between large tables, the iQIYI team built a Spark‑plus‑ClickHouse real‑time warehouse. It streams changes from Kafka, joins them with dimension data in HBase, and writes the result to ClickHouse, which cut report development time from days to hours while supporting both offline and real‑time analytics.

iQIYI Technical Product Team

Apr 9, 2021


iQIYI’s video production generates real‑time data at thousands of QPS and stores data at petabyte scale, which makes ad‑hoc queries and multi‑table joins a major challenge.

The main pain points are:

1) Real‑time reporting requirements were urgent and lacked a solution.

2) Production data is updated frequently, so the OLAP store must support updates.

3) Large tables (hundreds of GB each) holding stream attributes and program attributes need to be joined for analysis.

Production data originates from an OLTP data platform, persisted in MongoDB, with changes streamed through Kafka (curData = current update, oriData = historical change). This structured change log enables configuration‑driven development.
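
As an illustration of how such a change record might be consumed, the sketch below parses one message and lists the fields that changed. The JSON layout (top‑level `curData` and `oriData` objects, with `oriData` read as the state before the change) and the field names `id`, `title`, and `status` are assumptions made for the example; the article names only the two keys.

```python
import json

def changed_fields(message: str) -> dict:
    """Return {field: (old, new)} for fields that differ between oriData and curData."""
    record = json.loads(message)
    cur = record.get("curData") or {}
    ori = record.get("oriData") or {}
    diff = {}
    for key in sorted(set(cur) | set(ori)):
        old, new = ori.get(key), cur.get(key)
        if old != new:
            diff[key] = (old, new)
    return diff

# Hypothetical change message; real payloads will carry different fields.
sample = json.dumps({
    "curData": {"id": 1, "title": "Episode 1", "status": "published"},
    "oriData": {"id": 1, "title": "Episode 1", "status": "draft"},
})
print(changed_fields(sample))
```

Because every message has the same envelope, a generic consumer like this can be driven by configuration (which fields to keep, which table to write) rather than by per‑report code.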

The iQIYI video production team builds a middle‑platform for assets, editing, operations, and images, and provides monitoring and reporting to improve efficiency.

To address the above issues, the team adopted a Spark + ClickHouse real‑time data warehouse. Business data is processed by Spark/Spark Streaming, dimension data is stored in HBase, real‑time joins are performed, and the result is written to ClickHouse for ad‑hoc queries.

Results: report development time shrank from days to hours, and the warehouse supports frequent updates as well as joined reports in both real‑time and offline modes.

Background and Development History

Early stage 1: The team used an internal tool, BabelX, with Hive for offline reporting. This approach incurred high cost, high latency, and heavy ETL maintenance.

Early stage 2: A MySQL performance bottleneck led to the introduction of ClickHouse for real‑time reporting. Alternatives such as Druid and Kudu were evaluated; ClickHouse was chosen for its engine flexibility and support for frequent updates.

However, the early ClickHouse setup had three limitations: it lacked join support, it could be reached only through JDBC/ODBC connections, and the MergeTree engine in use did not support updates, which caused data loss during truncation.

Spark + ClickHouse Real‑Time Warehouse

ClickHouse is a column‑oriented DBMS designed for OLAP with high performance on real‑time updates. Spark provides a unified engine for batch and streaming, with Spark Streaming offering micro‑batch processing.

The combined solution leverages ClickHouse’s fast update capability and Spark Streaming’s suitability for writing to ClickHouse.

Construction consists of three parts:

1) Offline data processing: Spark imports MongoDB data into Hive, performs ETL, and loads the result into ClickHouse.

2) Real‑time data processing: Spark Streaming consumes Kafka messages (curData/oriData) and writes directly to ClickHouse.

3) Join handling: Because the stream‑attribute and program‑attribute tables are each hundreds of GB, HBase is used as a dimension store; Spark looks up dimension data in HBase and joins it with the fact data before writing to ClickHouse.
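
The join step can be pictured as a lookup‑and‑enrich pass over each micro‑batch. The sketch below is plain Python rather than Spark code: a dictionary stands in for the HBase dimension table, and the column names (`stream_id`, `program_id`, `channel`, `bitrate`) are hypothetical.

```python
from typing import Dict, Iterable, List

# Stand-in for the HBase dimension store: row key -> dimension attributes.
PROGRAM_DIM: Dict[str, dict] = {
    "p1": {"channel": "drama"},
    "p2": {"channel": "variety"},
}

def enrich(facts: Iterable[dict], dim: Dict[str, dict]) -> List[dict]:
    """Left-join each fact row with its dimension row by program_id."""
    out = []
    for fact in facts:
        attrs = dim.get(fact["program_id"], {})  # point lookup by row key
        # A missing dimension yields None so the fact row is not dropped.
        out.append({**fact, "channel": attrs.get("channel")})
    return out

batch = [
    {"stream_id": "s1", "program_id": "p1", "bitrate": 1080},
    {"stream_id": "s2", "program_id": "p9", "bitrate": 720},
]
rows = enrich(batch, PROGRAM_DIM)
print(rows)
```

Point lookups by row key keep the cost of each enrichment independent of the dimension table's size, which is one reason a key‑value store can suit this role.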

ClickHouse Support for Frequent Updates

Two engines are recommended:

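ReplacingMergeTree – uses an ID as the sorting key so that rows with the same key are replaced during background merges. Because merges run asynchronously, duplicates can remain visible until a merge completes, so the engine does not guarantee that all duplicates are eliminated.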

VersionedCollapsingMergeTree – adds sign and version columns so that, during a merge, a state row and its matching cancel row collapse and are removed.
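
The replace‑on‑merge behavior can be modeled in a few lines. The following is a toy simulation in Python, not ClickHouse code: for each key it keeps the row with the highest version, which mirrors how ReplacingMergeTree resolves duplicates when a version column is configured. The column names `id`, `ver`, and `status` are made up for the example.

```python
def replacing_merge(rows, key="id", version="ver"):
    """Keep one row per key: the one with the highest version (last wins on ties)."""
    latest = {}
    for row in rows:
        seen = latest.get(row[key])
        if seen is None or row[version] >= seen[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

# Before a merge, every inserted row is visible, including the stale one.
parts = [
    {"id": 1, "ver": 1, "status": "draft"},
    {"id": 2, "ver": 1, "status": "draft"},
    {"id": 1, "ver": 2, "status": "published"},
]
merged = replacing_merge(parts)
print(merged)
```

Until a merge has run, a query sees all three rows. Queries that need exact results can add the `FINAL` modifier or aggregate by key at read time.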

Offline synchronization offers two schemes:

• Scheme 1: Incremental sync with ReplacingMergeTree – Mongo → Hive (incremental) → ClickHouse. Low pressure but requires stable dimension tables.

• Scheme 2: Full‑load sync with MergeTree – periodic Hive → ClickHouse truncate → load recent N days. Handles dimension changes but incurs high load.

Real‑time synchronization also offers two schemes:

• Scheme 1: Incremental sync with VersionedCollapsingMergeTree – one‑time load of historical data, then stream Kafka to ClickHouse.

• Scheme 2: Incremental sync with ReplacingMergeTree – similar to Scheme 1 but simpler, though duplicates may appear.
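
One way to picture Scheme 1: each Kafka change becomes two rows, a cancel row built from the old state and a state row built from the new one. The sketch below simulates the collapse in Python. Mapping oriData to the cancel row and curData to the state row is an assumption, as are the column names `id`, `ver`, `sign`, and `views`.

```python
from collections import defaultdict

def to_rows(ori, cur):
    """Turn one change into a cancel row (old state) and a state row (new state)."""
    rows = []
    if ori is not None:
        rows.append({**ori, "sign": -1})  # cancels the previously written state
    rows.append({**cur, "sign": 1})
    return rows

def collapse(rows):
    """Remove pairs that share (id, ver) and carry opposite signs."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["id"], row["ver"])].append(row)
    survivors = []
    for _, group in sorted(groups.items(), key=lambda item: item[0]):
        plus = [r for r in group if r["sign"] == 1]
        minus = [r for r in group if r["sign"] == -1]
        paired = min(len(plus), len(minus))
        survivors.extend(plus[paired:] + minus[paired:])
    return survivors

table = []
table += to_rows(None, {"id": 7, "ver": 1, "views": 10})   # initial insert
table += to_rows({"id": 7, "ver": 1, "views": 10},
                 {"id": 7, "ver": 2, "views": 25})         # update
state = collapse(table)
print(state)
```

The cancel row must repeat the version of the row it cancels; otherwise the pair does not match and both rows survive the merge.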

Data Accuracy Guarantees

Offline: When using a MergeTree engine, a re‑run job first drops the target partition, and a full‑load job truncates the table before loading, so repeated runs do not duplicate data.
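
The offline rule amounts to clearing the target before writing. As a sketch, the helper below builds the statement sequence for both cases. `TRUNCATE TABLE` and `ALTER TABLE ... DROP PARTITION` are standard ClickHouse statements; the table names, the partition value, the `dt` column, and the `INSERT ... SELECT` from a staging table are hypothetical.

```python
def rerun_plan(table, staging, partition=None):
    """Return the SQL statements for a repeatable reload of a table or one partition."""
    if partition is None:
        # Full-load job: empty the whole table, then reload it.
        return [
            f"TRUNCATE TABLE {table}",
            f"INSERT INTO {table} SELECT * FROM {staging}",
        ]
    # Re-run of one day: drop only the partition being recomputed.
    return [
        f"ALTER TABLE {table} DROP PARTITION '{partition}'",
        f"INSERT INTO {table} SELECT * FROM {staging} WHERE dt = '{partition}'",
    ]

for stmt in rerun_plan("report.daily", "staging.daily", partition="2021-04-01"):
    print(stmt)
```

Between the first and second statement the target is empty, so readers may briefly see missing data; this is the trade‑off of the truncate‑and‑reload approach.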

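Real‑time: Spark does not auto‑commit Kafka offsets. Offsets are committed manually only after a batch has been processed successfully, which gives at‑least‑once consumption: a failed batch is re‑read rather than skipped. Replayed rows do not corrupt results, because ReplacingMergeTree keeps a single row per key after merging, which makes the writes effectively idempotent. ClickHouse performs periodic merges (default every 5 minutes) on recent partitions to consolidate data.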
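
The interaction between manual offset commits and an idempotent sink can be shown with a small simulation. Everything here is a stand‑in written for illustration: a list plays the Kafka partition, a dictionary keyed by `id` plays the merged ReplacingMergeTree table, and the `fail_before_commit` flag injects a crash between the write and the commit.

```python
def run_batch(log, committed, sink, batch_size, fail_before_commit=False):
    """Process one batch from the committed offset; return the new committed offset."""
    batch = log[committed:committed + batch_size]
    for msg in batch:
        sink[msg["id"]] = msg  # upsert by key: replaying a message changes nothing
    if fail_before_commit:
        return committed       # crash: offset not advanced, batch will be re-read
    return committed + len(batch)

log = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
sink, offset = {}, 0

offset = run_batch(log, offset, sink, 2, fail_before_commit=True)  # written, not committed
offset = run_batch(log, offset, sink, 2)                           # same batch replayed
offset = run_batch(log, offset, sink, 2)                           # remaining message
print(offset, sorted(sink))
```

The first two messages are written twice, yet the sink ends with exactly one row per key. With a plain append‑only sink the same replay would leave duplicates.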

Configuration‑Driven Development

To reduce repetitive report‑development effort, the team built a configuration‑driven framework using Apache Commons CLI for command‑line parsing, enabling rapid construction of job parameters.
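
Apache Commons CLI is a Java library; to keep the examples in one language, the sketch below shows the same idea with Python's standard `argparse` module. The option names (`--source-topic`, `--target-table`, `--engine`, `--key`) and the sample values are invented for illustration and are not the team's actual parameters.

```python
import argparse

def parse_job(argv):
    """Parse command-line options into a job configuration dictionary."""
    parser = argparse.ArgumentParser(description="Hypothetical report-job launcher")
    parser.add_argument("--source-topic", required=True, help="Kafka topic to consume")
    parser.add_argument("--target-table", required=True, help="ClickHouse table to write")
    parser.add_argument("--engine", default="ReplacingMergeTree",
                        choices=["ReplacingMergeTree", "VersionedCollapsingMergeTree"])
    parser.add_argument("--key", action="append", default=[],
                        help="Key column; repeat the flag for composite keys")
    return vars(parser.parse_args(argv))

config = parse_job(["--source-topic", "asset_changes",
                    "--target-table", "report.assets",
                    "--key", "id"])
print(config)
```

With the job described entirely by its parameters, a new report becomes a new set of arguments rather than a new program.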

Value and Future Planning

The solution has achieved near‑zero code development for reporting, reducing cycle time from days to hours and supporting frequent updates with both real‑time and offline joinable reports. Currently, 4 offline and 3 real‑time report tasks are in production, including join scenarios.

Future work includes providing a UI for task creation, automating table creation across Hive, HBase, and ClickHouse, and further productizing the pipeline to support more business scenarios and drive iQIYI’s data platform toward full real‑time capability.


