
Bilibili’s Lakehouse Architecture: Integrating Data Lake and Warehouse with Apache Iceberg

To address the high cost and low efficiency of traditional Hadoop‑based data pipelines, Bilibili designed a lakehouse solution using Apache Iceberg, integrating Spark, Flink, Trino, and Alluxio to unify flexible data lake storage with warehouse‑level query performance, reducing data duplication and improving interactive analytics.

Big Data Technology & Architecture

Mar 31, 2022


Background: Bilibili ingests petabyte‑scale data into its big‑data platform every day and runs offline or real‑time ETL on it before serving downstream analysis, recommendation, and prediction scenarios. In the existing workflow, logs, events, and database records are collected into HDFS or Kafka, processed with Hive, Spark, or Flink, and stored as ORC. This workflow cannot meet the interactive query latency that many BI reports require, so data has to be exported to external engines such as ClickHouse, HBase, or Elasticsearch.

The current pipeline suffers from two major issues: (1) additional development effort, storage redundancy, and reduced flexibility when exporting Hive tables to external systems; (2) limited performance of SQL‑on‑Hadoop for ad‑hoc exploration, leading to poor interactive response.

This article introduces Bilibili’s exploration and practice of a lake‑warehouse (lakehouse) architecture to tackle these challenges.

Why a Lakehouse?

Data Lake provides virtually unlimited storage, unified metadata management, and open storage formats (CSV, JSON, ORC, Parquet). It offers great flexibility for structured, semi‑structured, and unstructured data but suffers from management and query‑efficiency problems.

Data Warehouse (OLAP) enforces strong schemas, provides standard SQL interfaces, and delivers high‑performance query acceleration through optimized storage layouts and indexing. However, it requires separate storage and incurs data duplication.

A lakehouse aims to combine the flexibility of a lake with the performance of a warehouse. Two technical routes exist: evolving a distributed warehouse to support open formats (e.g., Redshift, Snowflake), or evolving a data lake with open query engines and new table formats (e.g., Iceberg, Hudi, Delta Lake). Bilibili chose the latter.

Bilibili’s Lakehouse Architecture

The goal is to reduce the need for exporting Hive tables to external systems and to make SQL‑on‑Hadoop queries faster and cheaper. Bilibili built the lakehouse on Apache Iceberg, which provides self‑organizing metadata, snapshot support, and pluggable read/write engines.

Key components:

Data ingestion from Kafka and HDFS into Iceberg tables using Spark/Flink.

Magnus service: an intelligent management layer that receives commit events from Iceberg tables, queues them, and schedules Spark jobs to reorganize data (compact files, apply Z‑Order, build indexes).

Trino as the query engine, with Alluxio caching Iceberg metadata and index data for faster access.
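The Magnus flow described above (receive commit events, queue them, schedule reorganization jobs) can be illustrated with a small in-memory sketch. Everything here is hypothetical: the `CommitEvent` fields, the `OptimizeQueue` class, and the "schedule once enough new files have accumulated" policy are assumptions made for illustration, not Bilibili's implementation. A real service would submit a Spark job where this sketch simply returns table names.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List


@dataclass
class CommitEvent:
    """Hypothetical commit notification emitted after an Iceberg commit."""
    table: str        # fully qualified table name
    snapshot_id: int  # snapshot produced by the commit
    added_files: int  # number of data files the commit added


class OptimizeQueue:
    """Coalesces commit events per table and reports tables due for optimization.

    Assumed policy: a table is due once its commits have added at least
    `file_threshold` files since it was last scheduled.
    """

    def __init__(self, file_threshold: int = 10):
        self.file_threshold = file_threshold
        self._pending = OrderedDict()  # table -> files added since last schedule

    def on_commit(self, event: CommitEvent) -> None:
        # Many small commits to the same table collapse into one counter.
        self._pending[event.table] = self._pending.get(event.table, 0) + event.added_files

    def drain(self) -> List[str]:
        """Return due tables in first-seen order and reset their counters."""
        ready = [t for t, n in self._pending.items() if n >= self.file_threshold]
        for table in ready:
            del self._pending[table]
        return ready


queue = OptimizeQueue(file_threshold=10)
queue.on_commit(CommitEvent("db.a", snapshot_id=1, added_files=6))
queue.on_commit(CommitEvent("db.b", snapshot_id=2, added_files=3))
queue.on_commit(CommitEvent("db.a", snapshot_id=3, added_files=5))
due = queue.drain()  # only db.a has crossed the threshold so far
```

Coalescing per table keeps a burst of small streaming commits from triggering one reorganization job per commit.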

Iceberg Enhancements

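Z‑Order Sorting: Iceberg records min/max statistics for each column of each data file, so files whose value ranges cannot match a filter are skipped. These statistics prune well only on columns the data is clustered by, and a conventional multi‑column sort mainly clusters the leading column. By applying a global Z‑Order (interleaved order) on multiple frequently filtered columns, Bilibili achieves effective data skipping across dimensions. Spark was extended with a Z‑Order Range Partitioner and an OptimizeAction to trigger re‑organization.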
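A toy illustration of why interleaving helps min/max pruning, under simplifying assumptions: two small non-negative integer columns, fixed-size "files", and helper names (`z_value`, `split_into_files`, `files_scanned`) invented for this sketch. It is not the Spark partitioner mentioned above, which must also map arbitrary column types onto comparable bit strings.

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit positions
    return z


def split_into_files(rows, key, rows_per_file):
    """Sort rows by `key`, then cut the sorted run into fixed-size 'files'."""
    ordered = sorted(rows, key=key)
    return [ordered[i:i + rows_per_file] for i in range(0, len(ordered), rows_per_file)]


def files_scanned(files, column, value):
    """Count files whose [min, max] range on `column` may contain `value`."""
    return sum(
        1 for f in files
        if min(r[column] for r in f) <= value <= max(r[column] for r in f)
    )


# A 16 x 16 grid of (x, y) rows, 16 rows per file, so 16 files per layout.
rows = [(x, y) for x in range(16) for y in range(16)]
linear = split_into_files(rows, key=lambda r: (r[0], r[1]), rows_per_file=16)
zorder = split_into_files(rows, key=lambda r: z_value(r[0], r[1]), rows_per_file=16)

for name, files in (("linear", linear), ("z-order", zorder)):
    print(name,
          "x == 5 scans", files_scanned(files, 0, 5), "files;",
          "y == 5 scans", files_scanned(files, 1, 5), "files")
```

In this grid the linear sort scans 1 file for a filter on `x` but all 16 for a filter on `y`, while the Z-order layout scans 4 files for either column: it gives up some pruning on the leading column in exchange for useful pruning on both.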

Indexing:

Min/Max (built‑in) works well for sorted columns.

Bloom filters provide fast existence checks for high‑cardinality columns but only support equality‑type predicates.

Bitmap indexes handle range predicates but are costly in both storage and query time; Bilibili adopted a Bit‑sliced Encoded Bitmap technique to balance storage and query cost.
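To make the bit-sliced idea concrete, here is a minimal sketch of a generic bit-sliced index that answers range predicates with one bitmap per bit position instead of one bitmap per distinct value. It assumes non-negative integer values that fit in `bits` bits and uses Python integers as bitmaps (bit `r` set means row `r` matches). The class and method names are illustrative, and the source does not describe the exact encoding Bilibili uses, so this should not be read as their implementation.

```python
class BitSlicedIndex:
    """One bitmap per bit position of the indexed value."""

    def __init__(self, values, bits=8):
        self.bits = bits
        self.n = len(values)
        self.slices = [0] * bits  # slices[i] marks rows whose value has bit i set
        for row, v in enumerate(values):
            if not 0 <= v < (1 << bits):
                raise ValueError(f"value {v} does not fit in {bits} bits")
            for i in range(bits):
                if (v >> i) & 1:
                    self.slices[i] |= 1 << row

    def less_equal(self, c):
        """Bitmap of rows with value <= c, scanning slices from high bit to low."""
        all_rows = (1 << self.n) - 1
        if c < 0:
            return 0
        if c >= (1 << self.bits):
            return all_rows
        lt, eq = 0, all_rows  # rows already known smaller / still equal on the prefix
        for i in reversed(range(self.bits)):
            s = self.slices[i]
            if (c >> i) & 1:
                lt |= eq & ~s  # prefix equal, row has 0 where c has 1: smaller
                eq &= s
            else:
                eq &= ~s       # row has 1 where c has 0: larger, drop it
        return lt | eq

    def between(self, lo, hi):
        """Bitmap of rows with lo <= value <= hi."""
        return self.less_equal(hi) & ~self.less_equal(lo - 1)


def rows_of(bitmap):
    """Expand a bitmap into a sorted list of row numbers."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


index = BitSlicedIndex([5, 12, 7, 0, 200, 7], bits=8)
print(rows_of(index.less_equal(7)))   # rows whose value is at most 7
print(rows_of(index.between(6, 12)))  # rows whose value lies in [6, 12]
```

The number of bitmaps grows with the bit width of the column rather than with its cardinality, which is the storage versus query-cost trade-off the bitmap item refers to.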

Results and Outlook

In production, the Iceberg‑based lakehouse handles petabyte‑scale data, serving tens of thousands of daily queries with 90th‑percentile latency under 1 second, satisfying interactive analytics needs. Future work includes star‑schema data organization, pre‑computation for hot query patterns, and automated query‑driven data layout adaptation.

Overall, the Iceberg lakehouse retains compatibility with the existing Hadoop stack while delivering near‑warehouse query performance, simplifying architecture, and reducing resource consumption.




Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
