Big Data · 18 min read

Lakehouse Architecture at Bilibili: Query Acceleration and Index Enhancement Practices

This article explains Bilibili's lakehouse (integrated lake‑warehouse) architecture, describing how Iceberg, MagnuS, Trino, and Alluxio combine to deliver flexible data storage, high‑performance query acceleration, and automated data organization and indexing via Z‑Order and Hilbert curve sorting, Bloom filters, and advanced BitMap techniques.

DataFunTalk

The article introduces the concept of a lake‑warehouse (lakehouse) architecture, explaining the characteristics of data lakes—unified storage, support for all data types, and openness to multiple compute engines—and data warehouses—strong schema, high query efficiency, and reliable data quality.

It then outlines Bilibili's current big‑data platform built on Hadoop (HDFS, Hive, Spark, Presto) and the need for a dedicated distributed warehouse (ClickHouse) for interactive analytics, highlighting the challenges of data duplication and consistency when moving data between the lake and warehouse.

Three primary goals of Bilibili's lakehouse are presented: (1) retain the flexibility of a data lake using unified HDFS storage and seamless integration with Spark, Flink, Presto, etc.; (2) achieve the efficiency of a data warehouse by optimizing data organization, indexing, and pre‑computation, especially using Iceberg tables; (3) provide a low‑friction, intelligent user experience that automates data sorting, indexing, and query optimization.

The core of Bilibili's lakehouse is Iceberg, chosen over Hudi and Delta Lake. Real‑time data flows from Kafka through Flink into Iceberg on HDFS, while batch data is written via Spark. The MagnuS service continuously optimizes Iceberg tables (sorting, indexing) using Spark jobs, and Trino serves as the query engine with Alluxio caching metadata and index data.

To improve query performance, the article discusses data sorting strategies. Using a star‑schema benchmark, it shows how Z‑Order sorting (interleaving bits of multiple columns) and Hilbert curve ordering can significantly reduce data scanning. It also describes the challenges of preserving ordering across multiple filter columns and introduces a boundary‑based interleave index that ensures positive integer Z‑Values.
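The bit interleaving behind Z‑Order can be shown in a few lines. The sketch below is a minimal two‑column illustration (the `z_value` helper is hypothetical and not Bilibili's boundary‑based interleave index, which additionally handles column boundaries so Z‑Values stay positive integers):

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Z-value.

    Bit i of x lands at position 2*i + 1 and bit i of y at position 2*i,
    so rows that are close in both columns end up close on the Z-curve.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z

# Sorting rows by Z-value clusters both dimensions at once, so a filter
# on either column prunes contiguous runs of files.
rows = [(3, 7), (0, 0), (5, 2), (1, 6)]
rows.sort(key=lambda r: z_value(*r))
```

Sorting on a single column would cluster that column perfectly but scatter the other; the interleaved ordering trades a little locality on each column for usable locality on both.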

Index enhancements include file‑level Bloom filters for equality queries and BitMap indexes for range and multi‑condition queries. To address the high storage cost of BitMaps, the article introduces Range‑Encoded BitMaps (allowing efficient range queries with at most two BitMaps) and Bit‑Slice Encoded BitMaps (compressing many cardinalities into a few BitMaps). Combining these techniques yields 1‑10× query speedups and up to 400× reduction in file reads.
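The "at most two BitMaps" property of range encoding follows from how the bitmaps are defined: bitmap `v` marks every row whose value is ≤ v, so any closed range reduces to ANDing one bitmap with the complement of another. The following sketch uses plain Python ints as bitsets and hypothetical helper names for illustration; production indexes use compressed bitmaps such as Roaring:

```python
from typing import Dict, List

def build_range_encoded(values: List[int], max_val: int) -> Dict[int, int]:
    """Build range-encoded bitmaps: bit i of bitmaps[v] is set iff values[i] <= v."""
    bitmaps = {v: 0 for v in range(max_val + 1)}
    for i, x in enumerate(values):
        for v in range(x, max_val + 1):
            bitmaps[v] |= 1 << i
    return bitmaps

def rows_in_range(bitmaps: Dict[int, int], lo: int, hi: int, n_rows: int) -> List[int]:
    """Rows with lo <= value <= hi, touching at most two bitmaps:
    (value <= hi) AND NOT (value <= lo - 1)."""
    mask = (1 << n_rows) - 1
    le_hi = bitmaps[hi]
    lt_lo = bitmaps[lo - 1] if lo > 0 else 0
    bits = le_hi & ~lt_lo & mask
    return [i for i in range(n_rows) if (bits >> i) & 1]

values = [2, 5, 1, 4, 3]
bms = build_range_encoded(values, 5)
hits = rows_in_range(bms, 2, 4, len(values))  # rows whose value is in [2, 4]
```

An equality-encoded index would need `hi - lo + 1` bitmaps OR'd together for the same predicate, which is why range encoding pays off for range and multi‑condition filters despite its higher build cost.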

Finally, the article notes the implementation of new SQL APIs in Iceberg and Spark for specifying file‑level sorting (distributed by, locally ordered by) with options such as hash, range, Z‑Order, and Hilbert curve, and provides configuration guidelines to achieve minimal file access in multi‑dimensional analytical workloads.

Tags: big data, data warehouse, Index Optimization, iceberg, Query Acceleration, lakehouse, Z-Order
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
