Lakehouse Architecture Practice at Bilibili: Query Acceleration and Index Enhancement
Bilibili’s lakehouse architecture merges Iceberg‑based data lake flexibility with data‑warehouse efficiency, using Kafka‑Flink real‑time ingestion, Spark offline loads, Trino queries, Alluxio caching, Z‑Order/Hilbert sorting, and enhanced BloomFilter and bitmap indexes to boost query speed up to tenfold while drastically cutting file reads.
This article introduces Bilibili's exploration and practice in query acceleration and index enhancement under an integrated data lake and data warehouse (lakehouse) architecture. The content covers four main areas:
1. What is Lakehouse Architecture: A data lake provides flexibility through unified storage, support for structured, semi-structured, and unstructured data, open compute engines, and flexible processing interfaces, but its data quality tends to be lower. A data warehouse offers a strong schema, closed data formats, high query efficiency, and high data quality. The lakehouse aims to combine the flexibility of the data lake with the efficiency of the data warehouse.
2. Bilibili's Lakehouse Architecture: Iceberg is the core table format. Real-time data from Kafka is processed by Flink and written to HDFS in Iceberg format, while offline data is written via Spark. The Magnus service continuously reorganizes and optimizes the data. Trino serves as the query engine, with Alluxio caching metadata and index data.
3. Data Sorting and Organization: For multi-dimensional analysis scenarios (star schema), the key is to read only the required data by combining data organization with indexing. The article explores Z-Order sorting (interleaving the bits of multi-dimensional values into a single one-dimensional value while preserving clustering on every dimension) and Hilbert curve sorting (which clusters better than Z-Order because it avoids large-span jumps between consecutive points on the curve). A boundary-based interleave index method is introduced to ensure Z-values start from positive integers.
4. Index Enhancement: Beyond Iceberg's built-in MinMax index, Bilibili implemented a BloomFilter index for equality queries and a Bitmap index for range queries. To address the plain Bitmap index's limitations (range filtering performance and storage cost), they introduced Range Encoded Bitmap and Bit-Slice Encoded Bitmap, which reduce the number of bitmaps needed for a column of cardinality 256 from 256 to just 9. The combined approach speeds up queries by 1x to 10x and cuts file reads by up to 400x.
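The Z-Order sorting described in item 3 can be illustrated with a minimal Python sketch. It assumes two dimensions whose values have already been mapped to small non-negative integers; the function names `interleave_bits` and `z_order_sort` are hypothetical and are not taken from Bilibili's implementation.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-value (Morton code)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x land on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y land on odd positions
    return z


def z_order_sort(rows, bits: int = 16):
    """Sort (x, y, ...) rows by the Z-value of their first two columns."""
    return sorted(rows, key=lambda r: interleave_bits(r[0], r[1], bits))


if __name__ == "__main__":
    rows = [(3, 5), (0, 1), (1, 0), (2, 2)]
    for row in z_order_sort(rows):
        print(row, interleave_bits(row[0], row[1]))
```

Because the sort key mixes bits from both columns, rows that are close on either dimension tend to end up in the same files, so the per-file MinMax ranges stay narrow for both columns rather than only for the leading sort column.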
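The Bit-Slice Encoded Bitmap in item 4 can be sketched as follows. This is an illustration under stated assumptions, not Bilibili's code: Python integers stand in for compressed bitmaps, the helper names are hypothetical, and the count of 9 is read here as 8 bit slices plus one existence (non-null) bitmap for values in the range 0 to 255, which is one plausible interpretation of the figure in the text.

```python
def build_bit_slices(values, bits: int = 8):
    """Return (existence, slices); slices[i] has bit r set when row r's value has bit i set."""
    existence = 0
    slices = [0] * bits
    for row, v in enumerate(values):
        if v is None:
            continue  # NULL rows are left out of the existence bitmap
        existence |= 1 << row
        for i in range(bits):
            if (v >> i) & 1:
                slices[i] |= 1 << row
    return existence, slices


def rows_less_equal(existence, slices, c: int) -> int:
    """Bitmap of rows whose value is <= c, assuming 0 <= c < 2**len(slices)."""
    lt, eq = 0, existence  # rows already known smaller / rows still tied with c
    for i in reversed(range(len(slices))):
        if (c >> i) & 1:
            lt |= eq & ~slices[i]  # tied so far, 0 where c has 1: strictly smaller
            eq &= slices[i]
        else:
            eq &= ~slices[i]       # a 1 where c has 0 means larger: drop the row
    return lt | eq


def to_rows(bitmap: int):
    """Expand a bitmap into the list of row numbers it contains."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


if __name__ == "__main__":
    values = [5, 200, None, 17, 255, 0]
    existence, slices = build_bit_slices(values)
    print(len(slices) + 1, "bitmaps stored")
    print(to_rows(rows_less_equal(existence, slices, 17)))
```

A range predicate is answered with one pass over the slices using bitwise operations, instead of OR-ing together one bitmap per distinct value as a plain equality-encoded bitmap index would.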
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.