Bilibili Lakehouse Integration: Iceberg and Alluxio Optimization Practices
This article details Bilibili's lakehouse implementation using Apache Iceberg and Alluxio, covering background challenges, architectural components, data organization techniques like Z‑order and bitmap indexes, performance benchmarks, and future optimization plans for large‑scale analytics.
In this talk, the speaker from Bilibili's OLAP platform shares the practical deployment of a lakehouse solution built on Apache Iceberg and Alluxio, focusing on technical details and performance optimizations.
Bilibili processes petabyte‑scale data daily, and its traditional SQL‑on‑Hadoop engines (Hive, Spark, Presto) could no longer meet performance and reliability requirements, prompting a move toward a lakehouse architecture that combines the flexibility of a data lake with the management efficiency of a data warehouse.
The architecture consists of three layers: (1) data ingestion—real‑time streams are consumed by Flink and written to Iceberg tables, while batch ETL jobs use Spark to load data; (2) storage optimization—an internal service called Magnus runs Spark jobs to optimize Iceberg tables; (3) interactive analysis—Trino serves queries, accelerated by Alluxio as a caching layer.
Iceberg is an open table format that separates catalog (Hive catalog in use) from metadata layers, which include JSON metadata files, manifest lists, and manifest files that store schema, partition, and min/max statistics for each data file.
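To make the role of those min/max statistics concrete, here is a toy sketch of manifest‑based file pruning. The manifest layout, column name, and file paths below are illustrative only; real Iceberg manifests are Avro files with per‑column lower/upper bounds that the engine reads for you.

```python
# Toy sketch of Iceberg-style file pruning with per-file min/max stats.
# (Illustrative data model, not the real Iceberg manifest format.)

def prune_files(manifest, column, op, value):
    """Return the data files that *might* contain rows matching `column op value`."""
    survivors = []
    for f in manifest:
        lo, hi = f["stats"][column]          # per-file min/max for the column
        if op == "=" and lo <= value <= hi:  # value falls inside the file's range
            survivors.append(f["path"])
        elif op == ">" and hi > value:       # file's max exceeds the bound
            survivors.append(f["path"])
        elif op == "<" and lo < value:       # file's min is below the bound
            survivors.append(f["path"])
    return survivors

manifest = [
    {"path": "f1.parquet", "stats": {"uid": (1, 100)}},
    {"path": "f2.parquet", "stats": {"uid": (101, 200)}},
    {"path": "f3.parquet", "stats": {"uid": (201, 300)}},
]

prune_files(manifest, "uid", "=", 150)  # only f2.parquet survives the predicate
```

Note that pruning is conservative: a surviving file may still contain no matching rows, but a skipped file is guaranteed to contain none.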
Query performance is enhanced by leveraging these min/max statistics for predicate push‑down. Linear sorting can provide ordering for a single field, while Z‑order sorts data across multiple fields by interleaving bits, enabling more effective file‑level filtering, though it may suffer from poorer locality compared to Hilbert curves.
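The bit interleaving behind Z‑order can be sketched in a few lines. This is a minimal two‑column Morton encoding, not Bilibili's production implementation (which runs inside Magnus/Spark rewrites):

```python
def interleave_bits(x, y, bits=16):
    """Compute a Z-order (Morton) value by interleaving the bits of x and y:
    bit i of x lands at position 2*i, bit i of y at position 2*i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Sorting rows by their Z-value keeps rows with similar (x, y) physically
# close, so each data file gets tight min/max ranges on BOTH columns.
points = [(x, y) for x in range(4) for y in range(4)]
points.sort(key=lambda p: interleave_bits(*p))
```

Because both columns contribute to the sort key, predicates on either column can skip files, whereas a linear sort only tightens the ranges of its leading column.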
Examples illustrate how linear sorting and Z‑order affect file selection when filtering on different columns, showing that Z‑order can skip up to 50% of files for certain predicates.
Bloom filters are introduced for fast membership checks but have limitations (false positives, only equality predicates). Bitmap indexes overcome these issues, supporting range and logical predicates without false positives. Various encoding schemes—equal‑value, range‑encoded, and bit‑sliced—are described, highlighting trade‑offs in storage and query efficiency.
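A tiny equal‑value bitmap index illustrates the contrast with Bloom filters: one bitmap per distinct value gives exact membership, and range or boolean predicates fall out of OR/AND over bitmaps. This sketch uses Python ints as bitsets; production systems use compressed bitmaps (e.g. roaring bitmaps), and the range‑encoded and bit‑sliced schemes mentioned above trade extra encoding work for cheaper range queries.

```python
from collections import defaultdict

def build_index(column):
    """Equal-value bitmap index: one bitmap (int-as-bitset) per distinct value."""
    bitmaps = defaultdict(int)
    for row, value in enumerate(column):
        bitmaps[value] |= 1 << row          # set this row's bit in the value's bitmap
    return bitmaps

def rows_of(bitmap):
    """Decode a bitmap back into a list of row numbers."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]

ages = [23, 31, 23, 45, 31, 52]
idx = build_index(ages)

eq = idx[31]                                # age = 31: exact, no false positives
rng = 0
for v, bm in idx.items():                   # age >= 40: OR the qualifying bitmaps
    if v >= 40:
        rng |= bm

rows_of(eq)   # rows 1 and 4
rows_of(rng)  # rows 3 and 5
```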
Benchmark results (SSB test) demonstrate that combining Z‑order with bitmap indexes yields 1‑10× faster query times, 1‑400× fewer file reads, and dramatically reduced CPU usage compared to a basic setup.
Alluxio is employed to cache Iceberg metadata and index files, reducing metadata access latency and stabilizing query performance. Tests show a modest 10‑25% overhead on the first (cold) read, followed by a 1.5‑2× speed‑up for remote reads and 5‑10× for local reads.
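The read‑through pattern behind those numbers can be sketched with a toy cache: the first access pays the remote fetch plus the cost of populating the cache (the observed cold‑read overhead), and every later access is served locally. This is an illustration of the pattern only; the class and fetch function below are hypothetical, not Alluxio's API.

```python
class ReadThroughCache:
    """Minimal read-through cache in the spirit of Alluxio's role here:
    a miss fetches from remote storage and populates the local cache,
    so subsequent reads of the same path are served locally."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote    # backing-store read function (e.g. HDFS)
        self.store = {}
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self.store:
            self.hits += 1                  # warm read: local copy
            return self.store[path]
        self.misses += 1                    # cold read: remote fetch + cache write
        data = self.fetch_remote(path)
        self.store[path] = data
        return data

cache = ReadThroughCache(lambda p: f"bytes-of:{p}")
cache.read("metadata.json")   # miss: fetched remotely, then cached
cache.read("metadata.json")   # hit: served from the cache
```

Caching Iceberg metadata and index files is especially effective because they are small, read on every query, and reused across many queries, so the cold‑read penalty is amortized quickly.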
Future work includes supporting pre‑computed cubes for frequent aggregations, optimizing star‑schema joins by materializing virtual columns, caching hot data with Alluxio for SLA guarantees, and applying intelligent data optimization based on historical query analysis.
The speaker concludes by thanking the audience.
DataFunTalk