
Intelligent Optimization of Bilibili’s Iceberg‑Based Lakehouse for Query Acceleration

This article describes Bilibili’s intelligent optimization project that automatically analyzes historical query workloads to configure multi‑dimensional sorting, various indexes, and pre‑aggregation on Iceberg tables, thereby reducing scan volume by 28% across dozens of tables and improving OLAP query latency.


Background – As data volumes and query complexity grow, Bilibili’s lakehouse built on Apache Iceberg faces performance and stability challenges. Traditional manual tuning (multi‑dimensional sorting, Z‑order, Hilbert Curve, Bloom/Bitmap indexes, pre‑aggregation) requires high technical expertise, limiting adoption.

Intelligent Optimization Goal – Automatically analyze users’ historical queries, recommend and apply optimal data organization (distribution, sorting, indexing, pre‑computation) to improve query efficiency without manual intervention.

Lakehouse Architecture – Iceberg tables are stored on HDFS and ingested via Spark (batch) and Flink/Java SDK (streaming). Interactive analysis runs on Trino with Alluxio caching; latency‑critical workloads may export data to ClickHouse or Elasticsearch. The platform hosts ~2,000 Iceberg tables (~40 PB) with ~100 TB of daily ingest and >4 M Trino queries per day (P99 latency ≈ 3 s).

Optimization Techniques

Multi‑dimensional sorting (Z‑order, Hilbert Curve) to improve data locality.
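The core idea behind Z-order can be sketched as bit interleaving: a minimal illustration, not the actual Iceberg/Spark implementation, which handles arbitrary types and column counts.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values into one Z-order key.

    Rows sorted by this key cluster records that are close in BOTH
    dimensions into the same files, so per-file min/max statistics
    prune effectively for filters on either column.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x supplies the even bits
        key |= ((y >> i) & 1) << (2 * i + 1)   # y supplies the odd bits
    return key

rows = [(3, 7), (1, 1), (2, 6), (0, 0)]
rows.sort(key=lambda r: z_order_key(*r))
# → [(0, 0), (1, 1), (2, 6), (3, 7)]
```

A Hilbert curve serves the same purpose with better locality guarantees, at the cost of a more involved key computation.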

Indexes: Bloom filters for equality predicates, bitmap indexes for range predicates, and custom log‑specific indexes.
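How a Bloom filter skips files on equality predicates can be shown with a toy implementation (real file-level Bloom filters are stored in the table/file metadata; this sketch is purely illustrative):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: probabilistic set membership.

    False positives are possible (the file may still not contain the
    value), but false negatives are not -- so a "no" answer lets the
    engine skip reading the file entirely.
    """

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0

    def _positions(self, value: str):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value: str) -> None:
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(value))
```

Bitmap indexes complement this for range predicates, where Bloom filters are of no help.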

Pre‑aggregation to generate materialized aggregates per file.
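A per-file pre-aggregate can be pictured as a small summary computed at write time; the column names below are made up for illustration, and the real mechanism materializes these aggregates alongside Iceberg data files.

```python
from collections import defaultdict

def preaggregate(rows, group_key, agg_col):
    """Materialize per-file SUM/COUNT aggregates so the engine can
    answer a matching GROUP BY from the summary instead of raw rows."""
    agg = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for row in rows:
        entry = agg[row[group_key]]
        entry[0] += row[agg_col]
        entry[1] += 1
    return {k: {"sum": s, "count": c} for k, (s, c) in agg.items()}

# Hypothetical rows from one data file:
file_rows = [
    {"region": "cn", "plays": 10},
    {"region": "cn", "plays": 5},
    {"region": "jp", "plays": 7},
]
summary = preaggregate(file_rows, "region", "plays")
# summary["cn"] == {"sum": 15, "count": 2}
```

Because SUM and COUNT compose across files, a query touching many files can combine the per-file summaries instead of rescanning the raw data.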

File‑level sorting for Top‑N queries.
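The benefit for Top-N is that pre-sorted files let the engine merge just the head of each file rather than scan everything; a rough sketch of that merge, assuming each file's values are already sorted descending:

```python
import heapq
import itertools

def topn_from_sorted_files(files, n):
    """Lazily merge descending-sorted files and take the first n values.

    Only about n rows per file are ever pulled from the merge, which is
    the point of keeping each file sorted on the Top-N metric.
    """
    merged = heapq.merge(*files, reverse=True)
    return list(itertools.islice(merged, n))

topn_from_sorted_files([[9, 7, 2], [8, 5, 1]], 3)  # → [9, 8, 7]
```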

Intelligent Optimization Workflow – Two engines cooperate: Trino captures queries, Spark rewrites data. A query‑capture service records query metadata into an Iceberg table; the Magnus service periodically analyzes this metadata, generates recommendations, and writes them back to Iceberg tables. Spark applies the recommendations asynchronously via commit events, supporting both minor optimizations (small‑file merging) and major ones (sorting, distribution, indexing), with snapshot‑based scheduling to keep optimization latency low.
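One simple form the analysis step could take is predicate-frequency counting over the captured query metadata; this is a hypothetical stand-in for what Magnus does (the field name `filter_columns` is invented for illustration), not its actual recommendation logic.

```python
from collections import Counter

def recommend_sort_columns(query_log, top_n=2):
    """Pick the most frequently filtered columns from captured query
    metadata as sort/index candidates for the next optimization pass."""
    freq = Counter(col for q in query_log for col in q["filter_columns"])
    return [col for col, _ in freq.most_common(top_n)]

# Hypothetical captured metadata, one record per query:
log = [
    {"filter_columns": ["uid", "dt"]},
    {"filter_columns": ["uid"]},
    {"filter_columns": ["dt", "event"]},
]
recommend_sort_columns(log)  # → ["uid", "dt"]
```

In practice the recommendation also has to weigh column cardinality, predicate type (equality vs. range), and write cost, which is what makes manual tuning hard in the first place.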

Results – More than 30 tables have been auto‑optimized; over the following 30 days their scan volume dropped by 28%, and more than 60% of those tables saw reductions above 30%. A front‑end dashboard shows partition‑level statistics and the impact of each optimization.

Future Plans – Improve recommendation accuracy using richer statistics and machine‑learning models, extend support for more query scenarios (log‑specific indexes, multi‑table pre‑aggregation), and roll out intelligent optimization to more production tables.

Overall, the intelligent optimization framework automates the selection and application of advanced Iceberg features, delivering measurable query performance gains while reducing operational overhead.

Tags: Big Data, Machine Learning, Query Optimization, Data Warehouse, Spark, Iceberg, Trino
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
