Bilibili's Efficient Lakehouse Platform Built on Trino and Iceberg
Bilibili’s new lake-house platform, built on Trino and Iceberg, replaces Hive-based pipelines: logs and database data are ingested into Iceberg tables and optimized with sorting, Z-order/Hilbert clustering, bitmap and bloom indexes, virtual join columns, and pre-aggregation. The result is 70,000 daily queries over 2 PB of data, with an average scan of only 2 GB and P90 response times under 2 seconds.
At a Hadoop Meetup 2022 session in Shanghai, Li Chengxiang, head of Bilibili's OLAP platform, presented how Bilibili built an efficient lake-house platform on Trino and Iceberg.
Bilibili’s data originates from three main sources: application/web event logs, server logs, and business system databases. These are ingested into a big‑data platform as offline HDFS files and real‑time Kafka streams.
Data engineers process the raw data using Hive and Spark for batch workloads and Flink for streaming, building layered data models. Users can query the processed tables directly with Spark or Trino, but many use cases (BI, data services, full‑text search) still require exporting Hive tables to external stores such as ClickHouse, Redis, or Elasticsearch, which adds synchronization cost and creates data silos.
To address these issues, Bilibili introduced a Trino + Iceberg lake‑house architecture. The goals are to accelerate queries that previously accessed Hive via Trino/Spark and to eliminate unnecessary data movement to external stores by serving queries directly from Iceberg tables.
The overall architecture consists of ETL jobs (Spark/Flink) that read and write Iceberg tables, while external services query those tables through Trino. A self‑developed service called Magnus receives commit notifications from Iceberg and asynchronously launches Spark tasks to optimize table layout (e.g., small‑file merging, data sorting).
Performance targets focus on typical SPJA (Select‑Project‑Join‑Aggregate) workloads, where query result sets are small. By enhancing Iceberg and Trino with sorting, indexing, and pre‑computation, the engine can skip irrelevant data files, limiting scanned data to a predictable range.
Projection is handled by ORC columnar storage, allowing the reader to fetch only required columns. Filter push‑down is achieved through Trino’s optimizer and Iceberg’s partition pruning and min‑max statistics, enabling data skipping at the file level.
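The file-level skipping step can be sketched in a few lines. This is a minimal illustration, not Iceberg's actual code; the file list and min/max statistics are made up, in the spirit of what Iceberg keeps in its manifest files:

```python
# Hypothetical per-file min/max stats, as a planner would read them from manifests.
files = [
    {"path": "a.orc", "min_id": 0,   "max_id": 99},
    {"path": "b.orc", "min_id": 100, "max_id": 199},
    {"path": "c.orc", "min_id": 200, "max_id": 299},
]

def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range can overlap the predicate [lo, hi]."""
    return [f["path"] for f in files if f["max_id"] >= lo and f["min_id"] <= hi]

print(files_to_scan(files, 150, 160))  # only b.orc survives pruning
```

The effectiveness of this pruning depends entirely on how tightly values cluster within each file, which is why the sorting work described next matters.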
To improve skipping for filters on high-cardinality columns, Bilibili extended Iceberg's data distribution capabilities. Besides traditional hash and range ordering, they added Z-order and Hilbert-curve ordering. Both map multi-dimensional values onto a one-dimensional sort order while preserving locality in every sorted column; the Hilbert curve additionally avoids the long jumps a Z-order curve makes between quadrants, giving better clustering for some workloads.
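As a rough illustration of the idea behind Z-ordering (a generic Morton-key sketch, not Bilibili's implementation), interleaving the bits of two columns produces a single sort key under which rows close in either dimension land close together:

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Build a Z-order (Morton) key by interleaving the bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bit i of x -> even position
        key |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> odd position
    return key

# Sorting by the interleaved key clusters rows in both dimensions at once,
# keeping min/max stats tight on each column within every file.
rows = [(3, 7), (200, 1), (2, 6), (201, 0)]
rows.sort(key=lambda r: z_order_key(*r))
# → [(2, 6), (3, 7), (201, 0), (200, 1)]
```

A plain sort on one column would keep only that column's stats tight; the interleaved key trades a little locality on each column for usable locality on all of them.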
Indexing enhancements include Bitmap indexes for range filters, Bloom‑range filters that provide space‑efficient approximate range indexing, and the use of Iceberg metadata (min‑max values) during both InputSplit generation (Coordinator) and per‑file processing (Worker) in Trino.
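The talk did not detail the Bloom-range filter internals, but the building block is a standard Bloom filter: a compact bitset that can prove a value is absent from a file, so the worker can skip it. A minimal sketch with hypothetical sizing parameters:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bitset

    def _positions(self, value):
        # Derive independent bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

# One filter per data file over a high-cardinality column: a negative answer
# guarantees the file holds no matching row and can be skipped entirely.
per_file = BloomFilter()
for uid in ("u1", "u2", "u3"):
    per_file.add(uid)
```

A positive answer only means the file *might* match, so the engine still reads it; the win is the guaranteed negatives, which cost a few bits per indexed value.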
For star‑schema joins, Bilibili defined virtual join columns on Iceberg tables. These logical columns allow filter conditions from dimension tables to be pushed down to fact tables, enabling data skipping on fact‑table files and significantly improving join query performance.
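The pushdown idea can be illustrated with a toy example (data and layout invented here): a predicate on a dimension attribute is first rewritten into a key predicate, which is then checked against per-file statistics on the fact side.

```python
# Hypothetical dimension table (join key -> attribute) and fact-file key stats.
dim_region = {1: "APAC", 2: "EU", 3: "APAC", 4: "NA"}
fact_files = [
    {"path": "f1.orc", "key_min": 1, "key_max": 2},
    {"path": "f2.orc", "key_min": 3, "key_max": 4},
]

# Step 1: region = 'NA' on the dimension side becomes a set of join keys.
keys = {k for k, region in dim_region.items() if region == "NA"}  # {4}

# Step 2: fact files whose key range cannot contain any qualifying key are skipped.
scan = [f["path"] for f in fact_files
        if any(f["key_min"] <= k <= f["key_max"] for k in keys)]
# scan == ["f2.orc"]; f1.orc is never read
```

Without the virtual column, every fact file would have to be scanned before the join could discard non-matching rows.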
Pre‑aggregation is also supported: Iceberg metadata can directly answer count/min/max queries, and a future pre‑compute cube framework is under development to accelerate more complex aggregations.
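A count/min/max query can be answered from per-file statistics alone, roughly as follows. The stats here are invented, and a real implementation must also account for deletes and files lacking statistics:

```python
# Hypothetical per-file stats of the kind Iceberg records in its manifests.
file_stats = [
    {"row_count": 1_000_000, "min_ts": 100, "max_ts": 500},
    {"row_count": 2_500_000, "min_ts": 400, "max_ts": 900},
]

# SELECT count(*), min(ts), max(ts) FROM t -- answered without opening a data file
total_rows = sum(f["row_count"] for f in file_stats)
ts_min = min(f["min_ts"] for f in file_stats)
ts_max = max(f["max_ts"] for f in file_stats)
```

Because only metadata is touched, the cost is proportional to the number of files, not the number of rows.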
Operational metrics show a Trino cluster of 5,376 cores handling 70,000 daily queries on 2 PB of data. Thanks to sorting and indexing, average queries scan only 2 GB, with a P90 response time under 2 seconds, meeting the goal of a second‑level responsive lake‑house platform.
Bilibili Tech