Trino in Bilibili Lakehouse: Compute Engine, Stability, and Containerization Practices
This article presents Bilibili's practice of running Trino in a lakehouse architecture. It covers where the compute engine sits in the stack, the stability enhancements, and the containerized deployment, and details the indexing strategies, pre‑computation techniques, and Iceberg metadata optimizations behind the performance gains for large‑scale analytical queries.
Overview
The presentation introduces Bilibili's lakehouse integration of Trino and its role in the overall architecture: data is ingested in real time or in batch, stored on HDFS with Iceberg as the table format, and queried by Trino running in a containerized environment.
1. Compute Engine Position
Trino is positioned as the query engine atop Iceberg tables, with data stored in HDFS. It supports various data ingestion methods and leverages Iceberg features such as Optimize, Sorting, OpenAPI, Index Building, and Cube Building to reduce the optimization burden on users.
2. Indexing Work
Multiple index types are implemented, including Min/Max, BloomFilter, BitMap, BloomRF, TokenBloomFilter, and TokenBitMap, with lightweight indexes stored in Iceberg manifests and heavyweight indexes in separate files. Trino schedules splits that carry index metadata, allowing workers to skip irrelevant files during query execution.
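The file-skipping idea behind these indexes can be sketched as follows. This is a minimal simulation, not Bilibili's implementation: the `SplitIndex` layout, the toy hash scheme, and the column names are all hypothetical. It shows how a Min/Max index prunes on a numeric predicate and a BloomFilter index prunes on an equality predicate, so only surviving splits are scheduled to workers.

```python
from dataclasses import dataclass, field

def bloom_hashes(value: str, num_bits: int = 64, k: int = 3):
    # k bit positions per value; a toy stand-in for a real BloomFilter's hash family.
    return [hash((value, i)) % num_bits for i in range(k)]

@dataclass
class SplitIndex:
    # Lightweight per-file index metadata attached to a split (hypothetical layout).
    min_val: int                          # Min/Max index on a numeric column
    max_val: int
    bloom_bits: set = field(default_factory=set)  # BloomFilter bits for a string column

def might_contain(idx: SplitIndex, point: int, token: str) -> bool:
    # Min/Max rules the file out if the point falls outside its value range.
    if not (idx.min_val <= point <= idx.max_val):
        return False
    # BloomFilter rules the file out if any hash bit is unset (no false negatives).
    return all(b in idx.bloom_bits for b in bloom_hashes(token))

splits = [
    SplitIndex(0, 99, set(bloom_hashes("uid_1"))),
    SplitIndex(100, 199, set(bloom_hashes("uid_2"))),
]
# Predicate: id = 150 AND user = 'uid_2' -- only the second split survives pruning.
survivors = [i for i, s in enumerate(splits) if might_contain(s, 150, "uid_2")]
```

A BloomFilter can produce false positives but never false negatives, which is why it is safe for pruning: a skipped file is guaranteed not to contain the value.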
3. Pre‑computation
Pre‑computation optimizes aggregation and join operations by defining virtual association columns and generating CUBE files via Iceberg actions. This reduces query latency dramatically, enabling sub‑second responses for complex joins.
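The effect of pre‑computation can be illustrated with a toy sketch. The table contents and the `query_views` helper are hypothetical; the point is that an offline job (in the real system, an Iceberg action producing CUBE files) collapses the scan‑and‑aggregate work into a keyed lookup at query time.

```python
# Raw fact rows: (category, day, views). In production this would be an Iceberg table.
fact = [("video", "2024-01-01", 10), ("video", "2024-01-01", 5), ("live", "2024-01-01", 7)]

# Offline pre-computation: aggregate once, keyed by the group-by columns.
cube = {}
for category, day, views in fact:
    cube[(category, day)] = cube.get((category, day), 0) + views

def query_views(category: str, day: str) -> int:
    # At query time the aggregation is a single dictionary lookup, not a scan.
    return cube.get((category, day), 0)
```

The same idea extends to joins: materializing the join result keyed by the association columns turns a runtime join into a lookup, which is what enables the sub‑second responses mentioned above.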
4. Iceberg Statistics Optimization
Trino leverages Iceberg manifest statistics (min/max/count) to avoid full data file reads, and utilizes sort order metadata to optimize Sort and TopN nodes, achieving significant performance improvements in both aggregation and log query scenarios.
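A simplified sketch of answering aggregates from manifest statistics alone. The `FileStats` record below is a stand‑in for the per‑file column metrics Iceberg keeps in its manifests (row counts, lower/upper bounds); field names here are illustrative, not the Iceberg spec's.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    # Per-data-file statistics as recorded in a manifest (simplified, hypothetical names).
    row_count: int
    min_ts: int
    max_ts: int

manifest = [FileStats(1_000, 100, 199), FileStats(2_000, 200, 299)]

def count_rows() -> int:
    # COUNT(*) answered purely from manifest metadata -- no data file is opened.
    return sum(f.row_count for f in manifest)

def global_min_ts() -> int:
    # MIN(ts) likewise falls out of the per-file lower bounds.
    return min(f.min_ts for f in manifest)
```

Sort order metadata helps in a similar way: if the engine knows the files are sorted on the query's ORDER BY column, a Sort or TopN node can read only a prefix of the data instead of sorting everything.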
5. Stability Enhancements
Stability is improved by restricting cross joins, disallowing pure SortNode queries, and enforcing partition predicates. A transparent upgrade mechanism using the Yuuni HTTP proxy enables seamless cluster migrations without user impact.
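The query guardrails described above can be sketched as a simple plan validator. The dictionary‑shaped `query_plan` is a hypothetical stand‑in for Trino's analyzed plan; in practice such checks would live in an analyzer or event‑listener plugin.

```python
def validate(query_plan: dict) -> None:
    # Reject plans that would destabilize the cluster (sketch, not Trino's API).
    if query_plan.get("has_cross_join"):
        raise ValueError("cross joins are disabled on this cluster")
    if query_plan.get("partitioned_table") and not query_plan.get("has_partition_predicate"):
        raise ValueError("queries on partitioned tables must filter on the partition column")

# A well-formed plan passes silently; an unfiltered scan of a partitioned table is rejected.
validate({"has_cross_join": False,
          "partitioned_table": True,
          "has_partition_predicate": True})
```

Rejecting such queries before execution is cheaper than killing them mid‑flight, since a runaway cross join or full‑table scan has already consumed cluster resources by the time it is noticed.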
6. Containerization
Two deployment models are discussed: traditional physical machines and containerized environments. Containerization offers resource pooling, custom CPU/memory allocation, baseline core management, mixed deployment, easier rollback, and higher CPU utilization, though it currently lacks some operational tools.
Conclusion
The session summarizes the practical experiences and lessons learned from deploying Trino in a lakehouse setting, highlighting the benefits of indexing, pre‑computation, Iceberg metadata usage, stability safeguards, and containerized operations.
DataFunSummit