Trino in Bilibili Lakehouse: Compute Engine, Stability, and Containerization Practices
This article presents Bilibili's practice of running Trino in a lakehouse architecture. It covers where the compute engine sits in the stack, the stability enhancements, and the containerized deployment, and details the indexing strategies, pre‑computation techniques, and Iceberg metadata optimizations behind the performance gains for large‑scale analytical queries.
Overview
The presentation introduces Bilibili's lakehouse integration of Trino and its role in the overall architecture: data is ingested in real time or in batch, stored on HDFS with Iceberg as the table format, and queried by Trino running in a containerized environment.
1. Compute Engine Position
Trino is positioned as the query engine atop Iceberg tables, with data stored in HDFS. It supports various data ingestion methods and leverages Iceberg features such as Optimize, Sorting, OpenAPI, Index Building, and Cube Building to reduce the optimization burden on users.
2. Indexing Work
Multiple index types are implemented, including Min/Max, BloomFilter, BitMap, BloomRF, TokenBloomFilter, and TokenBitMap, with lightweight indexes stored in Iceberg manifests and heavyweight indexes in separate files. Trino schedules splits that carry index metadata, allowing workers to skip irrelevant files during query execution.
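The file-skipping idea behind these indexes can be sketched as follows. This is a minimal simulation, not Bilibili's implementation: the `SplitIndex` layout, the toy hash scheme, and the column names are all hypothetical. It shows how a Min/Max index prunes on a numeric predicate and a BloomFilter index prunes on an equality predicate, so only surviving splits are scheduled to workers.

```python
from dataclasses import dataclass, field

def bloom_hashes(value: str, num_bits: int = 64, k: int = 3):
    # k bit positions per value; a toy stand-in for a real BloomFilter's hash family.
    return [hash((value, i)) % num_bits for i in range(k)]

@dataclass
class SplitIndex:
    # Lightweight per-file index metadata attached to a split (hypothetical layout).
    min_val: int                          # Min/Max index on a numeric column
    max_val: int
    bloom_bits: set = field(default_factory=set)  # BloomFilter bits for a string column

def might_contain(idx: SplitIndex, point: int, token: str) -> bool:
    # Min/Max rules the file out if the point falls outside its value range.
    if not (idx.min_val <= point <= idx.max_val):
        return False
    # BloomFilter rules the file out if any hash bit is unset (no false negatives).
    return all(b in idx.bloom_bits for b in bloom_hashes(token))

splits = [
    SplitIndex(0, 99, set(bloom_hashes("uid_1"))),
    SplitIndex(100, 199, set(bloom_hashes("uid_2"))),
]
# Predicate: id = 150 AND user = 'uid_2' -- only the second split survives pruning.
survivors = [i for i, s in enumerate(splits) if might_contain(s, 150, "uid_2")]
```

A BloomFilter can produce false positives but never false negatives, which is why it is safe for pruning: a skipped file is guaranteed not to contain the value.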
3. Pre‑computation
Pre‑computation optimizes aggregation and join operations by defining virtual association columns and generating CUBE files via Iceberg actions. This reduces query latency dramatically, enabling sub‑second responses for complex joins.
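The effect of pre‑computation can be illustrated with a toy sketch. The table contents and the `query_views` helper are hypothetical; the point is that an offline job (in the real system, an Iceberg action producing CUBE files) collapses the scan‑and‑aggregate work into a keyed lookup at query time.

```python
# Raw fact rows: (category, day, views). In production this would be an Iceberg table.
fact = [("video", "2024-01-01", 10), ("video", "2024-01-01", 5), ("live", "2024-01-01", 7)]

# Offline pre-computation: aggregate once, keyed by the group-by columns.
cube = {}
for category, day, views in fact:
    cube[(category, day)] = cube.get((category, day), 0) + views

def query_views(category: str, day: str) -> int:
    # At query time the aggregation is a single dictionary lookup, not a scan.
    return cube.get((category, day), 0)
```

The same idea extends to joins: materializing the join result keyed by the association columns turns a runtime join into a lookup, which is what enables the sub‑second responses mentioned above.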
4. Iceberg Statistics Optimization
Trino leverages Iceberg manifest statistics (min/max/count) to avoid full data file reads, and utilizes sort order metadata to optimize Sort and TopN nodes, achieving significant performance improvements in both aggregation and log query scenarios.
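A simplified sketch of answering aggregates from manifest statistics alone. The `FileStats` record below is a stand‑in for the per‑file column metrics Iceberg keeps in its manifests (row counts, lower/upper bounds); field names here are illustrative, not the Iceberg spec's.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    # Per-data-file statistics as recorded in a manifest (simplified, hypothetical names).
    row_count: int
    min_ts: int
    max_ts: int

manifest = [FileStats(1_000, 100, 199), FileStats(2_000, 200, 299)]

def count_rows() -> int:
    # COUNT(*) answered purely from manifest metadata -- no data file is opened.
    return sum(f.row_count for f in manifest)

def global_min_ts() -> int:
    # MIN(ts) likewise falls out of the per-file lower bounds.
    return min(f.min_ts for f in manifest)
```

Sort order metadata helps in a similar way: if the engine knows the files are sorted on the query's ORDER BY column, a Sort or TopN node can read only a prefix of the data instead of sorting everything.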
5. Stability Enhancements
Stability is improved by restricting cross joins, disallowing pure SortNode queries, and enforcing partition predicates. A transparent upgrade mechanism using the Yuuni HTTP proxy enables seamless cluster migrations without user impact.
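The query guardrails described above can be sketched as a simple plan validator. The dictionary‑shaped `query_plan` is a hypothetical stand‑in for Trino's analyzed plan; in practice such checks would live in an analyzer or event‑listener plugin.

```python
def validate(query_plan: dict) -> None:
    # Reject plans that would destabilize the cluster (sketch, not Trino's API).
    if query_plan.get("has_cross_join"):
        raise ValueError("cross joins are disabled on this cluster")
    if query_plan.get("partitioned_table") and not query_plan.get("has_partition_predicate"):
        raise ValueError("queries on partitioned tables must filter on the partition column")

# A well-formed plan passes silently; an unfiltered scan of a partitioned table is rejected.
validate({"has_cross_join": False,
          "partitioned_table": True,
          "has_partition_predicate": True})
```

Rejecting such queries before execution is cheaper than killing them mid‑flight, since a runaway cross join or full‑table scan has already consumed cluster resources by the time it is noticed.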
6. Containerization
Two deployment models are discussed: traditional physical machines and containerized environments. Containerization offers resource pooling, custom CPU/memory allocation, baseline core management, mixed deployment, easier rollback, and higher CPU utilization, though it currently lacks some operational tools.
Conclusion
The session summarizes the practical experiences and lessons learned from deploying Trino in a lakehouse setting, highlighting the benefits of indexing, pre‑computation, Iceberg metadata usage, stability safeguards, and containerized operations.
DataFunSummit