Improving Bilibili Offline Cluster Performance with Presto and Alluxio
This technical presentation explains how Bilibili reduced database pressure and query latency in its production environment by integrating Presto with Alluxio, detailing the offline cluster architecture, challenges of compute‑storage separation, caching strategies, consistency mechanisms, performance gains, and future work.
In Bilibili's online production environment massive amounts of data are accessed, causing high resource consumption and pressure on database systems; to improve data synchronization and query efficiency the team adopted Presto together with Alluxio.
The offline cluster architecture consists of five layers: BI/reporting services, a Dispatcher routing engine, compute engines (Spark, Hive, Presto), Ranger for fine‑grained permission control, and an ETL/YARN scheduling platform. Dispatcher analyses query size and engine load to route jobs, while Coral converts incompatible SQL syntax and Ranger enforces table‑ and column‑level policies.
Presto queries are routed through a modified Presto‑Gateway that supports multi‑region coordinators and active‑active coordinators, enabling high availability across data centers.
Key issues of the compute‑storage separation model include network overhead, unstable query performance due to slow RPC or hot DataNodes, lack of data locality, and repeated access to hot data.
Alluxio integration addresses three problems: scheme mismatch between HDFS and Alluxio, determining which data to cache, and ensuring consistency with HDFS. The team uses a tag‑based approach, analyzing query lineage to compute table/partition hotness, applying TTL thresholds, and synchronizing metadata via Hive meta‑event listeners to invalidate or refresh cached partitions.
Local cache enhancements embed Alluxio directly in Presto workers, providing page‑level caching, soft‑affinity scheduling, multi‑disk support based on available space, and startup optimizations to preload cache data, while preserving compatibility with HDFS and Viewfs schemes.
Performance tests show that enabling Alluxio reduces query time by roughly 20 % and achieves a cache hit rate of about 40 %; around 30 % of BI workloads now use Alluxio, caching 45 TB of data. Future work includes expanding the local mode to more clusters, supporting TextFile format, implementing disk health detection, and refining soft‑affinity scheduling.
The Q&A clarified the downgrade path using Coral, confirmed the use of PrestoSQL (Trino) 330, and described the tag‑based criteria for selecting tables to route through Alluxio.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
