How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains
This article details Bilibili's end‑to‑end Presto on Hadoop architecture, covering the multi‑engine SQL stack, dispatcher routing, cluster scale, stability enhancements like coordinator HA and real‑time punish, query limits, Hive UDF compatibility, insert‑overwrite support, Alluxio caching, multi‑datacenter routing, query result caching, Raptorx local cache, JDK upgrades, dynamic filtering, and future roadmap, illustrating how these innovations boosted query throughput and reduced latency.
Overall Architecture
Bilibili’s offline platform integrates three compute engines—Presto, Spark, and Hive—with HDFS storage and YARN scheduling. A custom Dispatcher routes SQL queries to the most suitable engine based on query syntax, data size, and engine load. All components are deployed on a unified Kubernetes platform and protected by Apache Ranger for fine‑grained permission control.
Query Statistics
In ad‑hoc workloads Presto handles ~50 % of queries, while Spark dominates ETL with a 64 % share. Daily query volume reaches ~160 k queries, processing about 10 PB of HDFS data—approximately ten‑fold growth since early 2020.
Presto Features and Advantages
Presto (now Trino) is an open‑source MPP SQL engine that executes queries fully in memory using a streaming pipeline, avoiding shuffle writes. Key advantages:
Shuffle data never lands on disk.
Streaming execution instead of stage‑based.
Thread‑level split scheduling.
Pluggable data‑source connectors.
Because it lacks task‑level recovery, Presto is less suitable for large‑scale ETL without additional safeguards.
Use Cases
Presto serves ad‑hoc queries, BI reporting, data‑quality checks (DQC), AI‑driven ETL jobs, and data exploration.
Cluster Scale
Seven Presto clusters span two data centers; the largest cluster exceeds 400 nodes, with a total node count over 1,000.
Stability Improvements
1. Coordinator High Availability
Implemented a multi‑active coordinator design. Queries are mapped to a randomly chosen active coordinator and the mapping is stored in Redis to guarantee consistency across retries.
2. Label‑Based Scheduling
Labels are loaded into memory and refreshed on change. The Dispatcher routes queries to nodes whose labels match the required workload, isolating heavy ad‑hoc queries from other services.
3. Real‑Time Punishment
When a resource group exceeds its CPU quota, a punish service notifies all workers to stop scheduling tasks for that group until usage falls below the limit. Pseudocode:
long cSum = lastCSum + usagePerSecond;
if (cSum <= punishCpuLimit) {
cSum = 0;
} else if (cSum >= 2 * punishCpuLimit) {
// mark group for punishment
cSum = cSum - punishCpuLimit;
} else if (punishCpuLimit < cSum && cSum < 2 * punishCpuLimit) {
cSum = cSum - punishCpuLimit;
}4. Gateway and Coordinator Refactoring
Support multi‑coordinator scheduling; a query is handled by only one coordinator.
Detect inactive coordinators and exclude them from the dispatch list.
Route queries based on user or job ID to the appropriate data center.
Balance load across coordinators by monitoring memory and job queues.
5. Worker‑Side Enhancements
OOM‑killer service monitors JVM heap usage and kills the largest memory‑consuming task when thresholds are breached.
JMX metrics are exported to Grafana for real‑time monitoring and alerting.
Additional Functional Enhancements
Implicit Type Conversion
Enabled implicit casting similar to Hive. After setting implicit_conversion=true, expressions such as SELECT 1 = '1' succeed.
hive> SELECT 1 = '1';
true
presto> SET SESSION implicit_conversion=true;
SET SESSION
presto> SELECT 1 = '1';
_col0
-------
trueHive UDF Compatibility
Hive built‑in UDFs and GenericUDFs are loaded via a custom classloader. Missing UDFs are fetched from the Hive metastore, downloaded, and loaded at runtime with thread‑local isolation.
presto> SELECT b_security_mask_email('[email protected]',0);
_col0
------------------
1*[email protected]INSERT OVERWRITE Support
Added support for INSERT OVERWRITE TABLE and INSERT OVERWRITE DIRECTORY by translating the AST node to an INSERT with an overwrite flag, enabling BI tools to export results directly to HDFS.
presto> INSERT OVERWRITE TABLE tmp_db.tmp_table SELECT '1' AS a, '2' AS b;
INSERT OVERWRITE: 1 row
presto> INSERT OVERWRITE DIRECTORY "/tmp/xxx/insert1" SELECT value FROM ai.xxxTable LIMIT 10;
rows
------
10Ranger Plugin Integration
Extended Ranger 1.2 plugin to understand Presto policies, providing unified table, column, and row‑level masking across Hive, Spark, and Presto.
SQL Hint Support
Session parameters can be embedded as hints, e.g., /*+ query_max_execution_time='1h', scale_writers=true */, to control join types, cache usage, and spill behavior per query.
/*+ query_max_execution_time='1h', scale_writers=true */
SELECT col1, col2 FROM tbl WHERE log_date='20211019'HAVING and GROUP BY Alias
Presto now permits aliases in HAVING and GROUP BY clauses, fixing errors such as
SELECT log_date, SUM(job_percent) AS percent FROM tbl GROUP BY log_date HAVING percent > 0.
presto> SELECT log_date, SUM(job_percent) AS percent FROM test.test_report GROUP BY log_date HAVING percent > 0;
log_date | percent
----------+-----------------------
20211231 | 0.03625346663870051Struct Field Push‑Down
Back‑ported Trino 334’s struct field push‑down, dramatically reducing shuffle size for nested data and yielding multi‑fold performance gains.
SELECT A.ip, B.info.mid FROM tmp_bdp.tmp_struct_test A JOIN tmp_bdp.tmp_struct_test B ON A.ip = B.ip;JDK Upgrade
Upgraded runtime from JDK 8 to Zulu JDK 11. ZGC offered lower pause times but lacked class‑unloading, so G1 was retained, delivering a ~2 % throughput improvement.
Dynamic Filtering
Implemented runtime‑generated dynamic filters for high‑selectivity joins. In a benchmark, left‑side data read dropped from 6.36 TB to 358 GB. Architecture adds a PredicatePushDown optimizer, a Collect operator on workers, and a DynamicFilterService to aggregate and distribute filter information.
Performance Boosts
Alluxio Caching
Hot tables identified via lineage analysis are cached in Alluxio, reducing HDFS read latency from tens of seconds to sub‑10 ms and delivering 20‑30 % query speedup on TPC‑DS benchmarks.
Multi‑Data‑Center Routing
Presto clusters are deployed in two data centers. The gateway parses query tables and partitions, estimates cross‑DC traffic, and routes jobs to the data center where most data resides, applying push‑down optimizations for cross‑DC joins.
Query Result Cache
Repeated queries (over 10 k daily) are cached using a Redis key composed of query MD5 and involved tables/partitions. Cached results bypass the scheduler, reducing execution time from ~7 s to ~300 ms.
Raptorx Local Cache
Raptorx introduces a page‑level (1 MB) local cache on workers, using soft‑affinity scheduling to keep hot partitions on the same node. Benchmarks show several‑fold speedups for queries that benefit from cached pages.
Other Optimizations
Merge small file splits into larger ones to reduce scheduler pressure.
Use HDFS Observer Namenode to mitigate slow RPCs.
Enable FileStatus cache to lower NameNode RPC load.
Activate spill‑to‑disk for large queries.
Cache execution plans on the coordinator.
Automatic phased scheduling for multi‑stage queries.
Improve CBO estimates when table statistics are missing.
Future Roadmap
Introduce Horizontal Pod Autoscaling (HPA) for Presto workers within YARN to auto‑scale during low‑peak periods.
Heuristic indexing to pre‑filter splits before reading.
Automatic materialized view creation and refresh based on frequent query patterns.
Optimizations for complex types (Array/Map) reads.
Route massive ETL jobs to Presto on Spark via HBO to relieve Presto pressure.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
