Big Data 32 min read

How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

This article details Bilibili's end‑to‑end Presto on Hadoop architecture, covering the multi‑engine SQL stack, dispatcher routing, cluster scale, stability enhancements like coordinator HA and real‑time punish, query limits, Hive UDF compatibility, insert‑overwrite support, Alluxio caching, multi‑datacenter routing, query result caching, Raptorx local cache, JDK upgrades, dynamic filtering, and future roadmap, illustrating how these innovations boosted query throughput and reduced latency.

dbaplus Community
dbaplus Community
dbaplus Community
How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

Overall Architecture

Bilibili’s offline platform integrates three compute engines—Presto, Spark, and Hive—with HDFS storage and YARN scheduling. A custom Dispatcher routes SQL queries to the most suitable engine based on query syntax, data size, and engine load. All components are deployed on a unified Kubernetes platform and protected by Apache Ranger for fine‑grained permission control.

Architecture diagram
Architecture diagram

Query Statistics

In ad‑hoc workloads Presto handles ~50 % of queries, while Spark dominates ETL with a 64 % share. Daily query volume reaches ~160 k queries, processing about 10 PB of HDFS data—approximately ten‑fold growth since early 2020.

Query statistics chart
Query statistics chart

Presto Features and Advantages

Presto (now Trino) is an open‑source MPP SQL engine that executes queries fully in memory using a streaming pipeline, avoiding shuffle writes. Key advantages:

Shuffle data never lands on disk.

Streaming execution instead of stage‑based.

Thread‑level split scheduling.

Pluggable data‑source connectors.

Because it lacks task‑level recovery, Presto is less suitable for large‑scale ETL without additional safeguards.

Use Cases

Presto serves ad‑hoc queries, BI reporting, data‑quality checks (DQC), AI‑driven ETL jobs, and data exploration.

Cluster Scale

Seven Presto clusters span two data centers; the largest cluster exceeds 400 nodes, with a total node count over 1,000.

Stability Improvements

1. Coordinator High Availability

Implemented a multi‑active coordinator design. Queries are mapped to a randomly chosen active coordinator and the mapping is stored in Redis to guarantee consistency across retries.

Coordinator HA diagram
Coordinator HA diagram

2. Label‑Based Scheduling

Labels are loaded into memory and refreshed on change. The Dispatcher routes queries to nodes whose labels match the required workload, isolating heavy ad‑hoc queries from other services.

Label scheduling diagram
Label scheduling diagram

3. Real‑Time Punishment

When a resource group exceeds its CPU quota, a punish service notifies all workers to stop scheduling tasks for that group until usage falls below the limit. Pseudocode:

long cSum = lastCSum + usagePerSecond;
if (cSum <= punishCpuLimit) {
    cSum = 0;
} else if (cSum >= 2 * punishCpuLimit) {
    // mark group for punishment
    cSum = cSum - punishCpuLimit;
} else if (punishCpuLimit < cSum && cSum < 2 * punishCpuLimit) {
    cSum = cSum - punishCpuLimit;
}

4. Gateway and Coordinator Refactoring

Support multi‑coordinator scheduling; a query is handled by only one coordinator.

Detect inactive coordinators and exclude them from the dispatch list.

Route queries based on user or job ID to the appropriate data center.

Balance load across coordinators by monitoring memory and job queues.

5. Worker‑Side Enhancements

OOM‑killer service monitors JVM heap usage and kills the largest memory‑consuming task when thresholds are breached.

JMX metrics are exported to Grafana for real‑time monitoring and alerting.

Additional Functional Enhancements

Implicit Type Conversion

Enabled implicit casting similar to Hive. After setting implicit_conversion=true, expressions such as SELECT 1 = '1' succeed.

hive> SELECT 1 = '1';
true

presto> SET SESSION implicit_conversion=true;
SET SESSION

presto> SELECT 1 = '1';
_col0
-------
true

Hive UDF Compatibility

Hive built‑in UDFs and GenericUDFs are loaded via a custom classloader. Missing UDFs are fetched from the Hive metastore, downloaded, and loaded at runtime with thread‑local isolation.

presto> SELECT b_security_mask_email('[email protected]',0);
_col0
------------------
1*[email protected]

INSERT OVERWRITE Support

Added support for INSERT OVERWRITE TABLE and INSERT OVERWRITE DIRECTORY by translating the AST node to an INSERT with an overwrite flag, enabling BI tools to export results directly to HDFS.

presto> INSERT OVERWRITE TABLE tmp_db.tmp_table SELECT '1' AS a, '2' AS b;
INSERT OVERWRITE: 1 row

presto> INSERT OVERWRITE DIRECTORY "/tmp/xxx/insert1" SELECT value FROM ai.xxxTable LIMIT 10;
 rows
------
   10

Ranger Plugin Integration

Extended Ranger 1.2 plugin to understand Presto policies, providing unified table, column, and row‑level masking across Hive, Spark, and Presto.

SQL Hint Support

Session parameters can be embedded as hints, e.g., /*+ query_max_execution_time='1h', scale_writers=true */, to control join types, cache usage, and spill behavior per query.

/*+ query_max_execution_time='1h', scale_writers=true */
SELECT col1, col2 FROM tbl WHERE log_date='20211019'

HAVING and GROUP BY Alias

Presto now permits aliases in HAVING and GROUP BY clauses, fixing errors such as

SELECT log_date, SUM(job_percent) AS percent FROM tbl GROUP BY log_date HAVING percent > 0

.

presto> SELECT log_date, SUM(job_percent) AS percent FROM test.test_report GROUP BY log_date HAVING percent > 0;
log_date | percent
----------+-----------------------
20211231 | 0.03625346663870051

Struct Field Push‑Down

Back‑ported Trino 334’s struct field push‑down, dramatically reducing shuffle size for nested data and yielding multi‑fold performance gains.

SELECT A.ip, B.info.mid FROM tmp_bdp.tmp_struct_test A JOIN tmp_bdp.tmp_struct_test B ON A.ip = B.ip;

JDK Upgrade

Upgraded runtime from JDK 8 to Zulu JDK 11. ZGC offered lower pause times but lacked class‑unloading, so G1 was retained, delivering a ~2 % throughput improvement.

Dynamic Filtering

Implemented runtime‑generated dynamic filters for high‑selectivity joins. In a benchmark, left‑side data read dropped from 6.36 TB to 358 GB. Architecture adds a PredicatePushDown optimizer, a Collect operator on workers, and a DynamicFilterService to aggregate and distribute filter information.

Dynamic filtering diagram
Dynamic filtering diagram

Performance Boosts

Alluxio Caching

Hot tables identified via lineage analysis are cached in Alluxio, reducing HDFS read latency from tens of seconds to sub‑10 ms and delivering 20‑30 % query speedup on TPC‑DS benchmarks.

Alluxio caching performance
Alluxio caching performance

Multi‑Data‑Center Routing

Presto clusters are deployed in two data centers. The gateway parses query tables and partitions, estimates cross‑DC traffic, and routes jobs to the data center where most data resides, applying push‑down optimizations for cross‑DC joins.

Multi‑DC routing diagram
Multi‑DC routing diagram

Query Result Cache

Repeated queries (over 10 k daily) are cached using a Redis key composed of query MD5 and involved tables/partitions. Cached results bypass the scheduler, reducing execution time from ~7 s to ~300 ms.

Query result cache diagram
Query result cache diagram

Raptorx Local Cache

Raptorx introduces a page‑level (1 MB) local cache on workers, using soft‑affinity scheduling to keep hot partitions on the same node. Benchmarks show several‑fold speedups for queries that benefit from cached pages.

Raptorx local cache performance
Raptorx local cache performance

Other Optimizations

Merge small file splits into larger ones to reduce scheduler pressure.

Use HDFS Observer Namenode to mitigate slow RPCs.

Enable FileStatus cache to lower NameNode RPC load.

Activate spill‑to‑disk for large queries.

Cache execution plans on the coordinator.

Automatic phased scheduling for multi‑stage queries.

Improve CBO estimates when table statistics are missing.

Future Roadmap

Introduce Horizontal Pod Autoscaling (HPA) for Presto workers within YARN to auto‑scale during low‑peak periods.

Heuristic indexing to pre‑filter splits before reading.

Automatic materialized view creation and refresh based on frequent query patterns.

Optimizations for complex types (Array/Map) reads.

Route massive ETL jobs to Presto on Spark via HBO to relieve Presto pressure.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsPerformance OptimizationBig DataSQLCluster ManagementPrestoHadoop
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.