Big Data 32 min read

How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

This article details Bilibili's end‑to‑end Presto on Hadoop architecture, covering the multi‑engine SQL stack, dispatcher routing, cluster scale, stability enhancements like coordinator HA and real‑time punish, query limits, Hive UDF compatibility, insert‑overwrite support, Alluxio caching, multi‑datacenter routing, query result caching, Raptorx local cache, JDK upgrades, dynamic filtering, and future roadmap, illustrating how these innovations boosted query throughput and reduced latency.

dbaplus Community

May 12, 2022

How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

Overall Architecture

Bilibili’s offline platform integrates three compute engines—Presto, Spark, and Hive—with HDFS storage and YARN scheduling. A custom Dispatcher routes SQL queries to the most suitable engine based on query syntax, data size, and engine load. All components are deployed on a unified Kubernetes platform and protected by Apache Ranger for fine‑grained permission control.

Query Statistics

In ad‑hoc workloads Presto handles ~50 % of queries, while Spark dominates ETL with a 64 % share. Daily query volume reaches ~160 k queries, processing about 10 PB of HDFS data—approximately ten‑fold growth since early 2020.

Presto Features and Advantages

Presto (now Trino) is an open‑source MPP SQL engine that executes queries fully in memory using a streaming pipeline, avoiding shuffle writes. Key advantages:

Shuffle data never lands on disk.

Streaming execution instead of stage‑based.

Thread‑level split scheduling.

Pluggable data‑source connectors.

Because it lacks task‑level recovery, Presto is less suitable for large‑scale ETL without additional safeguards.

Use Cases

Presto serves ad‑hoc queries, BI reporting, data‑quality checks (DQC), AI‑driven ETL jobs, and data exploration.

Cluster Scale

Seven Presto clusters span two data centers; the largest cluster exceeds 400 nodes, with a total node count over 1,000.

Stability Improvements

1. Coordinator High Availability

Implemented a multi‑active coordinator design. Queries are mapped to a randomly chosen active coordinator and the mapping is stored in Redis to guarantee consistency across retries.

2. Label‑Based Scheduling

Labels are loaded into memory and refreshed on change. The Dispatcher routes queries to nodes whose labels match the required workload, isolating heavy ad‑hoc queries from other services.

3. Real‑Time Punishment

When a resource group exceeds its CPU quota, a punish service notifies all workers to stop scheduling tasks for that group until usage falls below the limit. Pseudocode:

long cSum = lastCSum + usagePerSecond;
if (cSum <= punishCpuLimit) {
    cSum = 0;
} else if (cSum >= 2 * punishCpuLimit) {
    // mark group for punishment
    cSum = cSum - punishCpuLimit;
} else if (punishCpuLimit < cSum && cSum < 2 * punishCpuLimit) {
    cSum = cSum - punishCpuLimit;
}

4. Gateway and Coordinator Refactoring

Support multi‑coordinator scheduling; a query is handled by only one coordinator.

Detect inactive coordinators and exclude them from the dispatch list.

Route queries based on user or job ID to the appropriate data center.

Balance load across coordinators by monitoring memory and job queues.

5. Worker‑Side Enhancements

OOM‑killer service monitors JVM heap usage and kills the largest memory‑consuming task when thresholds are breached.

JMX metrics are exported to Grafana for real‑time monitoring and alerting.

Additional Functional Enhancements

Implicit Type Conversion

Enabled implicit casting similar to Hive. After setting implicit_conversion=true, expressions such as SELECT 1 = '1' succeed.

hive> SELECT 1 = '1';
true

presto> SET SESSION implicit_conversion=true;
SET SESSION

presto> SELECT 1 = '1';
_col0
-------
true

Hive UDF Compatibility

Hive built‑in UDFs and GenericUDFs are loaded via a custom classloader. Missing UDFs are fetched from the Hive metastore, downloaded, and loaded at runtime with thread‑local isolation.

presto> SELECT b_security_mask_email('[email protected]',0);
_col0
------------------
1*[email protected]

INSERT OVERWRITE Support

Added support for INSERT OVERWRITE TABLE and INSERT OVERWRITE DIRECTORY by translating the AST node to an INSERT with an overwrite flag, enabling BI tools to export results directly to HDFS.

presto> INSERT OVERWRITE TABLE tmp_db.tmp_table SELECT '1' AS a, '2' AS b;
INSERT OVERWRITE: 1 row

presto> INSERT OVERWRITE DIRECTORY "/tmp/xxx/insert1" SELECT value FROM ai.xxxTable LIMIT 10;
 rows
------
   10

Ranger Plugin Integration

Extended Ranger 1.2 plugin to understand Presto policies, providing unified table, column, and row‑level masking across Hive, Spark, and Presto.

SQL Hint Support

Session parameters can be embedded as hints, e.g., /*+ query_max_execution_time='1h', scale_writers=true */, to control join types, cache usage, and spill behavior per query.

/*+ query_max_execution_time='1h', scale_writers=true */
SELECT col1, col2 FROM tbl WHERE log_date='20211019'

HAVING and GROUP BY Alias

Presto now permits aliases in HAVING and GROUP BY clauses, fixing errors such as

SELECT log_date, SUM(job_percent) AS percent FROM tbl GROUP BY log_date HAVING percent > 0

presto> SELECT log_date, SUM(job_percent) AS percent FROM test.test_report GROUP BY log_date HAVING percent > 0;
log_date | percent
----------+-----------------------
20211231 | 0.03625346663870051

Struct Field Push‑Down

Back‑ported Trino 334’s struct field push‑down, dramatically reducing shuffle size for nested data and yielding multi‑fold performance gains.

SELECT A.ip, B.info.mid FROM tmp_bdp.tmp_struct_test A JOIN tmp_bdp.tmp_struct_test B ON A.ip = B.ip;

JDK Upgrade

Upgraded runtime from JDK 8 to Zulu JDK 11. ZGC offered lower pause times but lacked class‑unloading, so G1 was retained, delivering a ~2 % throughput improvement.

Dynamic Filtering

Implemented runtime‑generated dynamic filters for high‑selectivity joins. In a benchmark, left‑side data read dropped from 6.36 TB to 358 GB. Architecture adds a PredicatePushDown optimizer, a Collect operator on workers, and a DynamicFilterService to aggregate and distribute filter information.

Performance Boosts

Alluxio Caching

Hot tables identified via lineage analysis are cached in Alluxio, reducing HDFS read latency from tens of seconds to sub‑10 ms and delivering 20‑30 % query speedup on TPC‑DS benchmarks.

Multi‑Data‑Center Routing

Presto clusters are deployed in two data centers. The gateway parses query tables and partitions, estimates cross‑DC traffic, and routes jobs to the data center where most data resides, applying push‑down optimizations for cross‑DC joins.

Query Result Cache

Repeated queries (over 10 k daily) are cached using a Redis key composed of query MD5 and involved tables/partitions. Cached results bypass the scheduler, reducing execution time from ~7 s to ~300 ms.

Raptorx Local Cache

Raptorx introduces a page‑level (1 MB) local cache on workers, using soft‑affinity scheduling to keep hot partitions on the same node. Benchmarks show several‑fold speedups for queries that benefit from cached pages.

Other Optimizations

Merge small file splits into larger ones to reduce scheduler pressure.

Use HDFS Observer Namenode to mitigate slow RPCs.

Enable FileStatus cache to lower NameNode RPC load.

Activate spill‑to‑disk for large queries.

Cache execution plans on the coordinator.

Automatic phased scheduling for multi‑stage queries.

Improve CBO estimates when table statistics are missing.

Future Roadmap

Introduce Horizontal Pod Autoscaling (HPA) for Presto workers within YARN to auto‑scale during low‑peak periods.

Heuristic indexing to pre‑filter splits before reading.

Automatic materialized view creation and refresh based on frequent query patterns.

Optimizations for complex types (Array/Map) reads.

Route massive ETL jobs to Presto on Spark via HBO to relieve Presto pressure.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems Performance Optimization Big Data SQL cluster management presto Hadoop

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Overall Architecture

Query Statistics

Presto Features and Advantages

Use Cases

Cluster Scale

Stability Improvements

1. Coordinator High Availability

2. Label‑Based Scheduling

3. Real‑Time Punishment

4. Gateway and Coordinator Refactoring

5. Worker‑Side Enhancements

Additional Functional Enhancements

Implicit Type Conversion

Hive UDF Compatibility

INSERT OVERWRITE Support

Ranger Plugin Integration

SQL Hint Support

HAVING and GROUP BY Alias

Struct Field Push‑Down

JDK Upgrade

Dynamic Filtering

Performance Boosts

Alluxio Caching

Multi‑Data‑Center Routing

Query Result Cache

Raptorx Local Cache

Other Optimizations

Future Roadmap

dbaplus Community

How this landed with the community

Was this worth your time?

0 Comments

INSERT OVERWRITE Support