Big Data 15 min read

How Uber Accelerated Presto Queries with Alluxio Local Cache

Uber processes over 500,000 daily Presto queries across 20 clusters handling more than 50 PB of data, and by deploying Alluxio Local Cache on NVMe disks they raised cache‑hit rates from roughly 65% to over 90% while addressing real‑time partition updates, node churn, and cache‑size constraints.

Past Memory Big Data

Nov 15, 2022

How Uber Accelerated Presto Queries with Alluxio Local Cache

Background

Presto is a core analytics engine at Uber, supporting dashboards, pricing decisions, compliance, growth marketing, and ad‑hoc analysis. The platform serves 9,000 daily active users, runs 500 K queries per day, and scans over 50 PB of data across two data centers with 7,000 nodes and 20 Presto clusters.

Uber Presto Deployment

Current Architecture

UI/Client layer : internal dashboards, Google Data Studio, Tableau, and services that communicate with Presto via JDBC or query parsing.

Routing layer : load‑balances queries to clusters based on statistics such as query count, task count, CPU and memory usage.

Presto clusters : multiple clusters that talk to Hive, HDFS, Pinot, etc., and perform joins across connectors.

Each layer includes internal monitoring and Kerberos support.

Alluxio Local Cache

Alluxio was deployed in three production environments, each with more than 200 nodes, using Alluxio Local Cache mode that leverages the NVMe disks on Presto workers. Only a selected subset of data is cached.

When an external API reads from HDFS, the system first checks the cache; a hit reads directly from the local SSD, otherwise the data is fetched from remote HDFS and cached for future reads. Cache‑hit rate therefore has a major impact on overall performance.

Main Challenges and Solutions

Challenge 1: Real‑time Partition Updates

Uber continuously inserts data into Hudi tables, causing partitions to change. Using only the partition ID as the cache key leads to stale data being served.

Solution: Add Hive modification time to cache key

Old cache key: hdfs://<path> New cache key: hdfs://<path><mod time> Presto obtains the latest modification time via the HDFS listDirectory API during split processing, ensuring that newly updated partitions are cached with the correct timestamp. A narrow race window may still allow a very recent update to be missed, but the impact is limited compared to serving stale data without a cache.

Challenge 2: Cluster Node Changes

Presto’s soft‑affinity scheduler originally used a simple modulo hash (e.g., key % 3 nodes = worker#1). Adding or removing a node reshuffles the hash ring, breaking cache locality and reducing hit rates.

Solution: Consistent hashing based on node ID

All nodes are placed on a virtual ring; the relative order remains stable when nodes join or leave, so the same key maps to the same logical position regardless of cluster size. Replication is also used to improve robustness.

Challenge 3: Cache Size Limits

Each worker has only 500 GB of local SSD, far less than the daily 50 PB data scan. Caching everything is impossible, and aggressive eviction harms performance.

Solution: Cache filter

A configurable filter decides which tables and how many partitions to cache. The static configuration is tuned based on access frequency, table importance, and mutability.

Applying the filter raised the cache‑hit rate from ~65 % to >90 %.

Metadata Optimizations

File‑level Metadata

File‑level metadata records each file’s last modification time and the cached byte‑range. The metadata is persisted on disk so it survives worker restarts.

High‑level Approach

When a file is updated, a new timestamp version is created; the system stores the new page in a new folder and eventually deletes the old timestamp folder.

Cache Data and Metadata Structures

Two timestamped directories (timestamp1, timestamp2) may coexist briefly during high‑concurrency periods. A protobuf‑encoded metadata file tracks the latest timestamp and partition ranges, enabling the local cache to read only the newest data.

Metadata‑aware Cache Context

Alluxio requires the compute engine to pass metadata. Uber added HiveFileContext on the Presto side for each data file. When openFile is called, Alluxio creates a PrestoCacheContext that stores the HiveFileContext, quota, cache identifier (MD5 of file path), and scope (database, schema, table, partition).

During query execution, Presto aggregates metrics such as bytes read from cache versus bytes fetched from HDFS. These metrics flow back through HiveFileContext → CacheContext → RuntimeStats, making the information visible in Presto’s UI or JSON output.

Future Work

Automatic table‑caching via Alluxio Shadow Cache.

Improved handling of rapidly changing Hudi partitions.

Load‑balancing enhancements.

Addressing GC latency caused by longer CacheContext lifetimes.

Implementing a semantic cache based on the file‑level metadata.

Replacing protobuf with flatbuf for faster deserialization.

Conclusion

The Alluxio Local Cache solution has been running in production for over a quarter of Uber’s Presto workload, delivering significant interactive‑query performance gains with minimal maintenance overhead.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance optimization Big Data metadata Local Cache presto Alluxio Consistent Hashing

Written by

Past Memory Big Data

A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.