How Uber Accelerated Presto Queries with Alluxio Local Cache
Uber processes over 500,000 daily Presto queries across 20 clusters handling more than 50 PB of data, and by deploying Alluxio Local Cache on NVMe disks they raised cache‑hit rates from roughly 65% to over 90% while addressing real‑time partition updates, node churn, and cache‑size constraints.
Background
Presto is a core analytics engine at Uber, supporting dashboards, pricing decisions, compliance, growth marketing, and ad‑hoc analysis. The platform serves 9,000 daily active users, runs about 500,000 queries per day, and scans over 50 PB of data. It spans two data centers, 7,000 nodes, and 20 Presto clusters.
Uber Presto Deployment
Current Architecture
UI/Client layer: internal dashboards, Google Data Studio, Tableau, and services that communicate with Presto via JDBC or query parsing.
Routing layer: load‑balances queries to clusters based on statistics such as query count, task count, CPU and memory usage.
Presto clusters: multiple clusters that talk to Hive, HDFS, Pinot, etc., and perform joins across connectors.
Each layer includes internal monitoring and Kerberos support.
Alluxio Local Cache
Alluxio was deployed in three production environments, each with more than 200 nodes, using Alluxio Local Cache mode that leverages the NVMe disks on Presto workers. Only a selected subset of data is cached.
When an external API reads from HDFS, the system first checks the cache; a hit reads directly from the local SSD, otherwise the data is fetched from remote HDFS and cached for future reads. Cache‑hit rate therefore has a major impact on overall performance.
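The read path described above can be sketched as a small read-through cache. This is only an illustration under stated assumptions: a Python dict stands in for the worker's local SSD, and `fake_hdfs_read` is a hypothetical stand-in for a remote HDFS read. It is not Alluxio's implementation.

```python
from typing import Callable, Dict, List


class LocalReadThroughCache:
    """Toy read-through cache; a dict stands in for the local SSD."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote  # hypothetical remote (HDFS) reader
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        # Hit: serve from local storage without touching the remote store.
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Miss: fetch remotely, then keep a copy for future reads.
        self.misses += 1
        data = self._fetch_remote(key)
        self._store[key] = data
        return data

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


remote_reads: List[str] = []


def fake_hdfs_read(path: str) -> bytes:
    # Records each remote read so the effect of caching is visible.
    remote_reads.append(path)
    return f"contents of {path}".encode()


cache = LocalReadThroughCache(fake_hdfs_read)
for path in ["hdfs://a", "hdfs://b", "hdfs://a", "hdfs://a"]:
    cache.read(path)
print(cache.hit_rate(), remote_reads)
```

Only the first read of each path reaches the remote store, which is why the hit rate dominates overall read cost.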
Main Challenges and Solutions
Challenge 1: Real‑time Partition Updates
Uber continuously inserts data into Hudi tables, causing partitions to change. Using only the partition ID as the cache key leads to stale data being served.
Solution: Add Hive modification time to cache key
Old cache key: hdfs://<path>
New cache key: hdfs://<path><mod time>
Presto obtains the latest modification time via the HDFS listDirectory API during split processing, ensuring that newly updated partitions are cached with the correct timestamp. A narrow race window may still allow a very recent update to be missed, but the impact is limited compared to serving stale data without a cache.
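A minimal sketch of why appending the modification time avoids stale reads. The `@` separator, the example path, and the helper names are assumptions made for illustration; the article only states that the modification time is appended to the path.

```python
from typing import Dict, Optional


def make_cache_key(path: str, mod_time_ms: int) -> str:
    # Assumed key layout: path plus modification time. The "@" separator
    # is illustrative and not taken from the article.
    return f"{path}@{mod_time_ms}"


cache: Dict[str, bytes] = {}


def lookup(path: str, mod_time_ms: int) -> Optional[bytes]:
    # Returns cached bytes only if they were stored for this exact version.
    return cache.get(make_cache_key(path, mod_time_ms))


# Hypothetical file path, cached when its modification time was 1000.
path = "hdfs://warehouse/trips/datestr=2022-01-01/part-0.parquet"
cache[make_cache_key(path, 1000)] = b"old rows"

same_version = lookup(path, 1000)  # file unchanged: cache hit
after_update = lookup(path, 2000)  # file rewritten: key differs, so a miss
print(same_version, after_update)
```

Because an updated file produces a different key, the old entry is simply never matched again and can be evicted later.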
Challenge 2: Cluster Node Changes
Presto’s soft‑affinity scheduler originally used a simple modulo hash (e.g., key % 3 nodes = worker#1). Adding or removing a node changes the modulus, so most keys map to a different worker, which breaks cache locality and reduces hit rates.
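To make the cost of modulo placement concrete, the sketch below counts how many keys change workers when a cluster grows from 3 to 4 nodes. The key names and the MD5-based `stable_hash` helper are assumptions for illustration, not Presto's actual hash function.

```python
import hashlib


def stable_hash(key: str) -> int:
    # Deterministic hash (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def modulo_worker(key: str, num_workers: int) -> int:
    return stable_hash(key) % num_workers


# Hypothetical file keys.
keys = [f"hdfs://bucket/file-{i}" for i in range(10_000)]

# Count the keys whose worker changes when a fourth node is added.
moved = sum(1 for k in keys if modulo_worker(k, 3) != modulo_worker(k, 4))
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.0%} of keys changed workers")
```

A key stays put only when its hash gives the same remainder modulo 3 and modulo 4, which holds for about a quarter of hash values, so roughly three quarters of the cached data ends up on the wrong worker after a single node is added.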
Solution: Consistent hashing based on node ID
All nodes are placed on a virtual ring; the relative order remains stable when nodes join or leave, so the same key maps to the same logical position regardless of cluster size. Replication is also used to improve robustness.
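The ring idea can be sketched as follows. This is a generic consistent-hashing illustration, not Presto's scheduler code: the virtual-node count, the MD5-based hash, and the `HashRing` class are all assumptions. Replication is modelled by walking clockwise and collecting the next distinct nodes.

```python
import bisect
import hashlib
from typing import List


def stable_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, node_ids: List[str], vnodes: int = 100) -> None:
        # Each node owns several points on the ring to even out the load.
        self._points = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in node_ids
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]
        self._node_count = len(set(node_ids))

    def nodes_for(self, key: str, replicas: int = 2) -> List[str]:
        """Walk clockwise from the key's position, collecting distinct nodes."""
        wanted = min(replicas, self._node_count)
        start = bisect.bisect_right(self._hashes, stable_hash(key))
        chosen: List[str] = []
        for offset in range(len(self._points)):
            if len(chosen) >= wanted:
                break
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in chosen:
                chosen.append(node)
        return chosen


keys = [f"hdfs://bucket/file-{i}" for i in range(10_000)]
ring3 = HashRing(["w1", "w2", "w3"])
ring4 = HashRing(["w1", "w2", "w3", "w4"])

# Keys whose primary worker changes after w4 joins.
moved = [k for k in keys if ring3.nodes_for(k)[0] != ring4.nodes_for(k)[0]]
moved_fraction = len(moved) / len(keys)
print(f"{moved_fraction:.0%} of keys changed primary worker")
```

Adding a node only inserts new points on the ring, so the only keys that move are those now owned by the new node; every other key keeps its previous worker and its cached data.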
Challenge 3: Cache Size Limits
Each worker has only 500 GB of local SSD, far less than the daily 50 PB data scan. Caching everything is impossible, and aggressive eviction harms performance.
Solution: Cache filter
A configurable filter decides which tables and how many partitions to cache. The static configuration is tuned based on access frequency, table importance, and mutability.
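A static filter of this kind might look like the sketch below. The table names, the rule format, and the reading of "how many partitions" as "the N most recent daily partitions" are assumptions for illustration; the article does not describe the actual configuration schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TableRule:
    # Number of most recent daily partitions to cache (assumed semantics).
    max_cached_partitions: int


# Hypothetical static configuration; tables not listed are never cached.
CACHE_RULES: Dict[str, TableRule] = {
    "rides.trips": TableRule(max_cached_partitions=7),
    "finance.payments": TableRule(max_cached_partitions=30),
}


def should_cache(table: str, partition_age_days: int) -> bool:
    """Return True if this table partition is allowed into the local cache."""
    rule = CACHE_RULES.get(table)
    if rule is None:
        return False
    return 0 <= partition_age_days < rule.max_cached_partitions


print(should_cache("rides.trips", 0))    # recent partition of a listed table
print(should_cache("rides.trips", 7))    # outside the configured window
print(should_cache("adhoc.scratch", 0))  # table not listed
```

Keeping rarely read or fast-changing data out of the cache leaves the limited SSD space for partitions that are likely to be read again.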
Applying the filter raised the cache‑hit rate from ~65% to >90%.
Metadata Optimizations
File‑level Metadata
File‑level metadata records each file’s last modification time and the cached byte‑range. The metadata is persisted on disk so it survives worker restarts.
High‑level Approach
When a file is updated, a new timestamp version is created; the system stores the new page in a new folder and eventually deletes the old timestamp folder.
Cache Data and Metadata Structures
Two timestamped directories (timestamp1, timestamp2) may coexist briefly during high‑concurrency periods. A protobuf‑encoded metadata file tracks the latest timestamp and partition ranges, enabling the local cache to read only the newest data.
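The versioned layout can be sketched with ordinary directories. This is an illustration only: the directory and file names are hypothetical, and JSON is swapped in for the protobuf metadata file so the example stays self-contained.

```python
import json
import shutil
import tempfile
from pathlib import Path

METADATA_FILE = "metadata.json"  # JSON stands in for the protobuf file


def write_version(cache_dir: Path, mod_time: int, page: str, data: bytes) -> None:
    """Store a page under <cache_dir>/<mod_time>/ and mark it as latest."""
    version_dir = cache_dir / str(mod_time)
    version_dir.mkdir(parents=True, exist_ok=True)
    (version_dir / page).write_bytes(data)
    meta = {"latest_mod_time": mod_time}
    (cache_dir / METADATA_FILE).write_text(json.dumps(meta))


def read_latest(cache_dir: Path, page: str) -> bytes:
    """Read a page from the version that the metadata marks as latest."""
    meta = json.loads((cache_dir / METADATA_FILE).read_text())
    return (cache_dir / str(meta["latest_mod_time"]) / page).read_bytes()


def drop_old_versions(cache_dir: Path) -> None:
    """Delete every timestamp folder except the latest one."""
    meta = json.loads((cache_dir / METADATA_FILE).read_text())
    latest = str(meta["latest_mod_time"])
    for child in cache_dir.iterdir():
        if child.is_dir() and child.name != latest:
            shutil.rmtree(child)


root = Path(tempfile.mkdtemp())
write_version(root, 1000, "page-0", b"old")
write_version(root, 2000, "page-0", b"new")
both_exist = (root / "1000").is_dir() and (root / "2000").is_dir()
latest_data = read_latest(root, "page-0")
drop_old_versions(root)
remaining = sorted(p.name for p in root.iterdir())
shutil.rmtree(root)
print(both_exist, latest_data, remaining)
```

Readers always resolve the version through the metadata file, so the old folder can be removed lazily without affecting correctness.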
Metadata‑aware Cache Context
Alluxio requires the compute engine to pass metadata. Uber added HiveFileContext on the Presto side for each data file. When openFile is called, Alluxio creates a PrestoCacheContext that stores the HiveFileContext, quota, cache identifier (MD5 of file path), and scope (database, schema, table, partition).
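The shape of such a context object can be sketched as below. The class and field names here are hypothetical and simplified; only the MD5-of-path identifier and the four scope levels come from the description above.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheScope:
    database: str
    schema: str
    table: str
    partition: str


@dataclass(frozen=True)
class CacheContextSketch:
    # Hypothetical, simplified stand-in for the per-file cache context.
    cache_identifier: str  # MD5 of the file path
    scope: CacheScope
    quota_bytes: int
    file_mod_time: int     # carried over from the file's metadata


def build_context(
    file_path: str, scope: CacheScope, quota_bytes: int, file_mod_time: int
) -> CacheContextSketch:
    identifier = hashlib.md5(file_path.encode("utf-8")).hexdigest()
    return CacheContextSketch(identifier, scope, quota_bytes, file_mod_time)


# Hypothetical scope and path values.
scope = CacheScope("hive", "rides", "trips", "datestr=2022-01-01")
ctx = build_context(
    "hdfs://warehouse/rides/trips/datestr=2022-01-01/part-0.parquet",
    scope,
    quota_bytes=10 * 1024**3,
    file_mod_time=1000,
)
print(ctx.cache_identifier, ctx.scope.table)
```

Carrying the scope alongside the identifier is what allows quotas and metrics to be attributed per table or per partition.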
During query execution, Presto aggregates metrics such as bytes read from cache versus bytes fetched from HDFS. These metrics flow back through HiveFileContext → CacheContext → RuntimeStats, making the information visible in Presto’s UI or JSON output.
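A rough sketch of the aggregation step follows. The metric names and the `(source, bytes)` record format are assumptions for illustration; Presto's actual RuntimeStats keys are not given in the article.

```python
from collections import Counter
from typing import Iterable, Tuple


def aggregate_read_stats(reads: Iterable[Tuple[str, int]]) -> Counter:
    """Sum bytes per source; each read is a (source, num_bytes) pair."""
    stats: Counter = Counter()
    for source, num_bytes in reads:
        stats[f"bytes_from_{source}"] += num_bytes
    return stats


def byte_hit_rate(stats: Counter) -> float:
    """Fraction of all bytes read that were served from the local cache."""
    total = stats["bytes_from_cache"] + stats["bytes_from_hdfs"]
    return stats["bytes_from_cache"] / total if total else 0.0


# Hypothetical per-read records collected while a query runs.
reads = [("cache", 900), ("hdfs", 100), ("cache", 1000)]
stats = aggregate_read_stats(reads)
print(dict(stats), byte_hit_rate(stats))
```

Measuring the hit rate in bytes rather than in requests reflects how much remote I/O the cache actually saved.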
Future Work
Automatic table‑caching via Alluxio Shadow Cache.
Improved handling of rapidly changing Hudi partitions.
Load‑balancing enhancements.
Addressing GC latency caused by longer CacheContext lifetimes.
Implementing a semantic cache based on the file‑level metadata.
Replacing protobuf with flatbuf for faster deserialization.
Conclusion
The Alluxio Local Cache solution has been running in production for over a quarter of Uber’s Presto workload, delivering significant interactive‑query performance gains with minimal maintenance overhead.
Past Memory Big Data
A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.
