Alluxio Local Cache for Presto on S3: Architecture, Implementation, and Performance Evaluation at NewsBreak
This article presents NewsBreak's practical deployment of Alluxio Local Cache with Presto on S3, detailing the system architecture, cache design considerations, implementation steps, performance metrics, and future optimization directions to reduce query latency and storage costs.
Introduction – This talk covers the practice of Alluxio Local Cache with Presto on S3: Alluxio acts as a caching layer between Presto and S3 that restores data locality, accelerates queries, and reduces S3 request volume and cost.
NewsBreak Architecture – NewsBreak’s data platform ingests multiple data sources into a Data Engine Pipeline (DIP), with schema management, open table formats (Hudi, Iceberg), and transactional databases (Mongo, Scylla, MySQL). Managed ETL jobs orchestrated by Airflow load data into a Data & Service layer, split into a raw data lake and an ETL‑oriented warehouse. Presto serves as the query engine for ad‑hoc and BI analysis, while Snowflake handles sensitive data.
Presto at NewsBreak – Presto runs on a compute–storage‑separated model, reaching data sources through connectors (Scylla, Mongo, Iceberg, Hive, MySQL, Hudi). An event‑stream plugin sends query events to Kafka, and these events are persisted via Hudi for downstream analysis.
Cache Considerations – Key requirements include supporting Presto on S3 (available from Alluxio 2.2.9.3), minimizing impact on existing systems, accelerating queries while lowering S3 request costs, supporting multiple Hive metastore catalogs, applying cache filters to limit cache size (~1% of total storage), handling small files, and providing detailed monitoring.
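For the Presto-on-S3 requirement, enabling the Alluxio local cache on a PrestoDB Hive catalog typically looks like the following sketch (the property names come from PrestoDB's Hive connector caching support; the paths and sizes are placeholder assumptions, not NewsBreak's actual settings):

```properties
# etc/catalog/hive.properties — illustrative values only
hive.node-selection-strategy=SOFT_AFFINITY   # route splits for the same file to the same worker so its cache is reused
cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=file:///mnt/nvme/alluxio-cache   # local SSD/NVMe path (assumption)
cache.alluxio.max-cache-size=500GB                    # per-worker cache cap (assumption)
```

The SOFT_AFFINITY node-selection strategy is what makes a per-worker cache effective: without it, splits for the same data can land on different workers and the cache hit rate collapses.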
Implementation Details – The team adapted Presto 0.275 and Alluxio release 0.292 to add S3 support, refined cache filters down to the catalog level, and introduced a shadow cache to measure the achievable hit rate without serving data. They also leveraged Uber’s HDFS‑API‑style cache‑hit logic and added per‑catalog configuration.
ALC4PS3 Evaluation – Real‑world tests of ALC4PS3 (Alluxio Local Cache for Presto on S3) showed that Alluxio reduced daily S3 request volume to under 10 million, down from peak bursts of 900 million. Cache hit rates started at 70–80%; after cache filters were applied they settled around 20–30%, while overall storage coverage remained at 70–80%.
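To put the request-volume numbers above in cost terms, here is a back-of-envelope estimate. The pricing constant is an assumption (the S3 Standard GET list price of $0.0004 per 1,000 requests at the time of writing), and a peak day is extrapolated to a full month purely for illustration:

```python
# Rough S3 GET cost comparison based on the request volumes reported above.
# Assumption: S3 Standard GET list price of $0.0004 per 1,000 requests.
GET_PRICE_PER_1000 = 0.0004

def monthly_get_cost(requests_per_day: float, days: int = 30) -> float:
    """Approximate monthly S3 GET request cost in USD."""
    return requests_per_day * days / 1000 * GET_PRICE_PER_1000

before = monthly_get_cost(900_000_000)  # peak-day volume without the cache
after = monthly_get_cost(10_000_000)    # post-cache daily volume
print(f"before: ${before:,.0f}/mo, after: ${after:,.0f}/mo")
```

Even under these simplified assumptions, the request-cost side of the bill shrinks by roughly two orders of magnitude, which is consistent with the talk's emphasis on S3 cost reduction.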
Multi‑Hive Metastore Support – By treating Alluxio as a singleton manager and allowing per‑catalog cache filter configurations, the solution supports multiple Hive metastore instances (Glue, Iceberg, etc.) without duplicating cached data.
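Because each Presto catalog is configured through its own properties file, per-catalog cache filtering falls out naturally: each catalog can enable or disable the shared local cache independently. A sketch, assuming two Hive-metastore-backed catalogs (file names and paths are illustrative):

```properties
# etc/catalog/hive_glue.properties (illustrative)
cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=file:///mnt/nvme/alluxio-cache

# etc/catalog/hive_iceberg.properties (illustrative)
cache.enabled=false   # this catalog is filtered out of the cache entirely
```

With Alluxio managed as a singleton on each worker, catalogs that do enable caching share one cache directory rather than duplicating data per catalog.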
Presto Event Stream Monitoring – A block‑level monitor captures query‑level metrics (hit rate, storage coverage, cache vs. remote reads). Metrics show a reduction of P95 query latency from 9 s to 8 s across two clusters totaling 1,600 cores, which process ~6 PB of data per month, half of it read from Alluxio.
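The per-query arithmetic behind these metrics is simple; a minimal sketch follows. The counter names are assumptions for illustration, not Presto's actual event schema:

```python
from dataclasses import dataclass

@dataclass
class QueryCacheStats:
    # Hypothetical per-query counters aggregated by the block-level monitor.
    cache_read_bytes: int   # bytes served from the Alluxio local cache
    remote_read_bytes: int  # bytes read directly from S3

    @property
    def hit_rate(self) -> float:
        """Fraction of bytes served from the local cache for this query."""
        total = self.cache_read_bytes + self.remote_read_bytes
        return self.cache_read_bytes / total if total else 0.0

# Example: half of the bytes come from the cache, matching the ~50% figure above.
stats = QueryCacheStats(cache_read_bytes=3_000_000, remote_read_bytes=3_000_000)
print(f"hit rate: {stats.hit_rate:.0%}")
```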
Future Work (ALC4PS4) – Although the performance gains so far are modest, the goal remains to improve query speed and cut costs further. Challenges include low SQL hit rates (~30%), large numbers of small files, and non‑columnar formats. Planned actions involve data governance (columnar storage, compression, small‑file handling) and extending the cache mechanism to Flink, Spark, Hudi, and Iceberg.
Q&A – The event‑stream plugin can be built as a Presto listener that publishes query events to Kafka, which are then persisted via Hudi for metric collection. Detailed monitoring currently operates at the query level, providing end‑to‑end performance and cache hit statistics.
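The actual plugin is a listener inside Presto that publishes to Kafka; a minimal sketch of the event payload it might emit is below. All field names, the topic name, and the broker address are illustrative assumptions, and the commented-out publishing step assumes the kafka-python client:

```python
import json

def query_event_payload(query_id: str, state: str, elapsed_ms: int,
                        cache_read_bytes: int, remote_read_bytes: int) -> bytes:
    """Serialize one query-completion event; field names are illustrative assumptions."""
    return json.dumps({
        "queryId": query_id,
        "state": state,
        "elapsedMs": elapsed_ms,
        "cacheReadBytes": cache_read_bytes,
        "remoteReadBytes": remote_read_bytes,
    }).encode("utf-8")

# Publishing side (requires kafka-python and a running broker; shown for shape only):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="broker:9092")
# producer.send("presto.query.events",
#               query_event_payload("q-1", "FINISHED", 812, 1024, 512))
```

Downstream, these events are landed into Hudi tables, which is what enables the query-level hit-rate and coverage statistics described above.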
Overall, the integration of Alluxio Local Cache with Presto on S3 demonstrates measurable latency reductions and cost savings, while highlighting areas for further optimization in large‑scale data platforms.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.