Multi‑Cloud Cache Evolution at Zhihu: From Multi‑HDFS to UnionStore to Alluxio

This technical presentation details Zhihu's journey in multi‑cloud caching, covering the motivations for a multi‑cloud architecture, the design and limitations of the self‑built UnionStore component, and the adoption of Alluxio to achieve significant performance, stability, and cost improvements across model serving and training workloads.

DataFunTalk

Jun 25, 2023

Zhihu adopts a multi‑cloud architecture to improve reliability, capacity and cost, which brings cache design to the forefront of system performance.

The talk outlines four parts: the background of multi‑cloud caching, the self‑developed UnionStore component, its advantages and shortcomings, and the migration to the open‑source Alluxio solution.

UnionStore unifies HDFS and object storage behind an object‑storage‑compatible interface. It provides automatic caching, a consistent file view and lower storage cost, but it is complex to maintain, depends on HDFS for metadata and has performance problems.
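The automatic caching described above can be pictured as a read-through cache: a miss is served from the remote store and kept locally, and later reads of the same key never leave the local side. The sketch below is only an illustration of that general pattern under assumed names (DictBackend, ReadThroughCache); it is not UnionStore's actual implementation, which the talk summary does not detail.

```python
from typing import Dict, Protocol


class Backend(Protocol):
    """Anything that can return the bytes stored under a key."""

    def get(self, key: str) -> bytes: ...


class DictBackend:
    """In-memory stand-in for a remote store (hypothetical, for illustration)."""

    def __init__(self, data: Dict[str, bytes]) -> None:
        self.data = data
        self.reads = 0  # counts how often the "remote" side is hit

    def get(self, key: str) -> bytes:
        self.reads += 1
        return self.data[key]


class ReadThroughCache:
    """Serve reads from a local cache, filling it from the source on a miss."""

    def __init__(self, source: Backend) -> None:
        self.source = source
        self.cache: Dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        if key not in self.cache:
            # Cache miss: fetch once from the source and keep a local copy.
            self.cache[key] = self.source.get(key)
        return self.cache[key]


remote = DictBackend({"models/a.bin": b"weights"})
store = ReadThroughCache(remote)
first = store.get("models/a.bin")   # miss, goes to the remote store
second = store.get("models/a.bin")  # hit, served locally
```

A real system would also need eviction and invalidation when the source changes; both are left out here to keep the pattern visible.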

Alluxio was selected for its transparent high‑performance caching, rich access interfaces (S3 proxy and FUSE), and support for both HDFS and object storage. Detailed performance tests show that Alluxio’s hot‑read speed can be several times faster than UnionStore’s.

For model‑reading acceleration, the Alluxio S3 proxy is deployed on bare metal with short‑circuit reads, real‑time pre‑heating and metadata caching, which speeds up reads by up to several dozen times while protecting network bandwidth.
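Metadata caching, one of the techniques listed above, can be sketched as a small time-to-live cache placed in front of the metadata lookup. The class name, the fetch callback and the TTL value below are assumptions made for illustration and are not Alluxio's API or configuration.

```python
import time
from typing import Callable, Dict, Tuple


class TTLMetadataCache:
    """Cache file metadata for a fixed time to avoid repeated remote lookups."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical remote metadata lookup
        self._ttl = ttl_seconds      # how long an entry stays valid
        self._clock = clock          # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def stat(self, path: str) -> dict:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no remote call
        meta = self._fetch(path)     # missing or expired: refresh
        self._entries[path] = (now, meta)
        return meta


lookups = []


def fake_fetch(path: str) -> dict:
    """Stand-in for a metadata request to the underlying store."""
    lookups.append(path)
    return {"path": path, "size": 10}


fake_time = [0.0]
meta_cache = TTLMetadataCache(fake_fetch, ttl_seconds=5.0, clock=lambda: fake_time[0])
meta_cache.stat("/models/a.bin")   # remote lookup
meta_cache.stat("/models/a.bin")   # served from cache
lookups_while_fresh = len(lookups)
fake_time[0] = 6.0                 # advance past the TTL
meta_cache.stat("/models/a.bin")   # entry expired, looked up again
```

The trade-off is staleness: a longer TTL saves more lookups but delays the moment a newly published model becomes visible.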

For model‑training acceleration, Alluxio FUSE is mounted on GPU nodes to take advantage of their abundant local resources; tuning includes larger cache pages, the kernel metadata cache and DaemonSet deployment, which together cut training time by about 60%.
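A reduction in training time translates to a speedup factor through speedup = 1 / (1 - reduction). The helper below is a small worked calculation using the approximate 60% figure from the talk; the function name is an assumption for illustration.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0.6 for 60%) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    # If the new time is (1 - reduction) of the old time,
    # throughput improves by the reciprocal of that fraction.
    return 1.0 / (1.0 - reduction)


training_speedup = speedup_from_time_reduction(0.6)
print(f"A 60% shorter run is roughly {training_speedup:.1f}x faster")
```

Under this conversion, a run that takes about 60% less time is about 2.5 times faster.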

Additional use cases such as large‑scale data‑component deployment benefit from Alluxio’s fast object‑storage caching, which raises download speeds from tens of MB/s to hundreds of MB/s.

The overall outcome includes 2‑5× performance improvement, higher stability by removing HDFS dependency, and roughly 50% cost savings, with future plans to apply Alluxio to data orchestration and OLAP acceleration.


