
Multi‑Cloud Cache Evolution at Zhihu: From Multi‑HDFS to UnionStore to Alluxio

This technical presentation details Zhihu's journey in multi‑cloud caching, covering the motivations for a multi‑cloud architecture, the design and limitations of the self‑built UnionStore component, and the adoption of Alluxio to achieve significant performance, stability, and cost improvements across model serving and training workloads.

DataFunTalk

Zhihu adopts a multi‑cloud architecture to improve reliability, capacity and cost, which brings cache design to the forefront of system performance.

The talk outlines four parts: the background of multi‑cloud caching, the self‑developed UnionStore component, its advantages and shortcomings, and the migration to the open‑source Alluxio solution.

UnionStore unifies HDFS and object storage behind an object‑storage‑compatible interface, providing automatic caching, a consistent file view, and reduced storage cost. Its shortcomings are maintenance complexity, a metadata dependency on HDFS, and performance issues.
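
The automatic caching described above can be illustrated with a generic read‑through cache. This is a minimal sketch and not UnionStore's actual code: it assumes the object store acts as the cache tier and HDFS as the source of truth, which the summary does not spell out, and the names `ReadThroughCache`, `read_hdfs`, and the sample path are hypothetical.

```python
from typing import Callable, Dict, List


class ReadThroughCache:
    """Toy read-through cache: serve from the cache tier, else load from the backing store."""

    def __init__(self, load_from_backing: Callable[[str], bytes]) -> None:
        # load_from_backing stands in for a read against HDFS (hypothetical callback).
        self._load = load_from_backing
        # A plain dict stands in for the object-storage cache tier (assumption).
        self._cache: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: read the source of truth once, then keep a copy.
        self.misses += 1
        data = self._load(key)
        self._cache[key] = data
        return data


# Illustrative data only; the path and contents are made up.
hdfs_files = {"/models/ranker/v1": b"weights"}
backing_reads: List[str] = []


def read_hdfs(path: str) -> bytes:
    backing_reads.append(path)  # record every trip to the backing store
    return hdfs_files[path]


store = ReadThroughCache(read_hdfs)
first = store.get("/models/ranker/v1")   # miss: goes to the backing store
second = store.get("/models/ranker/v1")  # hit: served from the cache tier
```

Only the first read reaches the backing store, which is why hot reads are cheap. The sketch also shows the dependency the talk lists as a shortcoming: every miss still needs the backing store to be available.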

Alluxio was selected for its transparent high‑performance caching, rich access interfaces (S3 proxy and Fuse), and support for both HDFS and object storage. Detailed performance tests show Alluxio’s hot‑read speed can be several times faster than UnionStore.

For model‑reading acceleration, the Alluxio S3 proxy is deployed on bare metal with short‑circuit reads, real‑time pre‑heating, and metadata caching, achieving speedups of up to tens of times while protecting network bandwidth.
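
To make the metadata‑caching idea concrete, here is a minimal sketch of a time‑to‑live (TTL) cache in front of a metadata lookup. It is a generic illustration, not Alluxio's implementation; `MetadataCache`, `fetch_from_store`, the 30‑second TTL, and the sample path are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Tuple


class MetadataCache:
    """Cache file metadata for a fixed time-to-live (TTL)."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical lookup against the backing store
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the example is deterministic
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def stat(self, path: str) -> dict:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: no remote call
        meta = self._fetch(path)  # missing or expired: refresh from the store
        self._entries[path] = (now, meta)
        return meta


fetch_calls: List[str] = []


def fetch_from_store(path: str) -> dict:
    fetch_calls.append(path)
    return {"path": path, "size": 1024}  # made-up metadata


fake_now = [0.0]  # manually advanced clock, in seconds
cache = MetadataCache(fetch_from_store, ttl_seconds=30.0, clock=lambda: fake_now[0])

cache.stat("/models/ranker/v1")  # t=0: miss, fetches
fake_now[0] = 10.0
cache.stat("/models/ranker/v1")  # t=10: within TTL, served from cache
fake_now[0] = 45.0
cache.stat("/models/ranker/v1")  # t=45: entry expired, fetches again
```

Three lookups produce two remote fetches. A longer TTL cuts more metadata traffic but lets readers see stale entries for longer, so the right value depends on how often model files change.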

For model‑training acceleration, Alluxio Fuse is mounted on GPU nodes to make use of their abundant local resources. Tuning includes larger cache pages, the kernel metadata cache, and DaemonSet deployment, resulting in a ~60% reduction in training time.
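
One way to see why larger cache pages can help sequential training reads is to count how many pages a file spans. The arithmetic below is a rough sketch: the 1 GiB file and the 1 MiB and 16 MiB page sizes are illustrative assumptions, not values reported in the talk, and `pages_needed` is a hypothetical helper.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def pages_needed(file_size: int, page_size: int) -> int:
    """Number of fixed-size cache pages needed to cover a file (ceiling division)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return -(-file_size // page_size)


file_size = 1 * GIB                               # illustrative file size
small_pages = pages_needed(file_size, 1 * MIB)    # 1024 pages
large_pages = pages_needed(file_size, 16 * MIB)   # 64 pages
```

With fewer, larger pages, a sequential scan issues fewer page fetches and pays less per‑page bookkeeping. The trade‑off is that a small random read may pull in a whole large page, so this tuning suits workloads dominated by large sequential reads.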

Additional use cases such as large‑scale data‑component deployment benefit from Alluxio’s fast object‑storage caching, which raises download speeds from tens of MB/s to hundreds of MB/s.

The overall outcome includes a 2‑5× performance improvement, higher stability from removing the HDFS dependency, and roughly 50% cost savings. Future plans include applying Alluxio to data orchestration and OLAP acceleration.
