Alluxio Deployment at Ant Group: Stability Building, Performance Optimization, and Scale‑up for Large‑Scale Model Training

This article summarizes how Ant Group introduced Alluxio to address storage I/O, capacity, and latency challenges in large‑scale model training, detailing stability improvements through worker‑register follower and master migration, performance gains via follower‑only reads, and horizontal scaling using metadata sharding and multi‑cluster deployment.

DataFunTalk

Feb 15, 2023

Ant Group adopted Alluxio to overcome three core challenges in large‑scale GPU model training: storage I/O bottlenecks, single‑node capacity limits, and network latency. The goal was a solution that combines high throughput, high concurrency, and low latency.

Stability building focuses on two areas: worker‑register follower and master migration. With worker‑register follower, each worker registers with all masters, not only the primary, and keeps a heartbeat with the primary. A standby that takes over as primary therefore already knows the workers, which reduces fail‑over (FO) time to under 30 seconds and minimizes user‑visible errors. Master migration is handled through the same primary‑worker heartbeat: the primary returns the current master set, so workers pick up membership changes dynamically.
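
The registration and migration flow can be illustrated with a small in-memory simulation. This is only a sketch of the idea, not Alluxio's implementation: the `Master`, `Cluster`, and `Worker` classes and their methods are hypothetical names invented for this example, and the heartbeat is modeled as a plain function call.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical master node that remembers which workers registered with it."""
    name: str
    workers: set = field(default_factory=set)

    def register(self, worker_id: str) -> None:
        self.workers.add(worker_id)


@dataclass
class Cluster:
    """Hypothetical view of the master group; `primary` names the current leader."""
    masters: dict
    primary: str

    def heartbeat(self, worker_id: str) -> set:
        # Assumption: the primary answers each heartbeat with the current master set.
        return set(self.masters)


@dataclass
class Worker:
    worker_id: str
    known_masters: set = field(default_factory=set)

    def sync(self, cluster: Cluster) -> None:
        # Learn the current master set, then register with every master
        # this worker has not registered with yet (primary and standbys alike).
        current = cluster.heartbeat(self.worker_id)
        for name in current - self.known_masters:
            cluster.masters[name].register(self.worker_id)
        self.known_masters = current


cluster = Cluster({n: Master(n) for n in ("m1", "m2", "m3")}, primary="m1")
worker = Worker("w1")
worker.sync(cluster)

# Fail-over: m2 becomes primary and already holds w1's registration.
cluster.primary = "m2"
failover_ready = "w1" in cluster.masters[cluster.primary].workers

# Master migration: m3 is replaced by m4; the next heartbeat tells the worker.
del cluster.masters["m3"]
cluster.masters["m4"] = Master("m4")
worker.sync(cluster)
```

In this toy model the new primary needs no re-registration round after fail-over, and the worker discovers the migrated master `m4` on its next heartbeat without a restart or config change.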

Performance optimization introduces a follower read‑only mode. After the initial metadata warm‑up, standby masters serve read‑only requests without affecting Raft journal entries. This yields a threefold throughput improvement and puts otherwise idle standby resources to work for read‑heavy workloads.
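
A minimal sketch of the routing idea behind follower reads follows. It assumes a client that sends writes to the primary and spreads read‑only metadata requests round-robin over every master; `MasterNode`, `Client`, `create_file`, and `get_status` are hypothetical names for this illustration, and journal replication to standbys is faked by writing to each node directly.

```python
import itertools


class MasterNode:
    """Hypothetical master holding an in-memory copy of file metadata."""

    def __init__(self, name):
        self.name = name
        self.metadata = {}
        self.reads_served = 0

    def get_status(self, path):
        # Read-only: looks up metadata and changes no replicated state.
        self.reads_served += 1
        return self.metadata.get(path)


class Client:
    """Sends writes to the primary and spreads reads over all masters."""

    def __init__(self, primary, standbys):
        self.primary = primary
        self.all_masters = [primary, *standbys]
        self._next_reader = itertools.cycle(self.all_masters)

    def create_file(self, path, size):
        # Stand-in for a journaled write: a real system would commit on the
        # primary and let standbys replay the journal; here we copy directly.
        for master in self.all_masters:
            master.metadata[path] = {"size": size}

    def get_status(self, path):
        # Round-robin over primary and standbys for read-only requests.
        return next(self._next_reader).get_status(path)


primary = MasterNode("primary")
standbys = [MasterNode("standby-1"), MasterNode("standby-2")]
client = Client(primary, standbys)

client.create_file("/train/part-0", 1024)
results = [client.get_status("/train/part-0") for _ in range(300)]
reads_per_master = {m.name: m.reads_served for m in client.all_masters}
```

With three masters, each node in this toy setup ends up serving a third of the reads, which shows how standby capacity can be added to the read path. It does not model consistency concerns such as a standby lagging behind the primary.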

Scale‑up is achieved through horizontal expansion and metadata sharding. By partitioning metadata across multiple clusters and routing client requests via a proxy that hashes keys to specific shards, the system can support billions of files, alleviate memory pressure on block and file masters, and increase overall QPS and throughput.
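
The proxy routing described above can be sketched as follows. The source does not say which key is hashed or which hash function is used, so this example assumes the full file path as the key and SHA-256 as a stable hash; `shard_for` and `ShardingProxy` are hypothetical names, and each shard is modeled as a plain dictionary standing in for one cluster's metadata service.

```python
import hashlib


def shard_for(path: str, num_shards: int) -> int:
    """Map a path to a shard index with a hash that is stable across processes."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardingProxy:
    """Hypothetical proxy that forwards each request to the shard owning the path."""

    def __init__(self, num_shards: int):
        # Each dict stands in for the metadata store of one cluster.
        self.shards = [dict() for _ in range(num_shards)]

    def _owner(self, path: str) -> dict:
        return self.shards[shard_for(path, len(self.shards))]

    def put(self, path: str, meta: dict) -> None:
        self._owner(path)[path] = meta

    def get(self, path: str):
        return self._owner(path).get(path)


proxy = ShardingProxy(num_shards=4)
for i in range(1000):
    proxy.put(f"/dataset/sample-{i}.bin", {"size": i})

shard_sizes = [len(shard) for shard in proxy.shards]
```

Because every shard holds only part of the namespace, per-master memory use and request load shrink as shards are added. A plain modulo scheme like this one remaps most keys when the shard count changes, so a real deployment would need a resharding or consistent-hashing strategy that this sketch leaves out.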

Together, the stability, performance, and scaling optimizations enable Ant Group to support ever‑growing model training workloads with shorter fail‑over times, higher throughput, and the capacity to handle massive data volumes.
