How TAI Platform Optimizes Large‑Model Scheduling and Fault Recovery on Kubernetes
This article explains how the TAI platform leverages Kubernetes and Volcano to tackle fault, efficiency, and usability challenges in large‑model training and inference, detailing custom resources, automated fault detection, and advanced scheduling strategies that boost resource utilization and performance.
Since 2023, large models have entered a phase of explosive growth, drawing attention for their strong understanding and reasoning abilities; at the same time, their training and inference pose significant technical challenges.
The TAI platform on Zhihui Cloud supports the full lifecycle of internal large‑model development, training, inference, and data processing, handling over 6,000 jobs per month from more than 3,000 departments across ten clusters and 1,000+ nodes, utilizing CPU, memory, GPU, and NPU resources.
Key challenges include faults, efficiency, and usability.
Usability improvements are achieved by integrating Volcano custom resources (vcjob) to extend native Kubernetes capabilities, enabling multi‑role jobs, better support for PyTorch and MPI, password‑free SSH login, and efficient inter‑instance communication, while leveraging Volcano's native queuing, priority, and preemption features.
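A minimal multi‑role vcjob of this shape could look like the following sketch. The job name, image, replica counts, and resource sizes are illustrative assumptions, not the platform's actual configuration; the `ssh` and `svc` plugins and `minAvailable` gang scheduling are standard Volcano features:

```yaml
# Hypothetical example of a multi-role Volcano job (vcjob).
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-demo
spec:
  schedulerName: volcano
  minAvailable: 3        # gang scheduling: start all 3 pods together or none
  queue: default
  plugins:
    ssh: []              # injects SSH keys for password-free login between pods
    svc: []              # headless service for stable inter-pod DNS names
  tasks:
    - name: master
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: master
              image: pytorch/pytorch:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
    - name: worker
      replicas: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: pytorch/pytorch:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
```

The master/worker split is what "multi‑role jobs" refers to: each task has its own replica count, image, and resources, while the job is scheduled and managed as one unit.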
Fault handling is critical; hardware failures such as GPU, NIC, power, storage, and cooling issues cause frequent interruptions. The platform developed Qihoo‑SMI, a fault detection and self‑healing tool that works with Volcano to automatically detect, classify, and remediate faults. It can detect 90% of hardware faults and automatically fix 15% of them, blacklisting faulty nodes and notifying operators as needed.
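Blacklisting a node in Kubernetes is typically expressed as a taint, which stops the scheduler from placing new pods there. A sketch of what such a marker could look like; the taint key and value are hypothetical, since Qihoo‑SMI's actual labels are internal:

```yaml
# Hypothetical sketch: tainting a faulty node so no new workloads land on it.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-017          # illustrative node name
spec:
  taints:
    - key: fault.example.com/unhealthy   # hypothetical key; real markers are internal
      value: gpu-xid-error               # hypothetical fault classification
      effect: NoSchedule                 # scheduler skips this node
```

Pairing an automated detector with a taint like this is a common pattern: detection classifies the fault, the taint removes the node from scheduling, and an operator is notified for faults that cannot be self‑healed.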
When Qihoo‑SMI blacklists a node, Volcano automatically retries the job, rescheduling instances to healthy nodes without user intervention, and resumes training from the latest checkpoint.
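Volcano's job‑level lifecycle policies are the native mechanism behind this kind of automatic retry. A hedged sketch of such a fragment — the event names and actions are real Volcano concepts, but the retry count is an illustrative assumption, and resuming from the latest checkpoint is the training code's responsibility, not Volcano's:

```yaml
# Fragment of a vcjob spec: restart the whole job when a pod is evicted
# (e.g. its node was blacklisted) or fails.
spec:
  maxRetry: 3                # illustrative: give up after three automatic restarts
  policies:
    - event: PodEvicted
      action: RestartJob     # reschedule all instances onto healthy nodes
    - event: PodFailed
      action: RestartJob
```

Because the whole job restarts as a gang, the training framework can reload the latest checkpoint on startup and continue without user intervention.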
To boost efficiency, the platform addresses bottlenecks in job creation, scheduling, and cleanup: it reduces pressure on the Kubernetes API server and etcd, improves scheduler throughput when placing many instances at once, accelerates image pulling, and hardens long‑running, high‑load stability, achieving Model FLOPs Utilization (MFU) above 40% and resource fragmentation as low as 1%.
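MFU (Model FLOPs Utilization) here follows the standard definition: the ratio of the floating‑point work the training job actually sustains to the hardware's theoretical peak,

    MFU = (model FLOPs actually achieved per second) / (theoretical peak FLOPs per second of the accelerators)

so an MFU above 40% means the cluster's GPUs/NPUs spend well under 60% of their cycles idle or on overhead during training.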
By extending Volcano with custom strategies such as delayed scheduling, the platform ensures that high‑priority jobs obtain resources promptly: when a lower‑priority job releases only part of its resources, those resources are held for the pending high‑priority job rather than handed to whichever job happens to fit first, preserving priority‑based queuing.
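The native queuing that these custom strategies extend is expressed through Volcano Queue objects. A hedged sketch — the queue names and weights are illustrative — showing two queues with different scheduling weights, where only the low‑priority queue's resources can be reclaimed:

```yaml
# Hypothetical sketch: two Volcano queues with different shares of the cluster.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: high-priority
spec:
  weight: 8              # larger share of cluster resources
  reclaimable: false     # its resources cannot be reclaimed by other queues
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: low-priority
spec:
  weight: 1
  reclaimable: true      # can be preempted when high-priority work is pending
```

A vcjob joins a queue via its `spec.queue` field; the proprietary delayed‑scheduling strategy then decides when freed resources are actually handed out within this priority structure.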
Overall, the TAI platform combines Volcano's native capabilities with proprietary enhancements to deliver fault‑tolerant, efficient, and user‑friendly large‑model infrastructure.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.