How Alibaba Cloud’s PAI Tackles Large‑Model Training and Inference Challenges in 2024

At the 2024 Yunqi (Apsara) Conference, Alibaba Cloud’s AI Infra experts described the current challenges of large‑model deployment, including hardware costs, resource management, and software‑hardware coordination. They also introduced PAI’s new capabilities: stability tools, automated distributed training, reinforcement‑learning frameworks, inference optimizations, and integrated big‑data AI solutions.

Alibaba Cloud Big Data AI Platform

Sep 26, 2024

AI Infra Engineering Trend Interpretation

This article is compiled from a transcript of the 2024 Yunqi Conference. The speakers were Lin Wei, researcher at Alibaba Cloud Intelligence Group and head of the AI Platform PAI, and Huang Boyuan, senior product expert and PAI product lead. The session focused on core AI Infra technologies and the annual PAI product release.

2024 has seen explosive growth in large‑model development and GenAI applications, bringing challenges such as high hardware costs, resource management, and software‑hardware coordination. Alibaba Cloud’s AI Platform PAI continues to innovate to address these issues.

1. Large‑scale training stability

Ultra‑large training jobs fail frequently, and some failure types are hard to spot, such as gray failures that slow a job down without interrupting it. As model size grows, the cost of recovering from a fault grows with it. PAI addresses stability with AIMaster, which uses network diagnostics to detect and avoid cluster issues, and with the EasyCkpt tool, which provides minute‑level asynchronous checkpointing and on‑demand snapshot delivery.
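The idea behind asynchronous checkpointing can be illustrated with a minimal sketch. This is not EasyCkpt's API; the `AsyncCheckpointer` class and its method names are hypothetical, and the sketch uses only the Python standard library. The assumption is that the training loop blocks only for an in-memory copy of the state, while the slow write to storage happens in a background thread.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot state in memory, persist it in the background."""

    def __init__(self, directory):
        self.directory = directory
        self._writer = None  # background thread for the in-flight save, if any

    def save(self, step, state):
        # Allow only one in-flight write so saves cannot pile up.
        self.wait()
        # Blocking part: a deep copy, so training may keep mutating `state`.
        snapshot = copy.deepcopy(state)
        self._writer = threading.Thread(target=self._write, args=(step, snapshot))
        self._writer.start()

    def _write(self, step, snapshot):
        final_path = os.path.join(self.directory, f"step_{step}.pkl")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename only after the write completes, so the final name
        # never points at a half-written file.
        os.replace(tmp_path, final_path)

    def wait(self):
        # Block until the in-flight save (if any) has reached disk.
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def load(self, step):
        with open(os.path.join(self.directory, f"step_{step}.pkl"), "rb") as f:
            return pickle.load(f)
```

A training loop would call `save(step, state)` every few minutes and `wait()` once before exiting. A production tool would additionally have to handle sharded state across ranks and device-to-host copies, which this sketch leaves out.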

2. Automatic distributed training for large clusters

PAI released tools to simplify distributed training. For Transformer models, Pai‑Megatron‑Patch (based on Megatron‑LM) offers easy model format conversion and end‑to‑end examples covering pre‑training, fine‑tuning, evaluation, inference, and reinforcement learning. For broader model structures, PAI’s in‑house TorchAcc engine, built on PyTorch/XLA and presented at the 2023 OpenXLA summit, applies operator fusion, communication and memory optimizations, and automatic distributed parallelism. TorchAcc is being integrated into the ModelScope community.

3. Reinforcement learning

RLHF was key to ChatGPT’s breakthrough, but it requires training an additional reward model, which increases storage and compute demands. PAI built the alignment training framework ChatLearn, which efficiently supports SFT, RM, RLHF, DPO, Online DPO, GRPO, and other alignment methods, and has been open‑sourced since August 2024.
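Of the alignment methods listed, DPO is the easiest to show in a few lines, because it optimizes directly on preference pairs against a frozen reference model instead of training a separate reward model. The sketch below computes the standard DPO loss for a single pair; it is not taken from ChatLearn, and the function name, argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy equals the reference model the loss is log 2; it drops below that as the policy shifts probability toward the preferred response relative to the reference.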

4. Inference service optimization

To make large‑model services affordable, PAI introduced PAI‑BladeLLM, which combines engineering‑level and model‑level optimizations such as automatic mixed‑precision quantization, layer‑wise precision selection, and dynamic computation mode selection to balance accuracy against speed. The scheduling engine Llumnix, presented at OSDI 2024, provides dynamic scheduling for LLM serving.
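The talk does not describe how BladeLLM selects precision per layer, so the following is only a generic sketch of the idea. It assumes weight reconstruction error as a stand-in for accuracy loss, symmetric uniform quantization, and a fallback of 16 bits for layers that no candidate satisfies. All function names, the candidate bit widths, and the tolerance are hypothetical.

```python
def quantize_dequantize(values, bits):
    """Symmetric uniform quantization to `bits` bits, then back to floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    # Round to the nearest integer level, clamp, and map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def choose_bits_per_layer(layers, candidate_bits=(4, 8), tolerance=1e-4):
    """Pick, per layer, the smallest bit width whose error stays within tolerance.

    Layers that no candidate satisfies stay at 16 bits (treated as unquantized).
    """
    plan = {}
    for name, weights in layers.items():
        plan[name] = 16
        for bits in sorted(candidate_bits):
            restored = quantize_dequantize(weights, bits)
            if mean_squared_error(weights, restored) <= tolerance:
                plan[name] = bits
                break
    return plan
```

Layers whose weights sit on a coarse grid tolerate 4 bits, while a layer with one large outlier forces a wide scale and needs more bits. A real system would measure error on model outputs with calibration data, not on raw weights.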

5. Tight integration of big‑data and AI platforms

Effective model applications require robust data pipelines. PAI draws on the big‑data platform’s capabilities, including data cleaning, quality assessment, real‑time updates, and lineage, to support RAG scenarios and ensure high‑quality data for large‑model deployment.

6. Enterprise‑grade capabilities

PAI provides data compliance and security features across training, fine‑tuning, and inference, and collaborates with Alibaba Cloud’s foundational software team and the OpenAnolis (Longxi) community to deliver a Confidential AI solution for secure computation.

Lin Wei noted that PAI has evolved over nearly a decade, accumulating core technologies in scheduling, compilation, distributed runtime, and scenario applications.

PAI Prime offers a comprehensive engineering stack covering AI Infra and application scenarios, aiming to improve training and inference speed, stability, and usability.

Artificial Intelligence Platform PAI Product Annual Release

Huang Boyuan presented major releases across inference, training, development, and trustworthy AI.

1. GenAI‑era inference service

PAI‑EAS was upgraded with the BladeLLM engine, which combines high‑performance operators, quantization, prefill/decode (PD) disaggregated distributed inference, and prompt‑cache optimization. Together these reduce first‑token latency by over 60% and token‑output latency by 70%, and increase throughput by 80%.
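Prompt caching reduces first-token latency because requests that share a prefix, such as a common system prompt, can reuse the state already computed for it. The sketch below shows only the lookup side under simplifying assumptions: the `PrefixCache` class is hypothetical, cached state is an opaque value supplied by the caller, and whole prefixes are used as dictionary keys where a real engine would hash fixed-size blocks.

```python
class PrefixCache:
    """Hypothetical prompt cache keyed by token prefixes on block boundaries."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self._blocks = {}  # prefix tuple -> cached state for that prefix

    def _prefixes(self, tokens):
        # Yield prefixes that end on block boundaries, shortest first.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            yield tuple(tokens[:end])

    def insert(self, tokens, make_state):
        # Store state for every full-block prefix not cached yet.
        for prefix in self._prefixes(tokens):
            if prefix not in self._blocks:
                self._blocks[prefix] = make_state(prefix)

    def longest_match(self, tokens):
        """Return how many leading tokens already have cached state."""
        matched = 0
        for prefix in self._prefixes(tokens):
            if prefix not in self._blocks:
                break
            matched = len(prefix)
        return matched
```

A serving engine would skip prefill for the first `longest_match(tokens)` tokens and compute only the remainder.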

Intelligent routing and dedicated gateways enable dynamic task dispatch to global clusters. PAI‑EAS now serves 16 regions with a cluster exceeding 100,000 GPUs.

2. Stable and efficient cloud AI training service

PAI enhanced cluster scheduling with the AI Scheduler, supporting heterogeneous mixed‑resource scheduling, multi‑level quota management, mixed task forms, and seamless task switching, improving resource utilization.
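Multi-level quota management can be pictured as a tree in which a job is admitted only if every quota on the path from its team up to the root has room. The sketch below is an illustration under that assumption, not the AI Scheduler's implementation; `QuotaNode` and its fields are hypothetical names.

```python
class QuotaNode:
    """Hypothetical quota node: a GPU limit plus an optional parent quota."""

    def __init__(self, name, gpu_limit, parent=None):
        self.name = name
        self.gpu_limit = gpu_limit
        self.parent = parent
        self.gpus_used = 0

    def _chain(self):
        # Walk from this node up to the root quota.
        node = self
        while node is not None:
            yield node
            node = node.parent

    def can_allocate(self, gpus):
        # Every level must have headroom, not just the leaf.
        return all(n.gpus_used + gpus <= n.gpu_limit for n in self._chain())

    def allocate(self, gpus):
        if not self.can_allocate(gpus):
            return False
        for node in self._chain():
            node.gpus_used += gpus
        return True

    def release(self, gpus):
        for node in self._chain():
            node.gpus_used -= gpus
```

Giving child quotas limits that sum to more than the parent's lets teams borrow idle capacity while the parent limit still caps total usage.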

A new bidding (spot) task mode offers cost‑effective compute for latency‑insensitive training and exploratory workloads. Comprehensive monitoring, proactive detection, and automatic fault tolerance further boost stability.

3. Integrated big‑data AI platform with best practices

PAI provides a full‑life‑cycle AI data asset service, supporting multimodal cleaning, analysis, intelligent labeling, augmentation, and global model/data lineage. It offers open models, notebooks, and pipeline workflows, plus low‑code tools QuickStart (LLMOps) and ArtLab (AIGC) to lower development barriers.

4. New trustworthy AI capabilities

PAI introduced a trustworthy AI module featuring toxic data cleaning, algorithmic fairness and error detection, confidential‑compute containers, and inappropriate content interception to ensure model and data security.

5. Full‑scale enterprise capabilities

PAI delivers comprehensive enterprise features to manage AI compute resources, developers, permissions, and assets, enabling production‑grade, high‑quality models and applications.

Huang Boyuan emphasized that PAI is a one‑stop platform for enterprises and developers, seamlessly connecting cloud GPU resources with model training and inference services, and continuously evolving to provide low‑cost, accessible AI for all.
