Accelerate Large‑Scale LLM Training on Alibaba Cloud ACK with DeepSpeed and Arena

This guide explains how to use the Cloud-Native AI suite of Alibaba Cloud Container Service for Kubernetes (ACK) together with DeepSpeed to run distributed large-language-model training on Kubernetes. It covers prerequisites, DeepSpeed configuration, command-line job submission with Arena, and monitoring with TensorBoard.

Alibaba Cloud Native

Background

With the rapid adoption of ChatGPT, interest has grown in massive language models such as EleutherAI's 20-billion-parameter GPT-NeoX-20B and BigScience's 176-billion-parameter BLOOM. Models of this size exceed the memory capacity of a single GPU, which makes distributed training essential. DeepSpeed is a key framework for this: it provides mixed-precision training along with data-parallel, model-parallel, and pipeline-parallel optimizations.
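To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes the commonly cited accounting for mixed-precision Adam of 16 bytes per parameter (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state) and ignores activations entirely. The 80 GB GPU capacity is a hypothetical figure chosen for illustration, not something stated in this article.

```python
# Rough model-state memory for mixed-precision Adam training.
# Assumption: per parameter, 2 B fp16 weights + 2 B fp16 gradients
# + 12 B fp32 optimizer state (master weights, momentum, variance).
# Activations and temporary buffers are ignored.
BYTES_PER_PARAM = 2 + 2 + 12

GPU_MEMORY_GB = 80  # hypothetical single-GPU capacity, for illustration only


def model_state_gb(num_params: float) -> float:
    """Return the estimated model-state memory in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


for name, params in [("GPT-NeoX-20B", 20e9), ("BLOOM", 176e9)]:
    need = model_state_gb(params)
    print(f"{name}: ~{need:.0f} GB of model state, "
          f"about {need / GPU_MEMORY_GB:.1f}x one {GPU_MEMORY_GB} GB GPU")
```

Under these assumptions even the smaller model needs several GPUs' worth of memory for its training state alone, which is why the state has to be partitioned or offloaded.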

Solution Overview

Alibaba Cloud Container Service (ACK) offers a Cloud‑Native AI suite that integrates DeepSpeed, enabling users to submit distributed training jobs via the command‑line tool Arena and monitor progress with TensorBoard, all within a Kubernetes cluster.

Core Advantages

Large-scale heterogeneous resource management: Manage CPU, GPU, and FPGA resources in a unified Kubernetes cluster with flexible scheduling and GPU monitoring.

Elastic scaling and cost optimization: Auto-scale GPU nodes using HPA/VPA, leverage mixed node pools (ECS, ECI), and employ checkpointing and failover to reduce costs while maintaining high success rates.

Efficient task scheduling: Arena abstracts data, model, training, evaluation, and serving tasks, and offers GPU bin-packing, custom priorities, and tenant quota control to maximize cluster utilization.

Quick‑Start Prerequisites

Create a GPU‑enabled Kubernetes cluster.

Install the Cloud‑Native AI suite (Arena version ≥ 0.9.6).

Install the Arena client (version ≥ 0.9.6).

Configure a PVC that Arena can use, backed by shared NAS storage (or CPFS), to store training data and results.

Usage Instructions

The example trains a masked language model (MLM) using DeepSpeed. The sample code and dataset are pre‑packed into a Docker image; alternatively, you can pull the source from a Git URL and mount a shared NAS storage PVC named training-data for outputs.

Custom Image Options

Build your own image based on the provided Dockerfile (install OpenSSH, etc.).

Use the official DeepSpeed base image: registry.cn-beijing.aliyuncs.com/acs/deepspeed:v072_base.

DeepSpeed Configuration Example

Key settings enable mixed‑precision, ZeRO stage‑1 optimization, and CPU offloading:

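# Python dict from the training script; batch_size is a variable set elsewhere in that script.
ds_config = {
  "train_micro_batch_size_per_gpu": batch_size,
  "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
  "fp16": {"enabled": True},  # mixed-precision training
  "zero_optimization": {"stage": 1, "offload_optimizer": {"device": "cpu"}}  # ZeRO stage 1, optimizer state on CPU
}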
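The settings above can also be generated programmatically and written out as JSON. The following is a minimal sketch, not the sample's actual code: the helper names `build_ds_config` and `global_batch_size` and the micro-batch value of 8 are hypothetical. It also shows how the effective global batch size follows from the per-GPU micro-batch, using DeepSpeed's documented relationship between micro-batch size, gradient accumulation steps, and GPU count.

```python
import json


def build_ds_config(batch_size: int) -> dict:
    """Mirror the settings shown above as a plain Python dict."""
    return {
        "train_micro_batch_size_per_gpu": batch_size,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "fp16": {"enabled": True},  # mixed precision
        "zero_optimization": {
            "stage": 1,  # partition optimizer state across ranks
            "offload_optimizer": {"device": "cpu"},  # keep it in host memory
        },
    }


def global_batch_size(micro_batch: int, num_gpus: int,
                      grad_accum_steps: int = 1) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return micro_batch * grad_accum_steps * num_gpus


config = build_ds_config(batch_size=8)  # 8 is a hypothetical value
print(json.dumps(config, indent=2))     # True is serialized as JSON true

# Three single-GPU workers, as in the job submitted in this guide.
print(global_batch_size(micro_batch=8, num_gpus=3))
```

Keeping the config in one function makes it easy to vary the micro-batch per experiment while leaving the precision and ZeRO settings fixed.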

Submitting the Job

Run the following Arena command to launch a DeepSpeed job with one launcher pod and three worker pods (each using one GPU):

arena submit etjob \
    --name=deepspeed-helloworld \
    --gpus=1 \
    --workers=3 \
    --image=registry.cn-beijing.aliyuncs.com/acs/deepspeed:hello-deepspeed \
    --data=training-data:/data \
    --tensorboard \
    --logdir=/data/deepspeed_data \
    "deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/train_bert_ds.py --checkpoint_dir /data/deepspeed_data"

Expected output confirms job creation and provides a command to query status.
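When jobs are submitted from scripts or pipelines, the same command can be assembled in Python. This is a sketch under the assumption that only the flags shown in the command above are needed; the helper name `build_arena_command` is hypothetical, and the snippet only prints the command instead of running it.

```python
import shlex
from typing import List


def build_arena_command(name: str, workers: int, gpus: int, image: str,
                        data_mount: str, logdir: str,
                        train_cmd: str) -> List[str]:
    """Assemble the arena argv using only the flags shown in this guide."""
    return [
        "arena", "submit", "etjob",
        f"--name={name}",
        f"--gpus={gpus}",          # GPUs per worker
        f"--workers={workers}",
        f"--image={image}",
        f"--data={data_mount}",    # <pvc-name>:<mount-path>
        "--tensorboard",
        f"--logdir={logdir}",
        train_cmd,                 # passed as one quoted argument
    ]


cmd = build_arena_command(
    name="deepspeed-helloworld",
    workers=3,
    gpus=1,
    image="registry.cn-beijing.aliyuncs.com/acs/deepspeed:hello-deepspeed",
    data_mount="training-data:/data",
    logdir="/data/deepspeed_data",
    train_cmd=("deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/"
               "train_bert_ds.py --checkpoint_dir /data/deepspeed_data"),
)
print(shlex.join(cmd))
# To submit for real (requires a configured Arena client):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell-quoting problems with the training command, which must reach Arena as a single argument.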

Monitoring and Management

Check job details with arena get deepspeed-helloworld and view logs with arena logs deepspeed-helloworld. To open TensorBoard, forward its service to local port 9090, then browse to http://localhost:9090:

kubectl port-forward svc/deepspeed-helloworld-tensorboard 9090:6006

The suite creates the necessary DeepSpeed launcher and worker pods automatically.
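The port-forward step can be wrapped in a small helper. The sketch below assumes the TensorBoard service is always named after the job with a -tensorboard suffix; that pattern is inferred from the single example in this guide and is not confirmed for other jobs. The function names are hypothetical, and the snippet only prints the kubectl command instead of running it.

```python
import socket
from typing import List


def tensorboard_port_forward(job_name: str, local_port: int = 9090,
                             remote_port: int = 6006) -> List[str]:
    """Build the kubectl argv; the service name pattern is an assumption."""
    return [
        "kubectl", "port-forward",
        f"svc/{job_name}-tensorboard",
        f"{local_port}:{remote_port}",
    ]


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(" ".join(tensorboard_port_forward("deepspeed-helloworld")))
# Once kubectl port-forward is running in another terminal:
#   port_is_open("127.0.0.1", 9090)
```

Checking the local port before opening a browser gives a quick signal on whether the forward is actually up.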
