Boost Your AI Model Training with Ray on Alibaba Cloud PAI – A Step‑by‑Step Guide

This article introduces the integration of Ray, an open‑source distributed AI framework, with Alibaba Cloud's PAI platform. It covers the integration's advantages, architecture, fault tolerance, and resource management, and gives a step‑by‑step tutorial (configuration, command examples, and code snippets) for running Ray jobs on PAI‑DLC.

Alibaba Cloud Big Data AI Platform

Post‑training is a crucial stage in deploying large models: it improves model performance while requiring fewer resources than pre‑training.

Ray and PAI Overview

Ray

Ray is an open‑source distributed computing framework widely used in reinforcement learning and large‑model training (for example, OpenAI's ChatGPT). Its libraries include Ray Tune, RLlib, Ray Serve, and Ray Train (the successor to RaySGD), which together support efficient parallel AI training.

Alibaba Cloud AI Platform (PAI)

PAI (Platform for Artificial Intelligence) is a one‑stop AI development platform offering data annotation, model development, training, inference, and a suite of enterprise‑grade cloud‑native AI capabilities.

Ray on PAI Features

One‑Click Submission

Ray on PAI (via PAI‑DLC) integrates Ray seamlessly, allowing users to submit Ray scripts without manually deploying a Ray cluster or handling Kubernetes configurations.

Unified Scheduling & High Utilization

Submitting Ray jobs to PAI‑DLC leverages PAI’s unified scheduler with network and compute topology awareness, multiple queuing strategies, and multi‑level quota sharing, achieving over 90% cluster utilization.

Fault Tolerance & Reliability

Ray‑native fault tolerance: Application‑level and system‑level mechanisms detect more than 90% of faults and keep RL training stable for weeks.

High availability: AIMaster elastic fault‑tolerance engine, node self‑healing, and EasyCKPT provide minute‑level node recovery and second‑level checkpoint saving.

Observability: Fine‑grained metric system covering tasks, pods, and GPUs; integration with Ray Dashboard, CloudMonitor, and ARMS.
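The article does not say which Ray options PAI relies on for application‑level fault tolerance. As a rough illustration only, the sketch below uses Ray's own retry settings: max_restarts and max_task_retries for actors, and max_retries and retry_exceptions for tasks. The retry counts, the Counter class, and the double_step function are assumptions made for this example, and Ray is imported inside run() so the plain Python parts can be loaded on a machine without Ray.

```python
# Assumed retry budgets; the article gives no concrete values.
ACTOR_FT_OPTIONS = {"max_restarts": 3, "max_task_retries": 2}
TASK_FT_OPTIONS = {"max_retries": 3, "retry_exceptions": True}


class Counter:
    """Plain class; it becomes a restartable actor once wrapped by ray.remote."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value


def double_step(x):
    """Stand-in for a unit of work that Ray may retry after a failure."""
    return x * 2


def run():
    import ray  # imported lazily so this file loads without Ray installed

    ray.init()
    # Ray restarts the actor process up to max_restarts times and retries
    # in-flight method calls up to max_task_retries times.
    resilient_counter_cls = ray.remote(**ACTOR_FT_OPTIONS)(Counter)
    # Ray re-runs the task up to max_retries times, including on exceptions.
    resilient_step = ray.remote(**TASK_FT_OPTIONS)(double_step)

    counter = resilient_counter_cls.remote()
    print(ray.get(counter.inc.remote()))
    print(ray.get(resilient_step.remote(21)))
```

A restarted actor runs __init__ again, so in‑memory state such as value is lost unless the application restores it from a checkpoint.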

Enterprise‑Grade Capabilities

Workspace management for teams and resources.

User, role, and permission management.

Fine‑grained resource allocation and monitoring.

Task scheduling policies, alerts, and event notifications.

Asset management for datasets, models, and code.

Integration via PAIFlow workflow and OpenAPI/SDK.

Step‑by‑Step Usage

Log in to the PAI console, navigate to PAI‑DLC, create a new task, and select the Ray framework.

Configure the node image (Ray version 2.6 or later; 2.9 is recommended) and the runtime environment.

Set the entrypoint command, e.g., python /root/code/sample.py.

Note that only the first command is submitted as the entrypoint, so multiple commands must be chained into a single line with ';' or '&&'.
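One way to avoid mistakes when several setup commands are needed is to build the single‑line entrypoint programmatically. The helper below is a hypothetical convenience, not part of PAI or Ray, and the requirements.txt path is an assumed example. It joins commands with '&&' (stop at the first failure) or ';' (run every command regardless).

```python
def chain_commands(commands, stop_on_error=True):
    """Join shell commands into one line for a single-command entrypoint field."""
    cleaned = [c.strip() for c in commands if c.strip()]
    if not cleaned:
        raise ValueError("at least one command is required")
    # '&&' runs the next command only if the previous one succeeded;
    # ';' runs the next command regardless of the previous exit status.
    separator = " && " if stop_on_error else " ; "
    return separator.join(cleaned)


entrypoint = chain_commands(
    [
        "pip install -r /root/code/requirements.txt",  # assumed path
        "python /root/code/sample.py",
    ]
)
print(entrypoint)
# pip install -r /root/code/requirements.txt && python /root/code/sample.py
```

For setup steps, '&&' is usually the safer choice, because the training script does not start if an installation step fails.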

Example of an inline script submission:

echo 'import os

import ray
import requests  # unused below; needs the requests package in the image or runtime environment

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        # counter_name is read from the environment; it is None if unset.
        self.name = os.getenv("counter_name")
        self.counter = 0

    def inc(self):
        self.counter += 1

    def get_counter(self):
        return "{} got {}".format(self.name, self.counter)

counter = Counter.remote()
for _ in range(50000):
    ray.get(counter.inc.remote())
    print(ray.get(counter.get_counter.remote()))' > sample.py && python sample.py
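The inline script above imports requests and reads the counter_name environment variable, so both must be available on the cluster. The sketch below shows how that could be expressed with Ray's runtime_env argument, using its pip and env_vars fields. The value "ray" for counter_name is an assumption, and the article does not say whether this or the runtime environment setting in the PAI‑DLC console is preferred.

```python
# Assumed values: the article does not state the package list or the variable value.
RUNTIME_ENV = {
    "pip": ["requests"],                  # packages installed for the job's workers
    "env_vars": {"counter_name": "ray"},  # values must be strings
}


def init_with_runtime_env(runtime_env=None):
    """Start or connect to Ray with the given runtime environment."""
    import ray  # imported lazily so this file loads without Ray installed

    return ray.init(runtime_env=runtime_env or RUNTIME_ENV)
```

With this in place, replacing the bare ray.init() call in sample.py with init_with_runtime_env() would make each Counter actor see counter_name in its environment.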

Configure task nodes: exactly one Head node and any number of Worker nodes. A Submitter node runs the entrypoint command.

Allocate resources; for example, a node with 8 GPUs provides 8 logical GPU resources to the Ray cluster.

Use @ray.remote to specify resources, e.g., @ray.remote(num_gpus=4).
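To make the resource arithmetic concrete, the sketch below estimates how many GPU tasks can run at once and shows a task that requests GPUs through @ray.remote. It assumes whole‑GPU requests, that GPUs are the only limiting resource, and that a single task cannot span nodes. The function names are made up for this example, and Ray is imported inside run_gpu_tasks() so the arithmetic helper works without Ray installed.

```python
def max_concurrent_tasks(gpus_per_node, num_nodes, gpus_per_task):
    """Upper bound on simultaneous tasks when each task must fit on one node."""
    if gpus_per_task <= 0:
        raise ValueError("gpus_per_task must be a positive integer")
    # GPUs left over on a node cannot be combined with GPUs on another node.
    return num_nodes * (gpus_per_node // gpus_per_task)


def run_gpu_tasks(num_tasks=2, gpus_per_task=4):
    import ray  # imported lazily so this file loads without Ray installed

    ray.init()

    @ray.remote(num_gpus=gpus_per_task)
    def visible_gpus():
        # Ray reports the GPU IDs it assigned to this task.
        return ray.get_gpu_ids()

    return ray.get([visible_gpus.remote() for _ in range(num_tasks)])


print(max_concurrent_tasks(gpus_per_node=8, num_nodes=2, gpus_per_task=4))  # 4
print(max_concurrent_tasks(gpus_per_node=8, num_nodes=2, gpus_per_task=3))  # 4
```

In the second call, two GPUs per node stay idle because 8 is not a multiple of 3, which is one reason to pick per‑task GPU counts that divide the node size evenly.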

Practical Example

Train a PyTorch ResNet model on GPUs through the PAI‑DLC UI, using the Ray ML image (ray‑ml:2.9.3‑py310‑cu118) and following the steps above.
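The article does not include the training script itself. As a minimal sketch of what such a script might look like with Ray Train, the example below trains torchvision's resnet18 on synthetic tensors. The hyperparameters, the synthetic data, the worker count, and the function names are all assumptions, and the imports sit inside the functions so the file can be loaded without Ray or PyTorch installed.

```python
# Assumed hyperparameters; none of these values come from the article.
TRAIN_CONFIG = {"lr": 0.01, "batch_size": 64, "epochs": 1, "num_classes": 10}


def train_loop_per_worker(config):
    """Runs on every Ray Train worker; each worker uses one GPU."""
    import torch
    import torchvision
    import ray.train
    import ray.train.torch

    model = torchvision.models.resnet18(num_classes=config["num_classes"])
    # Moves the model to this worker's device and wraps it for distributed training.
    model = ray.train.torch.prepare_model(model)

    # Synthetic stand-in for a real dataset.
    images = torch.randn(256, 3, 64, 64)
    labels = torch.randint(0, config["num_classes"], (256,))
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(images, labels),
        batch_size=config["batch_size"],
        shuffle=True,
    )
    # Shards the data across workers and moves batches to the right device.
    loader = ray.train.torch.prepare_data_loader(loader)

    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(config["epochs"]):
        for batch_images, batch_labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch_images), batch_labels)
            loss.backward()
            optimizer.step()
        ray.train.report({"epoch": epoch, "loss": loss.item()})


def main(num_workers=2):
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config=TRAIN_CONFIG,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=True),
    )
    result = trainer.fit()
    print(result.metrics)
```

Calling main() at the end of the script submitted as the entrypoint would start the job. With use_gpu=True, num_workers should not exceed the number of GPUs the cluster provides.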
