How to Run Distributed PyTorch Training on AzureML with CLI v2
This article walks through the complete workflow for building, testing, and launching a distributed PyTorch training job on Azure Machine Learning (AzureML) with the CLI v2: local script preparation, Accelerate configuration, Docker environment setup, dataset registration, compute target definition, job YAML creation, and job submission with monitoring. The running example trains a ResNet-RS50 model from timm on the Imagenette image-classification dataset using the pytorch-accelerated library.
Local preparation: First download the Imagenette dataset into a data folder, then write the training script train_imagenette/train.py. The script imports torch and timm, defines the data transforms, creates a resnetrs50 model, and sets up an AdamW optimizer and a OneCycleLR scheduler; it then assembles a Trainer with callbacks such as AccuracyCallback, EarlyStoppingCallback, and SaveBestModelCallback. The script parses --data_dir and --epochs arguments and runs training followed by evaluation.
# train_imagenette/train.py
import argparse, os
from pathlib import Path
import torch, torch.nn as nn
from timm import create_model
from torch.optim.lr_scheduler import OneCycleLR
from torchmetrics import Accuracy
from torchvision import transforms, datasets
from pytorch_accelerated.trainer import Trainer, TrainerPlaceholderValues
# ... (rest of script omitted for brevity)
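The omitted body follows the structure described above. As a rough sketch only, assuming pytorch-accelerated's documented Trainer API — the transforms, hyperparameters, and the custom torchmetrics-based AccuracyCallback are placeholders, not the original script's values:

from functools import partial
from pytorch_accelerated.callbacks import EarlyStoppingCallback, SaveBestModelCallback

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", required=True)
    parser.add_argument("--epochs", type=int, default=5)
    args = parser.parse_args()

    # Illustrative transforms; the original script's augmentations are omitted
    transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    data_dir = Path(args.data_dir)
    train_dataset = datasets.ImageFolder(data_dir / "train", transform)
    eval_dataset = datasets.ImageFolder(data_dir / "val", transform)

    model = create_model("resnetrs50", num_classes=len(train_dataset.classes))
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3)

    trainer = Trainer(
        model=model,
        loss_func=nn.CrossEntropyLoss(),
        optimizer=optimizer,
        # The custom AccuracyCallback would also be passed here
        callbacks=(EarlyStoppingCallback(early_stopping_patience=2), SaveBestModelCallback()),
    )

    # The placeholders are resolved by the Trainer once the dataloaders exist,
    # so the schedule adapts to the per-process number of update steps
    trainer.train(
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        num_epochs=args.epochs,
        per_device_batch_size=32,
        create_scheduler_fn=partial(
            OneCycleLR,
            max_lr=0.01,
            epochs=TrainerPlaceholderValues.NUM_EPOCHS,
            steps_per_epoch=TrainerPlaceholderValues.NUM_UPDATE_STEPS_PER_EPOCH,
        ),
    )

if __name__ == "__main__":
    main()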
Local verification with Accelerate: Generate an accelerate_config.yaml by running accelerate config and answering the prompts, then launch locally with:
accelerate launch --config_file train_imagenette/accelerate_config.yaml train_imagenette/train.py --data_dir data/imagenette2-320 --epochs 1
Running on a machine with two GPUs halves the number of steps per epoch relative to a single-GPU run, confirming that the batches are being sharded across processes.
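For reference, on a two-GPU workstation the generated accelerate_config.yaml looks roughly like this; the exact fields vary across Accelerate versions and with the answers given, so treat the values below as illustrative:
# train_imagenette/accelerate_config.yaml (illustrative)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2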
AzureML distributed training: AzureML offers two entry points, the CLI v2 and the Python SDK; this tutorial uses the CLI v2. Prerequisites are an Azure subscription, the Azure CLI, an AzureML workspace, and the AzureML CLI v2 extension.
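Assuming the Azure CLI is already installed, the v2 extension is added with:
az extension add --name ml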
Logging with MLflow: A custom callback, AzureMLLoggerCallback, subclasses LogMetricsCallback to send metrics to AzureML's MLflow tracking server. The callback reads the tracking URI from the MLFLOW_TRACKING_URI environment variable, logs the run configuration as tags, and records metrics only from the world-process-zero node, so that a single process writes to the server.
import os
import mlflow
from pytorch_accelerated.callbacks import LogMetricsCallback

class AzureMLLoggerCallback(LogMetricsCallback):
    def __init__(self):
        # AzureML exposes the workspace tracking server via this variable
        mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_URI'])

    def on_training_run_start(self, trainer, **kwargs):
        # Record the run configuration as tags for filtering in the UI
        mlflow.set_tags(trainer.run_config.to_dict())

    def log_metrics(self, trainer, metrics):
        # Only the main process should write to the tracking server
        if trainer.run_config.is_world_process_zero:
            mlflow.log_metrics(metrics)

Docker environment: Define a custom Dockerfile based on the pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime image that installs the AzureML packages, MLflow, and the pytorch-accelerated extras.
# train_imagenette/docker/Dockerfile
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
RUN pip install 'azureml-core==1.35.0.post1' \
'azureml-defaults==1.35.0' \
'azureml-mlflow==1.35.0' \
'azureml-telemetry==1.35.0' \
'mlflow-skinny' \
    'pytorch-accelerated[examples]>=0.1.8'
Dataset registration: Create data/register_dataset.yaml describing the Imagenette subset and register it with:
az ml dataset create -f data/register_dataset.yaml --resource-group myResourceGroup --workspace-name myWorkspace
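The dataset YAML itself is short. A sketch of what it might contain, assuming the preview-era CLI v2 dataset schema; the name must match the azureml:imagenette2-320:1 reference used later in the job config:
# data/register_dataset.yaml (sketch)
$schema: https://azuremlschemas.azureedge.net/latest/dataset.schema.json
name: imagenette2-320
version: 1
local_path: imagenette2-320  # path to the downloaded data (assumed relative to this file)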
Compute target: Define infra/create_compute_target.yaml for a STANDARD_NV24 GPU cluster (4 × NVIDIA Tesla M60 GPUs per node) and create it via:
az ml compute create -f infra/create_compute_target.yaml --resource-group myResourceGroup --workspace-name myWorkspace
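A sketch of what infra/create_compute_target.yaml might contain, using the CLI v2 amlcompute schema; the min/max instance counts are illustrative, and the name must match the azureml:gpu-cluster reference in the job config:
# infra/create_compute_target.yaml (sketch)
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: gpu-cluster
type: amlcompute
size: STANDARD_NV24
min_instances: 0
max_instances: 2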
Job configuration: The training job YAML train_imagenette/train_config.yaml specifies the command (an accelerate launch invocation with input placeholders), the inputs (epochs, number of machines and processes, and the registered dataset), the environment build context (the Dockerfile directory), the compute target, and a distribution section of type pytorch with process_count_per_instance: 1, since accelerate launch spawns the per-GPU worker processes itself. By default it also sets instance_count: 1.
# train_imagenette/train_config.yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
experiment_name: train-imagenette
command: >-
  accelerate launch
  --config_file accelerate_config.yaml
  --num_machines ${{inputs.num_machines}}
  --num_processes ${{inputs.num_processes}}
  train.py
  --epochs ${{inputs.epochs}}
  --data_dir ${{inputs.imagenette}}
inputs:
  epochs: 30
  num_machines: 1
  num_processes: 1
  imagenette:
    dataset: azureml:imagenette2-320:1
environment:
  build:
    local_path: ./docker
code:
  local_path: .
compute: azureml:gpu-cluster
distribution:
  type: pytorch
  process_count_per_instance: 1
resources:
  instance_count: 1
Job submission: Launch the job, overriding the defaults so that it runs on two nodes with eight processes in total, one per GPU:
az ml job create -f train_imagenette/train_config.yaml --set inputs.num_machines=2 inputs.num_processes=8 resources.instance_count=2 --resource-group myResourceGroup --workspace-name myWorkspace
In the studio UI, the run shows the experiment, charts of the logged metrics, and, under the Outputs + Logs tab, the streamed logs and the saved model checkpoint.
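The job can also be followed from the terminal; assuming a CLI v2 build that includes the command, az ml job stream tails the logs for a given job name:
az ml job stream --name <job-name> --resource-group myResourceGroup --workspace-name myWorkspace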
This end‑to‑end example illustrates how to move from a local PyTorch training script to a fully reproducible, scalable AzureML distributed training job using CLI v2.