Optimizing Small Perception Models on Different Compute Cards for Autonomous Driving

This article shares practical experience training small perception‑detection models on two different compute cards, covering environment setup, technical architecture, common dependency issues, and performance‑boosting techniques such as CPU process pools, torch DataLoader tuning, NCCL P2P handling, and CPFS storage optimization.

Alibaba Cloud Developer

Jul 24, 2025

Background

With the rapid development of intelligent driving, real‑time perception and decision‑making are critical. Tasks such as object detection, semantic segmentation, and multi‑sensor fusion rely on large deep‑learning models, whose training and deployment demand high compute, memory bandwidth, and energy efficiency. Comparing different compute cards helps customers choose suitable resources for high‑precision perception tasks.

Test Environment Configuration

Environment: mmdet3d, mmcv, flash‑attn, nuscenes‑devkit, torchrun distributed training framework.

Compute: two different cards, referred to as Machine 1 and Machine 2.

Models: maptr, sparsedrive, qcnet, GaussianFormer.

Datasets: nuScenes, Argoverse.

Technical Architecture

The comparison tests run on a PAI DSW instance. The overall training steps are:

Select a DSW image matching the required Python and CUDA versions (e.g., autodrive image with pre‑installed mmcv).

Install model‑specific dependencies according to the official GitHub docs; repeat if errors occur.

Create a CPFS mount point on DSW to store model files and datasets persistently.

Run the training command and log training time and throughput.
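The last step, logging training time and throughput, can be sketched as follows. This is a minimal illustration rather than the logging code used in the tests: the ThroughputMeter class and its method names are hypothetical, and the sleep call merely stands in for a real training step.

```python
import time


class ThroughputMeter:
    """Accumulates samples and wall-clock time to report samples per second."""

    def __init__(self):
        self.samples = 0
        self.seconds = 0.0

    def update(self, batch_size, step_seconds):
        # Record one training step: samples processed and time taken.
        self.samples += batch_size
        self.seconds += step_seconds

    @property
    def samples_per_second(self):
        # Guard against division by zero before the first step is recorded.
        return self.samples / self.seconds if self.seconds > 0 else 0.0


if __name__ == "__main__":
    meter = ThroughputMeter()
    for _ in range(3):
        start = time.perf_counter()
        time.sleep(0.01)  # stand-in for one training step on a batch of 32
        meter.update(batch_size=32, step_seconds=time.perf_counter() - start)
    print(f"{meter.samples_per_second:.1f} samples/s over {meter.seconds:.3f} s")
```

When timing real GPU steps, call torch.cuda.synchronize() before reading the clock, because CUDA kernels launch asynchronously and the step would otherwise appear faster than it is.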

Four small perception models were evaluated: maptr, sparsedrive, qcnet, and GaussianFormer.

Problems & Solutions

3.1 Environment Dependency Conflicts

For maptr, installing mmcv-full==1.4.0 failed because it requires torch 1.9.1–1.10.0, which is unavailable on the default DSW image; the source build aborted at the nvcc step, as the log below shows. The solution was to use an Ubuntu bare image with Python 3.8 and CUDA 11.1, install a compatible torch version, then install mmcv.

Collecting mmcv-full==1.4.0
  Using cached mmcv_full-1.4.0.tar.gz (2.8 MB)
  Preparing metadata (setup.py): started
  ...
  error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
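A quick pre-install check can catch this kind of mismatch before pip starts a long source build. The sketch below is an illustration only: the helper names are hypothetical, and the supported range is simply the torch 1.9.1–1.10.0 window quoted above, not something read from mmcv's metadata.

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from strings such as '1.9.1+cu111'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def torch_supported(installed, lowest="1.9.1", highest="1.10.0"):
    """True if the installed torch version lies within [lowest, highest]."""
    return parse_version(lowest) <= parse_version(installed) <= parse_version(highest)


if __name__ == "__main__":
    # Candidate version strings; in practice this would be torch.__version__.
    for candidate in ("1.9.1+cu111", "1.10.0", "2.1.0"):
        print(candidate, torch_supported(candidate))
```

Tuples of integers are compared instead of raw strings because a plain string comparison would rank "1.10.0" below "1.9.1".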

3.2 Source Code Adaptation

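Some model source code needs small changes before it runs on a given card and software stack. For example, modify the argument parser to accept --local-rank with a default value, so the training script starts whether or not the launcher passes that flag.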

parser.add_argument('--local-rank', type=int, default=0)
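A slightly more defensive variant is sketched below, assuming the script may be started by different launchers: torchrun exports the LOCAL_RANK environment variable, while torch.distributed.launch passes the rank as a flag, spelled --local_rank in older PyTorch releases and --local-rank in newer ones. The function name parse_local_rank is hypothetical.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Return the local rank from either flag spelling or the LOCAL_RANK env var."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local-rank", "--local_rank",  # accept both spellings
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),  # fall back to the env var
    )
    # parse_known_args ignores the training script's other flags.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


if __name__ == "__main__":
    print("local rank:", parse_local_rank())
```

An explicit flag takes precedence over the environment variable, and the rank defaults to 0 for single-process runs.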

3.3 Performance Optimization

3.3.1 CPU Acceleration

Training logs showed low GPU utilization because most time was spent on CPU data preprocessing. Using a process pool and shared memory can speed up CPU work.

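import os
from multiprocessing import Pool

def image_process(image):
    # CPU-bound preprocessing; the body is elided in the source
    ...
    return tensor

if __name__ == '__main__':
    n = os.cpu_count()  # assumption: one worker per CPU core as a starting point
    # image_list is the collection of input images, built elsewhere
    with Pool(processes=n) as pool:
        results = pool.map(image_process, image_list)
        print(results)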
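The snippet above covers the process pool but not shared memory. One way to combine the two, using only the standard library (Python 3.8+), is sketched below: the parent places raw bytes in a shared block and each worker edits its own slice in place, so no image data is pickled between processes. The byte inversion is a stand-in for real preprocessing, and the function names are hypothetical.

```python
from multiprocessing import Pool, shared_memory


def invert_range(task):
    """Worker: attach to the shared block by name and invert bytes [start, stop)."""
    shm_name, start, stop = task
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        buf = shm.buf
        for i in range(start, stop):
            buf[i] = 255 - buf[i]  # stand-in for real preprocessing
    finally:
        shm.close()  # detach only; the parent owns and unlinks the block
    return stop - start


def split_ranges(total, parts):
    """Split range(total) into at most `parts` near-equal (start, stop) slices."""
    step = max(1, -(-total // parts))  # ceiling division
    return [(s, min(s + step, total)) for s in range(0, total, step)]


if __name__ == "__main__":
    pixels = bytes(range(256)) * 4  # hypothetical raw image bytes
    shm = shared_memory.SharedMemory(create=True, size=len(pixels))
    try:
        shm.buf[:len(pixels)] = pixels
        tasks = [(shm.name, a, b) for a, b in split_ranges(len(pixels), 4)]
        with Pool(processes=4) as pool:
            done = pool.map(invert_range, tasks)
        print("bytes processed:", sum(done))
    finally:
        shm.close()
        shm.unlink()  # free the block once all workers are finished
```

Only the block name and slice bounds travel through the pool's task queue, which keeps inter-process traffic small even for large inputs.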

3.3.2 Torch Application Acceleration

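Enable pin_memory and increase num_workers in the DataLoader. Worker processes load and preprocess batches in parallel with training, and pinned (page‑locked) host memory speeds up host‑to‑GPU copies and allows them to run asynchronously (with non_blocking=True), so data transfer overlaps with computation.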

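from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,  # number of worker subprocesses; tune to the available CPU cores
    pin_memory=True  # page-locked host memory for faster host-to-GPU copies
)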
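How many workers to use depends on how many CPU cores each training process can claim. The helper below sketches one possible heuristic, which is an assumption rather than a rule from this article: divide the cores evenly among the training processes on the machine and keep one core free for the training loop. The function and parameter names are hypothetical; persistent_workers and prefetch_factor are standard DataLoader arguments.

```python
import os


def dataloader_kwargs(world_size=1, cpu_cores=None, reserve_cores=1, use_cuda=True):
    """Suggest DataLoader keyword arguments for one training process.

    world_size: training processes sharing this machine.
    cpu_cores: cores available; defaults to os.cpu_count().
    reserve_cores: cores left free for the training process itself.
    """
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1
    workers = max(0, cpu_cores // max(1, world_size) - reserve_cores)
    kwargs = {"num_workers": workers, "pin_memory": use_cuda}
    if workers > 0:
        # These two options are only valid when worker processes are used.
        kwargs["persistent_workers"] = True  # keep workers alive between epochs
        kwargs["prefetch_factor"] = 4        # batches each worker loads in advance
    return kwargs


if __name__ == "__main__":
    print(dataloader_kwargs(world_size=8, cpu_cores=64))
```

The returned dictionary can be unpacked into the constructor, for example DataLoader(dataset, batch_size=32, shuffle=True, **dataloader_kwargs(world_size=8)). Measuring throughput at a few worker counts remains the most reliable way to pick the final value.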

3.3.3 torch.compile

Use torch.compile (available in PyTorch 2.0 and later), which captures the model graph with TorchDynamo and lowers it through the Inductor backend, applying optimizations such as operator fusion, constant folding, and memory‑layout changes.

import torch

model = torch.compile(
    model,
    backend='inductor',
    dynamic=False,    # assume static input shapes
    fullgraph=False   # allow graph breaks instead of raising an error
)
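Because torch.compile only exists in PyTorch 2.0 and later, an environment pinned to an older torch (such as the 1.9.1–1.10.0 range needed for maptr) cannot call it. The guarded wrapper below is a sketch of one way to keep a single training script working in both cases. The function name is hypothetical, and treating RuntimeError as "unsupported platform" is an assumption.

```python
def compile_if_supported(model):
    """Return a compiled model when torch.compile is usable, else the model unchanged."""
    try:
        import torch
    except ImportError:
        return model  # PyTorch is not installed in this environment
    if not hasattr(torch, "compile"):
        return model  # PyTorch older than 2.0: run in eager mode
    try:
        return torch.compile(
            model, backend="inductor", dynamic=False, fullgraph=False
        )
    except RuntimeError:
        # Assumption: raised when the platform or Python version is unsupported.
        return model


if __name__ == "__main__":
    def double(x):
        return x * 2

    wrapped = compile_if_supported(double)
    print("callable:", callable(wrapped))
```

Compilation itself is deferred until the first call of the returned object, so the first training iterations are slower while the graph is traced and optimized.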

3.3.4 CPFS Optimization

Checkpoint write latency varied across availability zones. Using a CPFS file system in the high‑speed eRDMA‑enabled zone (Ulanqab Zone C) reduced checkpoint write time dramatically compared with the default zone (Ulanqab Zone A).
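To compare mount points before committing to one, a small write probe is often enough. The sketch below writes a file of a chosen size, forces it to storage with fsync, and reports the elapsed time. The function name is hypothetical, and the demo uses the system temporary directory as a stand-in for real CPFS mount paths, which are not given in this article.

```python
import os
import tempfile
import time


def time_checkpoint_write(directory, size_mb=64, filename="ckpt_probe.bin"):
    """Write size_mb of data into directory, fsync it, and return elapsed seconds."""
    path = os.path.join(directory, filename)
    chunk = b"\0" * (1024 * 1024)  # 1 MiB of zeros per write call
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the file system
    elapsed = time.perf_counter() - start
    os.remove(path)  # clean up the probe file
    return elapsed


if __name__ == "__main__":
    # Stand-in target; replace with the CPFS mount points being compared.
    target = tempfile.gettempdir()
    seconds = time_checkpoint_write(target, size_mb=16)
    print(f"{target}: 16 MiB in {seconds:.3f} s ({16 / seconds:.1f} MiB/s)")
```

Running the probe several times per mount point and comparing medians gives a fairer picture than a single write, since the first write may include cache and connection warm-up.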

Test Results

Loss curves show Machine 1 converged around 100 k steps, while Machine 2 converged near 180 k steps, about 1.8 times as many, which is consistent with the expected acceleration ratio.

Summary

When training small perception‑detection models in real‑world scenarios, optimize at three layers:

Scheduling layer: create a process pool and use shared memory to speed up CPU preprocessing.

Application layer: enable pin_memory, increase num_workers, and apply torch.compile for faster model execution.

Storage layer: store checkpoints on a CPFS with high‑speed eRDMA networking and select the appropriate availability zone.
