Optimizing Small Perception Models on Different Compute Cards for Autonomous Driving
This article shares practical experience training perception‑detection mini‑models on two different compute cards, covering environment setup, technical architecture, common dependency issues, performance‑boosting tricks such as CPU process pools, torch dataloader tuning, NCCL P2P handling, and CPFS storage optimization.
Background
With the rapid development of intelligent driving, real‑time perception and decision‑making are critical. Tasks such as object detection, semantic segmentation, and multi‑sensor fusion rely on large deep‑learning models, whose training and deployment demand high compute, memory bandwidth, and energy efficiency. Comparing different compute cards helps customers choose suitable resources for high‑precision perception tasks.
Test Environment Configuration
Environment: mmdet3d, mmcv, flash‑attn, nuscenes‑devkit, torchrun distributed training framework.
Compute: two different cards, referred to as Machine 1 and Machine 2.
Models: maptr, sparsedrive, qcnet, GaussianFormer.
Datasets: nuScenes, Argoverse.
Technical Architecture
The comparison tests run on a PAI DSW instance. The overall training steps are:
Select a DSW image matching the required Python and CUDA versions (e.g., autodrive image with pre‑installed mmcv).
Install model‑specific dependencies according to the official GitHub docs; repeat if errors occur.
Create a CPFS mount point on DSW to store model files and datasets persistently.
Run the training command and log training time and throughput.
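The last step, logging training time and throughput, can be sketched as follows. This is a minimal illustration rather than the article's actual logging code: the ThroughputMeter class and its method names are hypothetical, and it assumes throughput is counted in samples per second of wall-clock time.

```python
import time


class ThroughputMeter:
    """Accumulates processed samples and elapsed wall-clock time across steps.

    Hypothetical helper for illustration; not part of any framework named above.
    """

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the meter can be tested
        self._start = None
        self.samples = 0
        self.elapsed = 0.0

    def start(self):
        """Mark the beginning of one training step."""
        self._start = self._clock()

    def stop(self, batch_size):
        """Mark the end of the step and record how many samples it processed."""
        self.elapsed += self._clock() - self._start
        self.samples += batch_size
        self._start = None

    @property
    def samples_per_second(self):
        return self.samples / self.elapsed if self.elapsed > 0 else 0.0


if __name__ == "__main__":
    meter = ThroughputMeter()
    for _ in range(3):
        meter.start()
        time.sleep(0.01)         # stand-in for one forward/backward pass
        meter.stop(batch_size=32)
    print(f"{meter.samples_per_second:.1f} samples/s over {meter.elapsed:.3f} s")
```

When the step runs on a GPU, the work is queued asynchronously, so the device should be synchronized before stop() is called; otherwise the measured time covers only the kernel launches.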
Four small perception models were evaluated:
Category | Model | Main Dependencies | Official Repo
--- | --- | --- | ---
Perception‑Map Construction | maptr v1 | mmdet3d 0.17.2, mmcv 1.4.0 | https://github.com/hustvl/MapTR
Perception‑End‑to‑End | sparsedrive | mmcv 1.7.1, flash‑attn 2.3.2 | https://github.com/swc-17/SparseDrive
Perception‑Prediction | QCNet | mmdet3d 0.17.2 | https://github.com/ZikangZhou/QCNet
Perception‑Object Detection | GaussianFormer | mmcv 2.0.1, mmdet3d 1.1.1 | https://github.com/huang-yh/GaussianFormer
Problems & Solutions
3.1 Environment Dependency Conflicts
For maptr, installing mmcv 1.4.0 (the mmcv-full package) failed at build time because it requires torch 1.9.1–1.10.0, which is not available on the default DSW image. The fix was to start from a bare Ubuntu image with Python 3.8 and CUDA 11.1, install a compatible torch version first, and then install mmcv. The failed install produced the following log:
Collecting mmcv-full==1.4.0
Using cached mmcv_full-1.4.0.tar.gz (2.8 MB)
Preparing metadata (setup.py): started
...
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
3.2 Source Code Adaptation
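The environments on the two cards may require small source changes. For example, if the launcher passes the dashed --local-rank flag (torch.distributed.launch does so from PyTorch 2.0 onward, where older versions passed --local_rank), the script's argument parser must accept --local-rank and give it a default value.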
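The one-line change that follows is the minimal fix. A more defensive variant is sketched here, assuming the same script may be started either by torchrun (which exports the LOCAL_RANK environment variable) or by a launcher that passes the flag on the command line. The function name parse_local_rank is hypothetical.

```python
import argparse
import os


def parse_local_rank(argv=None, environ=None):
    """Return the local rank from the CLI flag, else LOCAL_RANK, else 0.

    Hypothetical helper: accepts both the dashed and the underscored spelling.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local-rank", "--local_rank",
        dest="local_rank",
        type=int,
        # Fall back to the environment variable when no flag is given.
        default=int(environ.get("LOCAL_RANK", 0)),
    )
    # parse_known_args ignores the script's other arguments in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


if __name__ == "__main__":
    print(parse_local_rank())
```

An explicit flag takes precedence over the environment variable, so the same script behaves consistently under either launcher.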
parser.add_argument('--local-rank', type=int, default=0)
3.3 Performance Optimization
3.3.1 CPU Acceleration
Training logs showed low GPU utilization because most time was spent on CPU data preprocessing. Using a process pool and shared memory can speed up CPU work.
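The article's snippet that follows covers only the process pool. For the shared-memory half, one possible sketch uses Python's standard multiprocessing.shared_memory module (Python 3.8+): the parent places the raw bytes in a shared block once, and each worker attaches to it by name instead of receiving a pickled copy. The function names, the byte payload, and the checksum workload are illustrative stand-ins, not the author's preprocessing code.

```python
from multiprocessing import Pool, shared_memory


def checksum_slice(task):
    """Worker: attach to an existing shared block by name and reduce one slice."""
    name, start, stop = task
    shm = shared_memory.SharedMemory(name=name)   # attach; the data is not copied
    try:
        with shm.buf[start:stop] as view:         # release the view before closing
            return sum(view)
    finally:
        shm.close()                               # detach only; the parent owns the block


def make_tasks(name, total, chunks):
    """Split [0, total) into at most `chunks` contiguous (name, start, stop) tasks."""
    step = -(-total // chunks)                    # ceiling division
    return [(name, i, min(i + step, total)) for i in range(0, total, step)]


if __name__ == "__main__":
    payload = bytes(range(256)) * 1024            # stand-in for decoded image bytes
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        shm.buf[:len(payload)] = payload
        with Pool(processes=4) as pool:
            partials = pool.map(checksum_slice, make_tasks(shm.name, len(payload), 4))
        print(sum(partials))
    finally:
        shm.close()
        shm.unlink()                              # free the block once all workers are done
```

Only the block name and slice bounds cross the process boundary, which is the point of the technique: large inputs are not serialized for every task.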
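from multiprocessing import Pool

def image_process(image):
    # Decode, resize, and normalize one image here (elided in the source).
    tensor = ...
    return tensor

if __name__ == '__main__':
    n = 8            # placeholder: number of worker processes, tune to the available CPU cores
    image_list = []  # placeholder: fill with the images to preprocess
    with Pool(processes=n) as pool:
        results = pool.map(image_process, image_list)
    print(results)
3.3.2 Torch Application Acceleration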
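Enable pin_memory so that batches are staged in page-locked host memory, which makes host-to-GPU copies faster and lets them be issued asynchronously (with non_blocking=True). Increase num_workers so that batches are loaded by parallel worker processes and data loading overlaps with GPU compute.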
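from torch.utils.data import DataLoader

# `dataset` is assumed to be an existing torch.utils.data.Dataset.
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,    # worker subprocesses for loading; tune against the available CPU cores
    pin_memory=True,  # page-locked host memory for faster host-to-GPU copies
)
3.3.3 torch.compile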
torch.compile (PyTorch 2.0 and later) uses TorchDynamo to capture the model's computation graph. The Inductor backend then applies optimizations such as operator fusion, constant folding, and memory-layout changes before generating the kernels that run.
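import torch

# `model` is assumed to be an existing torch.nn.Module.
model = torch.compile(
    model,
    backend='inductor',
    dynamic=False,    # assume static input shapes
    fullgraph=False,  # allow graph breaks instead of raising an error
)
3.3.4 CPFS Optimization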
Checkpoint write latency varied across zones. Using a CPFS in the high‑speed eRDMA‑enabled zone (Ulan C) reduced write time dramatically compared with the default zone (Ulan A).
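To compare zones in a repeatable way, checkpoint write latency can be probed with a small timing script. The sketch below is an assumption about how such a measurement might be done, not the article's benchmark: the function name time_checkpoint_write is hypothetical, and it writes dummy bytes rather than a real checkpoint.

```python
import os
import tempfile
import time


def time_checkpoint_write(directory, size_bytes, chunk_bytes=4 * 1024 * 1024):
    """Write size_bytes of dummy data under directory and return elapsed seconds.

    Hypothetical probe: the temporary file is removed after the measurement.
    """
    chunk = b"\0" * chunk_bytes
    fd, path = tempfile.mkstemp(dir=directory, suffix=".ckpt.tmp")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                n = min(remaining, chunk_bytes)
                f.write(chunk[:n])
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # include the time for data to reach storage
        return time.perf_counter() - start
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Point this at each CPFS mount in turn; the temp dir is only a stand-in.
    with tempfile.TemporaryDirectory() as target:
        seconds = time_checkpoint_write(target, 64 * 1024 * 1024)
        print(f"64 MiB written in {seconds:.3f} s")
```

Running the same probe against the mount point in each zone, with a size close to the real checkpoint, gives a rough like-for-like comparison. Real checkpoints also pay serialization cost, which this sketch does not measure.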
Test Results
Loss curves show Machine 1 converged around 100 k steps, while Machine 2 converged near 180 k steps, confirming the expected acceleration ratio.
Summary
When training perception‑detection mini‑models in real‑world scenarios, follow these three optimization layers:
Scheduling layer: create a process pool and use shared memory to speed up CPU preprocessing.
Application layer: enable pin_memory, increase num_workers, and apply torch.compile for faster model execution.
Storage layer: store checkpoints on a CPFS with high‑speed eRDMA networking and select the appropriate availability zone.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
