Tagged articles
5 articles
Fun with Large Models
Aug 30, 2025 · Artificial Intelligence

How to Fine‑Tune Large Models on Multiple Nodes and GPUs – A Must‑Know Interview Answer

This article explains how to fine‑tune large models across multiple machines and GPUs. It covers data, model, tensor, and pipeline parallelism; hybrid 3D parallel strategies; engineering details such as NCCL, PyTorch Distributed, DeepSpeed, fault tolerance, and checkpointing; and the ZeRO optimizer stages, which cut per‑GPU memory use by partitioning optimizer states, gradients, and parameters across devices.

Data Parallel · DeepSpeed · Distributed Training
8 min read
Instant Consumer Technology Team
Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

Migrating a multimodal image inference system from an internal network to a cloud environment showed that NVLink bridges substantially speed up multi‑GPU inference by reducing inter‑GPU communication overhead. The article also compares the trade‑offs of tensor‑parallel and data‑parallel strategies for model deployment.

AI Performance · Data Parallel · GPU inference
11 min read
Huawei Cloud Developer Alliance
Jul 17, 2023 · Artificial Intelligence

How MindSpore’s Auto Parallel Tech Simplifies Large-Model Training

During a livestream titled “Solving the ‘Development Difficulty’ of Large Models with MindSpore Auto Parallel”, Huawei’s MindSpore experts explained how the framework’s distributed training techniques—including data, model, and pipeline parallelism as well as memory‑saving strategies—enable efficient pre‑training of trillion‑parameter models across diverse AI domains.

Data Parallel · Distributed Training · Memory Optimization
6 min read
DataFunSummit
Nov 29, 2021 · Artificial Intelligence

Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention

This article reviews Horovod, a popular third‑party distributed deep‑learning training plugin, explaining its simple three‑line integration, the challenges of deadlocks in all‑reduce operations, and the architectural components—including background threads, coordinators, and MPI/Gloo controllers—that enable scalable and efficient data‑parallel training.

Data Parallel · Deep Learning · Distributed Training
8 min read
360 Tech Engineering
May 10, 2019 · Artificial Intelligence

Distributed Training with MXNet: Data Parallel on Single and Multi‑Node GPUs and Integration with Kubeflow

This article explains how MXNet supports data‑parallel training on single‑machine multi‑GPU and multi‑machine multi‑GPU setups, describes KVStore modes, outlines the worker‑server‑scheduler architecture, and shows how to launch large‑scale distributed training using Kubeflow and the mxnet‑operator.

Data Parallel · Deep Learning · Distributed Training
11 min read