Tagged articles

data parallelism

11 articles · Page 1 of 1
DeepHub IMBA
DeepHub IMBA
Jun 23, 2026 · Artificial Intelligence

Parallel Training of 100B‑Parameter Models: Intra‑Node Tensor Parallelism and Inter‑Node Data Parallelism

Training 100‑billion‑parameter Transformers is limited by GPU memory rather than compute, requiring a mix of tensor parallelism within nodes and data parallelism across nodes, along with pipeline parallelism, gradient accumulation, and careful framework choices to balance memory, bandwidth, and compute overheads.

GPU memorydata parallelismdistributed training
0 likes · 14 min read
Parallel Training of 100B‑Parameter Models: Intra‑Node Tensor Parallelism and Inter‑Node Data Parallelism
MaGe Linux Operations
MaGe Linux Operations
Jun 18, 2026 · Artificial Intelligence

How to Pick the Right Parallelism for 7B‑70B Models: DP, TP, PP, ZeRO & FSDP

This guide walks engineers through the memory, compute and bandwidth limits of training 7B‑70B models, compares data parallel (DP/DDP), tensor parallel (TP), pipeline parallel (PP), ZeRO stages and FSDP, shows how to calculate GPU memory, estimate communication overhead, configure each strategy, and avoid common pitfalls, enabling you to decide which parallelism to use on multi‑GPU or multi‑node clusters.

DeepSpeedFSDPZeRO
0 likes · 24 min read
How to Pick the Right Parallelism for 7B‑70B Models: DP, TP, PP, ZeRO & FSDP
IT Services Circle
IT Services Circle
Nov 28, 2025 · Artificial Intelligence

Unlocking AI Model Speed: How Data, Pipeline, Tensor & Expert Parallelism Work

AI model training relies on parallel computing, and this guide explains the four main parallelism strategies—Data Parallelism, Pipeline Parallelism, Tensor Parallelism, and Expert Parallelism—detailing their mechanisms, advantages, drawbacks, and how techniques like ZeRO and mixed 3D parallelism optimize memory and performance for massive models.

3D ParallelismAI parallelismExpert Parallelism
0 likes · 14 min read
Unlocking AI Model Speed: How Data, Pipeline, Tensor & Expert Parallelism Work
Architect
Architect
May 26, 2025 · Artificial Intelligence

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.

AI trainingExpert Parallelismdata parallelism
0 likes · 14 min read
Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism
AI Algorithm Path
AI Algorithm Path
May 11, 2025 · Artificial Intelligence

How to Parallelize Ultra‑Large Model Training with PyTorch

The article explains the core concepts and trade‑offs of five parallelism techniques—data, tensor, context, pipeline, and expert parallelism—plus the ZeRO optimizer, showing when each method is appropriate for training ultra‑large PyTorch models and providing concrete code snippets and performance considerations.

Context ParallelismExpert ParallelismLarge‑Scale Training
0 likes · 21 min read
How to Parallelize Ultra‑Large Model Training with PyTorch
Baobao Algorithm Notes
Baobao Algorithm Notes
Sep 28, 2024 · Artificial Intelligence

Master Distributed Training for Massive AI Models on Multi‑GPU Clusters

This guide walks you through the fundamentals of distributed training for large AI models, explaining data, model, and pipeline parallelism, GPU communication primitives, and advanced techniques like Megatron 3‑D parallelism and DeepSpeed ZeRO stages, with practical examples and visual illustrations to help you design efficient multi‑GPU training pipelines.

DeepSpeedGPU communicationMegatron
0 likes · 27 min read
Master Distributed Training for Massive AI Models on Multi‑GPU Clusters
DataFunSummit
DataFunSummit
Mar 31, 2024 · Artificial Intelligence

Challenges and Techniques in Distributed Training of Large Language Models

This article reviews the rapid development of large language models since 2019, outlines the historical background, identifies key challenges such as massive compute demand, memory constraints, and system complexity, and then details distributed training technologies—including data parallelism, pipeline parallelism, and advanced optimization strategies—while also discussing future research directions and answering common questions.

AI InfrastructureDeepSpeeddata parallelism
0 likes · 23 min read
Challenges and Techniques in Distributed Training of Large Language Models
360 Smart Cloud
360 Smart Cloud
Jan 26, 2024 · Artificial Intelligence

Parallel Strategies for Distributed Deep Learning Training

This article reviews distributed training techniques for large deep‑learning models, covering data parallelism, model parallelism (including pipeline and tensor parallelism), gradient bucketing and accumulation, 3D parallelism, and practical implementations such as Megatron‑LM and 360AI platform optimizations.

AIMegatron-LMdata parallelism
0 likes · 22 min read
Parallel Strategies for Distributed Deep Learning Training
DataFunTalk
DataFunTalk
Oct 20, 2023 · Artificial Intelligence

Building the ATLAS Automated Machine Learning Platform at Du Xiaoman: Architecture, Practices, and Optimizations

This article describes how Du Xiaoman tackled the high cost, instability, and long cycles of AI algorithm deployment by building the ATLAS automated machine learning platform, detailing its four‑stage workflow, component platforms, scaling and efficiency techniques, and practical Q&A for practitioners.

AI DeploymentAutoMLMachine Learning Platform
0 likes · 22 min read
Building the ATLAS Automated Machine Learning Platform at Du Xiaoman: Architecture, Practices, and Optimizations
Tencent Architect
Tencent Architect
Jul 29, 2021 · Artificial Intelligence

Performance Optimization of Advertising Coarse‑Ranking Training on the Light Framework

This article analyzes the bottlenecks of advertising coarse‑ranking training on the Light framework and presents a series of optimizations—including parallel data download, thread‑queue buffering, integer‑to‑string conversion with fmt, and zlib replacement with czlib—that together achieve up to 58% QPS improvement and notable CPU efficiency gains.

AdvertisingCPU/GPU efficiencyPerformance Optimization
0 likes · 11 min read
Performance Optimization of Advertising Coarse‑Ranking Training on the Light Framework
21CTO
21CTO
Sep 19, 2015 · Artificial Intelligence

Why Distributed Machine Learning Needs More Data Than Speed

The article explains how distributed machine learning evolved from parallel computing to handle massive, long‑tail data sets, discusses the importance of scalability, fault recovery, and data‑parallel algorithms, and reviews frameworks such as MPI, MapReduce, and Pregel for building large‑scale AI systems.

Big DataLDAMPI
0 likes · 24 min read
Why Distributed Machine Learning Needs More Data Than Speed