Tagged articles

tensor parallelism

13 articles · Page 1 of 1
DeepHub IMBA
DeepHub IMBA
Jun 23, 2026 · Artificial Intelligence

Parallel Training of 100B‑Parameter Models: Intra‑Node Tensor Parallelism and Inter‑Node Data Parallelism

Training 100‑billion‑parameter Transformers is limited by GPU memory rather than compute, requiring a mix of tensor parallelism within nodes and data parallelism across nodes, along with pipeline parallelism, gradient accumulation, and careful framework choices to balance memory, bandwidth, and compute overheads.

GPU memorydata parallelismdistributed training
0 likes · 14 min read
Parallel Training of 100B‑Parameter Models: Intra‑Node Tensor Parallelism and Inter‑Node Data Parallelism
Raymond Ops
Raymond Ops
Jun 22, 2026 · Artificial Intelligence

Elastic Deployment and GPU Scheduling for Large‑Model Inference with vLLM on Kubernetes

This article presents a detailed, step‑by‑step analysis of deploying the high‑performance vLLM inference engine on Kubernetes, covering GPU memory management, tensor parallelism, quantization choices, continuous batching, and automated scaling with HPA/KEDA to achieve low latency and high throughput for large language models.

DockerGPU schedulingLLM Inference
0 likes · 49 min read
Elastic Deployment and GPU Scheduling for Large‑Model Inference with vLLM on Kubernetes
Baidu Geek Talk
Baidu Geek Talk
Dec 24, 2025 · Artificial Intelligence

Context Parallelism Slashes TTFT by 80% for 128K-Token LLMs

The article explains how Baidu’s Baige team integrated a Context Parallelism strategy into DeepSeek V3.2, detailing the DSA architecture, the limitations of traditional tensor and sequence parallelism, and how CP distributes computation and memory across GPUs to achieve up to an 80 % reduction in token‑to‑first‑token latency for ultra‑long 128K‑token contexts.

Context ParallelismDeepSeekLLM
0 likes · 9 min read
Context Parallelism Slashes TTFT by 80% for 128K-Token LLMs
IT Services Circle
IT Services Circle
Nov 28, 2025 · Artificial Intelligence

Unlocking AI Model Speed: How Data, Pipeline, Tensor & Expert Parallelism Work

AI model training relies on parallel computing, and this guide explains the four main parallelism strategies—Data Parallelism, Pipeline Parallelism, Tensor Parallelism, and Expert Parallelism—detailing their mechanisms, advantages, drawbacks, and how techniques like ZeRO and mixed 3D parallelism optimize memory and performance for massive models.

3D ParallelismAI parallelismExpert Parallelism
0 likes · 14 min read
Unlocking AI Model Speed: How Data, Pipeline, Tensor & Expert Parallelism Work
Instant Consumer Technology Team
Instant Consumer Technology Team
Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI performanceData ParallelGPU inference
0 likes · 11 min read
Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained
Architect
Architect
May 26, 2025 · Artificial Intelligence

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.

AI trainingExpert Parallelismdata parallelism
0 likes · 14 min read
Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism
AI Algorithm Path
AI Algorithm Path
May 11, 2025 · Artificial Intelligence

How to Parallelize Ultra‑Large Model Training with PyTorch

The article explains the core concepts and trade‑offs of five parallelism techniques—data, tensor, context, pipeline, and expert parallelism—plus the ZeRO optimizer, showing when each method is appropriate for training ultra‑large PyTorch models and providing concrete code snippets and performance considerations.

Context ParallelismExpert ParallelismLarge‑Scale Training
0 likes · 21 min read
How to Parallelize Ultra‑Large Model Training with PyTorch
Zhihu Tech Column
Zhihu Tech Column
Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.

GPU parallelismSGLangZhiLight
0 likes · 11 min read
Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations
Baobao Algorithm Notes
Baobao Algorithm Notes
Mar 13, 2025 · Artificial Intelligence

Why EP Outperforms TP for Deepseek V3/R1 Inference: Cost, Performance, and Reliability

This article analyzes Deepseek's EP‑based inference architecture for V3/R1 models, comparing it with TP, detailing how EP reduces memory and compute overhead, boosts batch size, cuts GPU memory usage, and introduces reliability, scalability, and maintainability challenges for large‑scale deployments.

AI InfrastructureExpert ParallelismGPU memory optimization
0 likes · 18 min read
Why EP Outperforms TP for Deepseek V3/R1 Inference: Cost, Performance, and Reliability
Baobao Algorithm Notes
Baobao Algorithm Notes
Oct 30, 2024 · Artificial Intelligence

How Sequence Parallelism Slashes Activation Memory in Megatron Training

This article provides a detailed technical walkthrough of sequence parallelism (SP) for Megatron models, covering tensor parallelism basics, precise activation memory calculations for MLP and attention layers, the SP implementation that splits activations across GPUs, and selective activation recomputation strategies that further reduce memory while preserving training speed.

Megatronactivation memoryselective recomputation
0 likes · 20 min read
How Sequence Parallelism Slashes Activation Memory in Megatron Training
Baidu Tech Salon
Baidu Tech Salon
Oct 17, 2024 · Artificial Intelligence

How to Deploy Yuan 2.0 LLM with PaddleNLP: A Step‑by‑Step Guide

This article explains how the open‑source Yuan 2.0 large language model is fully integrated with Baidu’s PaddleNLP, covering its capabilities, fine‑tuning optimizations, step‑by‑step deployment instructions, interaction examples, and training/finetuning results with loss‑curve visualizations.

AILarge Language ModelPaddleNLP
0 likes · 10 min read
How to Deploy Yuan 2.0 LLM with PaddleNLP: A Step‑by‑Step Guide
DeWu Technology
DeWu Technology
May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations—FlashAttention‑based attention computation, PageAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism—can accelerate large language model inference by up to 50% for models as large as 70 B parameters while cutting deployment costs.

FlashAttentionMixture of ExpertsPageAttention
0 likes · 17 min read
Accelerating Large Language Model Inference: Techniques and Framework Recommendations
Alimama Tech
Alimama Tech
Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open‑source high‑performance training framework for LLaMA models, offering tensor, pipeline, and sequence parallelism, an overlapped optimizer, and near‑linear scalability, achieving up to 176% speedup on 32 GPUs and robust performance even with limited network bandwidth.

DeepSpeedGPU OptimizationLLaMA
0 likes · 10 min read
Megatron-LLaMA: High-Performance Large Language Model Training Framework