Tagged articles
11 articles
Baidu Geek Talk
Dec 24, 2025 · Artificial Intelligence

Context Parallelism Slashes TTFT by 80% for 128K-Token LLMs

The article explains how Baidu’s Baige team integrated a Context Parallelism (CP) strategy into DeepSeek V3.2. It covers the DSA (DeepSeek Sparse Attention) architecture, the limitations of traditional tensor and sequence parallelism, and how CP distributes computation and memory across GPUs to cut time‑to‑first‑token (TTFT) latency by up to 80% for ultra‑long 128K‑token contexts.

Context Parallelism · DeepSeek · LLM
9 min read
IT Services Circle
Nov 28, 2025 · Artificial Intelligence

Unlocking AI Model Speed: How Data, Pipeline, Tensor & Expert Parallelism Work

AI model training relies on parallel computing, and this guide explains the four main parallelism strategies—Data Parallelism, Pipeline Parallelism, Tensor Parallelism, and Expert Parallelism—detailing their mechanisms, advantages, drawbacks, and how techniques like ZeRO and mixed 3D parallelism optimize memory and performance for massive models.

3D Parallelism · AI parallelism · Data Parallelism
14 min read
Instant Consumer Technology Team
Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI Performance · Data Parallel · GPU inference
11 min read
Architect
May 26, 2025 · Artificial Intelligence

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.

AI training · Data Parallelism · Expert Parallelism
14 min read
AI Algorithm Path
May 11, 2025 · Artificial Intelligence

How to Parallelize Ultra‑Large Model Training with PyTorch

The article explains the core concepts and trade‑offs of five parallelism techniques—data, tensor, context, pipeline, and expert parallelism—plus the ZeRO optimizer, showing when each method is appropriate for training ultra‑large PyTorch models and providing concrete code snippets and performance considerations.

Context Parallelism · Data Parallelism · Expert Parallelism
21 min read
Zhihu Tech Column
Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.

GPU parallelism · SGLang · Tensor Parallelism
11 min read
Baobao Algorithm Notes
Mar 13, 2025 · Artificial Intelligence

Why EP Outperforms TP for DeepSeek V3/R1 Inference: Cost, Performance, and Reliability

This article analyzes DeepSeek's EP‑based inference architecture for the V3/R1 models and compares it with TP. It details how EP reduces memory and compute overhead and allows larger batch sizes, and it discusses the reliability, scalability, and maintainability challenges that EP introduces for large‑scale deployments.

AI Infrastructure · Expert Parallelism · GPU memory optimization
18 min read
Baobao Algorithm Notes
Oct 30, 2024 · Artificial Intelligence

How Sequence Parallelism Slashes Activation Memory in Megatron Training

This article provides a detailed technical walkthrough of sequence parallelism (SP) for Megatron models, covering tensor parallelism basics, precise activation memory calculations for MLP and attention layers, the SP implementation that splits activations across GPUs, and selective activation recomputation strategies that further reduce memory while preserving training speed.

Megatron · Tensor Parallelism · activation memory
20 min read
Baidu Tech Salon
Oct 17, 2024 · Artificial Intelligence

How to Deploy Yuan 2.0 LLM with PaddleNLP: A Step‑by‑Step Guide

This article explains how the open‑source Yuan 2.0 large language model is fully integrated with Baidu’s PaddleNLP, covering its capabilities, fine‑tuning optimizations, step‑by‑step deployment instructions, interaction examples, and training and fine‑tuning results with loss‑curve visualizations.

AI · Distributed Training · Fine-tuning
10 min read
DeWu Technology
May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations (FlashAttention‑based attention computation, PagedAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism) can accelerate large language model inference by up to 50% for models as large as 70B parameters while cutting deployment costs.

FlashAttention · Inference Acceleration · Mixture of Experts
17 min read
Alimama Tech
Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open‑source high‑performance training framework for LLaMA models, offering tensor, pipeline, and sequence parallelism, an overlapped optimizer, and near‑linear scalability, achieving up to 176% speedup on 32 GPUs and robust performance even with limited network bandwidth.

DeepSpeed · Distributed Training · GPU Optimization
10 min read