Self-Forcing: Turning Global Video Diffusion into Causal Streaming for Long-Form Generation

This article examines the Wan2.1 video diffusion model, identifies its scalability bottlenecks for long and real‑time video generation, and introduces the Self‑Forcing causal framework together with sequence‑parallel and RoPE optimizations that achieve sub‑second latency and up to 1.5× speed‑up on modern GPUs.

GPU Optimization · causal inference · large video generation

14 min read

Baidu Geek Talk

Dec 24, 2025 · Artificial Intelligence

Context Parallelism Slashes TTFT by 80% for 128K-Token LLMs

The article explains how Baidu’s Baige team integrated a Context Parallelism (CP) strategy into DeepSeek V3.2. It covers the DeepSeek Sparse Attention (DSA) architecture, the limitations of traditional tensor and sequence parallelism, and how CP distributes computation and memory across GPUs to cut time‑to‑first‑token (TTFT) latency by up to 80% for ultra‑long 128K‑token contexts.

Context Parallelism · DeepSeek · LLM

9 min read

Architect

May 26, 2025 · Artificial Intelligence

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.

AI training · Data Parallelism · Expert Parallelism

14 min read

Baobao Algorithm Notes

Nov 4, 2024 · Artificial Intelligence

How DeepSpeed Ulysses Cuts Communication Overhead Compared to Megatron

This article analyzes DeepSpeed Ulysses in detail: it explains the sequence‑parallel workflow, compares its communication volume with Megatron's, and examines how All2All (all‑to‑all) operations and ZeRO‑3 integration affect scalability and efficiency.

All2All · DeepSpeed · Megatron

15 min read

Baobao Algorithm Notes

Oct 30, 2024 · Artificial Intelligence

How Sequence Parallelism Slashes Activation Memory in Megatron Training

This article provides a detailed technical walkthrough of sequence parallelism (SP) for Megatron models, covering tensor parallelism basics, precise activation memory calculations for MLP and attention layers, the SP implementation that splits activations across GPUs, and selective activation recomputation strategies that further reduce memory while preserving training speed.

Megatron · Tensor Parallelism · activation memory

20 min read

Alimama Tech

Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open‑source high‑performance training framework for LLaMA models, offering tensor, pipeline, and sequence parallelism, an overlapped optimizer, and near‑linear scalability, achieving up to 176% speedup on 32 GPUs and robust performance even with limited network bandwidth.

DeepSpeed · Distributed Training · GPU Optimization

10 min read
