Tagged articles

GPU communication

10 articles · Page 1 of 1

Jan 23, 2026 · Artificial Intelligence

Unlocking AI Infra: Distributed Inference, PD Separation, TileLang, and Next‑Gen Agent Infrastructure

This article surveys the 2025 AI infrastructure landscape: distributed inference with prefill‑decode (PD) separation, dynamic DOPD scheduling, attention‑FFN disaggregation (AFD), high‑bandwidth cross‑machine communication libraries, the TileLang programming model, RL training‑inference decoupling via SeamlessFlow, and secure, low‑latency agent infrastructure designs for future large‑scale models.

AI Infrastructure · Agent systems · Distributed Inference

27 min read

AI Cyberspace

Nov 19, 2025 · Artificial Intelligence

Why MPI and NCCL Are Critical for Scaling AI Models Across Thousands of GPUs

This article explains how AI model training has evolved from single‑GPU workloads to massive distributed training using MPI for CPU‑centric communication and NCCL for GPU‑centric communication, covering their histories, core concepts, programming interfaces, topology discovery, protocol choices, and performance testing on multi‑GPU clusters.

AI distributed training · GPU communication · High-performance computing

71 min read

Network Intelligence Research Center (NIRC)

Nov 1, 2025 · Artificial Intelligence

AutoCCL: Automatic NCCL Tuning to Boost Distributed Deep Learning Performance

AutoCCL analyzes six key NCCL performance parameters and tunes them automatically during training, using coordinate descent and an online leader‑worker architecture. The approach addresses state‑space explosion and compute‑communication interference, and achieves 1.07–1.32× faster iteration times on models such as Phi‑2, Llama‑3.1‑8B and VGG‑19.

AutoCCL · Coordinate Descent · Distributed Deep Learning

5 min read

Linux Kernel Journey

May 8, 2025 · Artificial Intelligence

How Tencent’s TRMT Tech Delivered a Huge Speedup to DeepSeek’s Large‑Model Network

DeepSeek engineers highlighted Tencent’s open‑source TRMT and DeepEP contributions that boost GPU‑to‑GPU communication by up to 300%, double RoCE performance and add a further 30% gain on InfiniBand, while addressing lane‑utilization and CPU‑control bottlenecks through three targeted optimizations.

DeepEP · DeepSeek · GPU communication

6 min read

Tencent Tech

May 7, 2025 · Artificial Intelligence

How Tencent’s DeepEP Doubles GPU Communication Speed on RoCE Networks

Tencent engineers describe a large speedup in DeepSeek’s open‑source DeepEP communication framework. Their TRMT‑based optimizations (dynamic multi‑QP topology awareness, IBGDA‑driven CPU bypass, and atomic signaling) boost RoCE network throughput by up to 300% and add another 30% gain when applied to InfiniBand, effectively doubling GPU communication performance for large AI models.

AI model training · DeepEP · GPU communication

8 min read

AI Cyberspace

Mar 14, 2025 · Artificial Intelligence

How NCCL Accelerates Distributed AI Training on GPUs

This article explains the origins, core functions, installation steps, and programming examples of the NVIDIA Collective Communications Library (NCCL), detailing its role in multi‑GPU and multi‑node distributed AI training, along with topology discovery, path selection, channel search, and the collective communication operations it provides.

CUDA · GPU communication · MPI

33 min read

AsiaInfo Technology: New Tech Exploration

Oct 23, 2024 · Artificial Intelligence

How to Optimize Distributed Training for Massive AI Models: Strategies & Performance Insights

This article examines the challenges of scaling large AI models across multiple GPUs, explores data, pipeline, and tensor parallelism, analyzes collective communication patterns and data‑channel technologies such as PCIe, NVLink and RDMA, and offers concrete optimization recommendations to boost training efficiency.

GPU communication · collective communication · distributed training

21 min read

Baobao Algorithm Notes

Sep 28, 2024 · Artificial Intelligence

Master Distributed Training for Massive AI Models on Multi‑GPU Clusters

This guide walks you through the fundamentals of distributed training for large AI models, explaining data, model, and pipeline parallelism, GPU communication primitives, and advanced techniques like Megatron 3‑D parallelism and DeepSpeed ZeRO stages, with practical examples and visual illustrations to help you design efficient multi‑GPU training pipelines.

DeepSpeed · GPU communication · Megatron

27 min read

Architects' Tech Alliance

Feb 3, 2019 · Fundamentals

Understanding GPUDirect RDMA: Principles, Implementation, and Performance

This article explains the background of GPU communication, introduces DMA and RDMA fundamentals, describes how GPUDirect RDMA enables direct GPU-to-GPU memory access across machines, and presents performance results showing reduced latency and increased bandwidth for distributed deep‑learning training.

GPU communication · GPUDirect · InfiniBand

7 min read

Architects' Tech Alliance

Feb 1, 2019 · Industry Insights

How GPUDirect P2P Boosts Multi‑GPU Performance and What Limits It in Virtualized Environments

This article explains the background of GPU communication, details NVIDIA's GPUDirect and its Peer‑to‑Peer features, discusses virtualization challenges, and presents performance measurements on an Alibaba Cloud GN5 instance showing latency reduction and near‑linear scaling for deep‑learning workloads.

GPU communication · GPUDirect · NVLink

6 min read