Tagged articles
10 articles
Page 1 of 1
Tencent Technical Engineering
Tencent Technical Engineering
Jan 23, 2026 · Artificial Intelligence

Unlocking AI Infra: Distributed Inference, PD Separation, TileLang, and Next‑Gen Agent Infrastructure

This article surveys the 2025 AI infrastructure landscape, covering distributed inference with PD‑separation, dynamic DOPD scheduling, AFD attention‑FFN disaggregation, high‑bandwidth cross‑machine communication libraries, the TileLang programming model, RL train‑inference decoupling via SeamlessFlow, and secure, low‑latency agent infra designs for future large‑scale models.

AI InfrastructureAgent SystemsDistributed inference
0 likes · 27 min read
Unlocking AI Infra: Distributed Inference, PD Separation, TileLang, and Next‑Gen Agent Infrastructure
AI Cyberspace
AI Cyberspace
Nov 19, 2025 · Artificial Intelligence

Why MPI and NCCL Are Critical for Scaling AI Models Across Thousands of GPUs

This article explains how AI model training has evolved from single‑GPU workloads to massive distributed training using MPI for CPU‑centric communication and NCCL for GPU‑centric communication, covering their histories, core concepts, programming interfaces, topology discovery, protocol choices, and performance testing on multi‑GPU clusters.

AI distributed trainingGPU communicationHigh‑performance computing
0 likes · 71 min read
Why MPI and NCCL Are Critical for Scaling AI Models Across Thousands of GPUs
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Nov 1, 2025 · Artificial Intelligence

AutoCCL: Automatic NCCL Tuning to Boost Distributed Deep Learning Performance

AutoCCL analyzes NCCL’s six key performance parameters, uses coordinate‑descent and an online leader‑worker architecture to automatically adjust them during training, overcoming state‑space explosion and compute‑communication interference, and achieves 1.07‑1.32× faster iteration times on models such as Phi‑2, Llama‑3.1‑8B and VGG‑19.

AutoCCLCoordinate DescentDistributed Deep Learning
0 likes · 5 min read
AutoCCL: Automatic NCCL Tuning to Boost Distributed Deep Learning Performance
Linux Kernel Journey
Linux Kernel Journey
May 8, 2025 · Artificial Intelligence

How Tencent’s TRMT Tech Delivered a Huge Speedup to DeepSeek’s Large‑Model Network

DeepSeek engineers highlighted Tencent’s open‑source TRMT and DeepEP contributions that boost GPU‑to‑GPU communication by up to 300%, double RoCE performance and add a further 30% gain on InfiniBand, while addressing lane‑utilization and CPU‑control bottlenecks through three targeted optimizations.

DeepEPDeepSeekGPU communication
0 likes · 6 min read
How Tencent’s TRMT Tech Delivered a Huge Speedup to DeepSeek’s Large‑Model Network
Tencent Tech
Tencent Tech
May 7, 2025 · Artificial Intelligence

How Tencent’s DeepEP Doubles GPU Communication Speed on RoCE Networks

Tencent engineers highlighted a massive speedup in DeepSeek’s open‑source DeepEP communication framework, revealing how their TRMT‑based optimizations—dynamic multi‑QP topology awareness, IBGDA‑driven CPU‑bypass, and atomic signaling—boost RoCE network throughput up to 300% and add another 30% gain when applied to InfiniBand, effectively doubling GPU communication performance for large AI models.

AI model trainingDeepEPGPU communication
0 likes · 8 min read
How Tencent’s DeepEP Doubles GPU Communication Speed on RoCE Networks
AI Cyberspace
AI Cyberspace
Mar 14, 2025 · Artificial Intelligence

How NCCL Accelerates Distributed AI Training on GPUs

This article explains the origins, core functions, installation steps, and programming examples of NVIDIA’s Collective Communication Library (NCCL), detailing its role in multi‑GPU and multi‑node AI distributed training, topology discovery, path selection, channel search, and various collective communication operations.

CUDAGPU communicationMPI
0 likes · 33 min read
How NCCL Accelerates Distributed AI Training on GPUs
AsiaInfo Technology: New Tech Exploration
AsiaInfo Technology: New Tech Exploration
Oct 23, 2024 · Artificial Intelligence

How to Optimize Distributed Training for Massive AI Models: Strategies & Performance Insights

This article examines the challenges of scaling large AI models across multiple GPUs, explores data, pipeline, and tensor parallelism, analyzes collective communication patterns and data‑channel technologies such as PCIe, NVLink and RDMA, and offers concrete optimization recommendations to boost training efficiency.

Distributed TrainingGPU communicationcollective communication
0 likes · 21 min read
How to Optimize Distributed Training for Massive AI Models: Strategies & Performance Insights
Baobao Algorithm Notes
Baobao Algorithm Notes
Sep 28, 2024 · Artificial Intelligence

Master Distributed Training for Massive AI Models on Multi‑GPU Clusters

This guide walks you through the fundamentals of distributed training for large AI models, explaining data, model, and pipeline parallelism, GPU communication primitives, and advanced techniques like Megatron 3‑D parallelism and DeepSpeed ZeRO stages, with practical examples and visual illustrations to help you design efficient multi‑GPU training pipelines.

Data ParallelismDeepSpeedDistributed Training
0 likes · 27 min read
Master Distributed Training for Massive AI Models on Multi‑GPU Clusters
Architects' Tech Alliance
Architects' Tech Alliance
Feb 3, 2019 · Fundamentals

Understanding GPUDirect RDMA: Principles, Implementation, and Performance

This article explains the background of GPU communication, introduces DMA and RDMA fundamentals, describes how GPUDirect RDMA enables direct GPU-to-GPU memory access across machines, and presents performance results showing reduced latency and increased bandwidth for distributed deep‑learning training.

Deep LearningGPU communicationGPUDirect
0 likes · 7 min read
Understanding GPUDirect RDMA: Principles, Implementation, and Performance
Architects' Tech Alliance
Architects' Tech Alliance
Feb 1, 2019 · Industry Insights

How GPUDirect P2P Boosts Multi‑GPU Performance and What Limits It in Virtualized Environments

This article explains the background of GPU communication, details NVIDIA's GPUDirect and its Peer‑to‑Peer features, discusses virtualization challenges, and presents performance measurements on an Alibaba Cloud GN5 instance showing latency reduction and near‑linear scaling for deep‑learning workloads.

Deep LearningGPU communicationGPUDirect
0 likes · 6 min read
How GPUDirect P2P Boosts Multi‑GPU Performance and What Limits It in Virtualized Environments