Tag: distributed training


Architects' Tech Alliance
May 26, 2025 · Fundamentals

Understanding RDMA, InfiniBand, and RoCEv2 for High‑Performance Distributed Training

The article explains how distributed AI training performance depends on reducing inter‑card communication latency, introduces RDMA technology and its implementations (InfiniBand, RoCEv2, iWARP), compares their latency and scalability against traditional TCP/IP, and outlines the hardware components and trade‑offs of InfiniBand and RoCEv2 networks.

High Performance Computing · InfiniBand · RDMA
12 min read
Baidu Geek Talk
Apr 14, 2025 · Artificial Intelligence

PaddlePaddle Framework 3.0: Five Core Breakthroughs Reshaping Large Model Development

PaddlePaddle Framework 3.0 delivers five breakthroughs—dynamic‑static unified automatic parallelism, integrated training‑inference pipelines, high‑order scientific differentiation, a neural‑network compiler with automatic operator fusion, and streamlined heterogeneous chip adaptation—drastically reducing development effort, boosting training speed, and expanding compatibility for large‑scale AI models.

AI infrastructure · Large Language Models · Model Inference Optimization
23 min read
Python Programming Learning Circle
Apr 3, 2025 · Artificial Intelligence

Accelerating PyTorch Model Training: Techniques, Benchmarks, and Code

This article explains how to dramatically speed up PyTorch model training using code optimizations, mixed‑precision, torch.compile, distributed data parallelism, and DeepSpeed, presenting benchmark results that show up to 11.5× acceleration on multiple GPUs while maintaining high accuracy.

DeepSpeed · GPU · Mixed Precision
6 min read
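As a taste of what that article benchmarks, here is a minimal sketch (assumptions mine, not the article's benchmark code) of mixed-precision training on a toy model; on GPU you would typically use float16 with a GradScaler, while bfloat16 needs no loss scaling:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# model = torch.compile(model)  # optional: kernel fusion via TorchDynamo/Inductor

x = torch.randn(32, 64, device=device)
y = torch.randint(0, 10, (32,), device=device)

# Autocast runs matmuls in a lower-precision dtype while keeping
# numerically sensitive ops (e.g., the loss reduction) in float32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    logits = model(x)
    loss = nn.functional.cross_entropy(logits, y)

loss.backward()
optimizer.step()
```

Distributed data parallelism and DeepSpeed layer on top of exactly this loop by sharding the batch (and, for DeepSpeed ZeRO, the optimizer state) across workers.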
JD Retail Technology
Mar 4, 2025 · Artificial Intelligence

JD Retail End-to-End AI Engine Compatible with GPU and Domestic NPU: Architecture, Optimization, and Applications

JD Retail’s Nine‑Number Algorithm Platform delivers an end‑to‑end AI engine that unifies GPU and domestic NPU resources across a thousand‑card cluster, offering zero‑cost model migration, optimized training and inference pipelines, support for over 40 LLM and multimodal models, and proven business‑level performance that reduces dependence on overseas chips.

AI · GPU · Inference
19 min read
DataFunSummit
Mar 3, 2025 · Artificial Intelligence

DeepSeek Open Source Week: Seven Core Technologies Reshaping Large‑Model Training

The DeepSeek open‑source week introduced seven breakthrough technologies—FlashMLA, DeepGEMM, DeepEP, DualPipe, EPLB, 3FS, and Smallpond—that together overhaul data flow, algorithmic complexity, hardware utilization, MoE communication, and resource balancing, dramatically improving large‑model training efficiency and lowering entry barriers for the AI industry.

AI hardware · DeepSeek · Large Models
17 min read
JD Tech Talk
Mar 3, 2025 · Artificial Intelligence

AI Engine Technology Based on Domestic Chips for JD Retail

This article describes JD Retail's AI engine built on domestic NPU chips, covering challenges, heterogeneous GPU‑NPU scheduling, high‑performance training and inference engines, extensive model support, real‑world deployment cases, and future plans for large‑scale chip clusters and ecosystem development.

AI · GPU · Inference
20 min read
DataFunTalk
Mar 2, 2025 · Artificial Intelligence

Implementing GRPO from Scratch with Distributed Reinforcement Learning on Qwen2.5-1.5B-Instruct

This tutorial explains how to build a distributed reinforcement‑learning pipeline using the GRPO algorithm, covering data preparation, evaluation and reward functions, multi‑GPU DataParallel implementation, and full fine‑tuning of the Qwen2.5‑1.5B‑Instruct model with PyTorch, FlashAttention2 and Weights & Biases.

AI · GRPO · PyTorch
10 min read
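As a hypothetical sketch of GRPO's core idea (names and numbers are mine, not the tutorial's): each prompt gets a group of sampled completions, and each completion's advantage is its reward standardized within that group, so no separate value network is needed:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: per-completion rewards for one prompt's sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled completions scored by a reward function.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Advantages sum to ~0: above-average completions are reinforced,
# below-average ones are penalized.
```

In the distributed setting, each GPU can score its own group locally, which is what makes the algorithm pleasant to parallelize.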
DataFunSummit
Feb 14, 2025 · Artificial Intelligence

Building Large‑Scale Recommendation Systems with Big Data and Large Language Models on Alibaba Cloud AI Platform

This presentation details how Alibaba Cloud's AI platform integrates big‑data pipelines, feature‑store services, and large language model capabilities to construct high‑performance search‑recommendation architectures, covering system design, training and inference optimizations, LLM‑driven use cases, and open‑source RAG tooling.

AI Platform · Big Data · Feature Store
17 min read
DataFunSummit
Jan 28, 2025 · Artificial Intelligence

Few-Shot Learning for Multi-New-Class Scenarios: Challenges, Methodology, and Experimental Evaluation

This article introduces a novel few‑shot learning approach tailored for multi‑new‑class scenarios, discusses its background, problem definition, proposed parallel training framework, hierarchical fine‑tuning method, and presents extensive experiments demonstrating superior performance and computational efficiency.

computer vision · distributed training · few-shot learning
10 min read
DataFunSummit
Jan 21, 2025 · Artificial Intelligence

NVIDIA NeMo Full Stack: End‑to‑End Large Language Model Training, Alignment, and RLHF

This article presents NVIDIA's NeMo technology stack for end‑to‑end large language model (LLM) training, covering the full software pipeline, model alignment with reinforcement learning from human feedback (RLHF), performance optimizations such as model parallelism, FP8, TensorRT‑LLM inference, dynamic load balancing, and future research directions.

GPU optimization · LLM · NeMo
24 min read
Xiaohongshu Tech REDtech
Jan 2, 2025 · Artificial Intelligence

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

Medusa · PPO · PRM
21 min read
DataFunSummit
Dec 30, 2024 · Artificial Intelligence

Colossal-AI: A Scalable Framework for Distributed Training of Large Models

This presentation introduces the challenges of the large‑model era, describes the Colossal‑AI architecture—including N‑dimensional parallelism, heterogeneous storage, and zero‑code experience—shows benchmark results and real‑world use cases, and answers audience questions about its integration with PyTorch and advanced parallel strategies.

AI infrastructure · Colossal-AI · Large Models
11 min read
Kuaishou Large Model
Nov 22, 2024 · Artificial Intelligence

Boost LLM Training on Massive Clusters with DP/TP Overlap and Context Parallelism

This article details a comprehensive set of techniques—including data‑ and tensor‑parallel overlap, context‑parallelism, activation rematerialization, and a performance‑driven cost model—that dramatically improve large‑language‑model training efficiency on ultra‑large GPU clusters while preserving model quality.

Large Language Models · Parallelism · activation recomputation
28 min read
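One of the techniques named above, activation rematerialization, can be illustrated in a few lines (a sketch under my own toy setup, not Kuaishou's code) with `torch.utils.checkpoint`: activations inside the wrapped block are discarded during the forward pass and recomputed during backward, trading extra FLOPs for memory:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))

x = torch.randn(8, 256, requires_grad=True)
# use_reentrant=False selects the recommended checkpointing implementation.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()  # the block's intermediate activations are recomputed here
```

At cluster scale, the art is in deciding *which* blocks to checkpoint, which is where a performance-driven cost model comes in.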
Kuaishou Tech
Nov 21, 2024 · Artificial Intelligence

Best Practices for Training Large Language Models on Ultra‑Large Scale Clusters

This article summarizes the challenges of distributed training for massive language models and presents a suite of solutions—including DP/TP/PP overlap, context parallelism, efficient recomputation, and a performance‑aware cost model—that together boost training throughput by over 30% on large GPU clusters.

GPU clusters · Large Language Models · activation rematerialization
27 min read
Tencent Tech
Nov 19, 2024 · Artificial Intelligence

How Tencent’s Angel Platform Secured the 2024 World Internet Conference Leading Technology Award

Tencent’s Angel machine learning platform, recognized for breakthroughs in trillion‑scale model training, inference, and deployment, won the 2024 World Internet Conference Leading Technology Award, highlighting its self‑developed hardware‑software stack, high‑performance networking, and extensive real‑world AI applications.

AI Platform · Angel · Large Models
6 min read
360 Zhihui Cloud Developer
Oct 11, 2024 · Artificial Intelligence

How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling

This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.

AI infrastructure · GPU Cluster · Kubernetes
22 min read
DataFunSummit
Oct 5, 2024 · Artificial Intelligence

Optimizing TorchRec for Large‑Scale Recommendation Systems on PyTorch

This article details the performance‑focused optimizations applied to TorchRec, PyTorch's large‑scale recommendation system library, including CUDA graph capture, multithreaded kernel launches, pinned memory copies, and input‑distribution refinements that together achieve a 2.25× speedup on MLPerf DLRM‑DCNv2 across 16 DGX H100 nodes.

CUDA Graph · GPU optimization · PyTorch
11 min read
Baidu Geek Talk
Aug 26, 2024 · Artificial Intelligence

RLHF Performance Optimization: PPO Algorithm Acceleration Techniques

The article presents three RLHF‑PPO acceleration techniques—TRT‑LLM‑based text generation speedups, selective activation recomputation with sequence parallelism for dynamic memory reduction, and overlapping pipeline stages for system‑level parallelism—demonstrating a 350% throughput boost on a 10B model using 16 A100 GPUs.

GPU optimization · Large Language Models · PPO optimization
16 min read
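For context on what the accelerated training loop is computing, here is the generic PPO clipped-surrogate loss (the standard algorithm, not the article's optimized TRT‑LLM pipeline; tensors and values are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic bound: take the smaller objective, then negate for descent.
    return -torch.min(unclipped, clipped).mean()

loss = ppo_clip_loss(
    logp_new=torch.tensor([-1.0, -0.5]),
    logp_old=torch.tensor([-1.1, -0.9]),
    advantages=torch.tensor([1.0, -1.0]),
)
```

The generation phase (sampling rollouts to score) dominates wall-clock time in RLHF, which is why the article's first lever is faster text generation rather than a faster loss.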
DataFunSummit
Aug 8, 2024 · Artificial Intelligence

GPU Throughput and Low‑Latency Optimization Practices in JD Advertising

This article presents JD Advertising's technical practices for improving GPU throughput and reducing latency in large‑scale recommendation scenarios, covering system challenges, storage and compute optimizations for training, low‑latency inference techniques, and compiler extensions to handle massive sparse models.

AI · GPU optimization · Recommendation systems
13 min read
360 Smart Cloud
Jul 17, 2024 · Artificial Intelligence

Parallelism and Memory‑Optimization Techniques for Distributed Large‑Scale Transformer Training

This article reviews the principles and practical implementations of data, pipeline, tensor, sequence, and context parallelism together with memory‑saving strategies such as recomputation and ZeRO, and demonstrates how the QLM framework leverages these techniques to accelerate large‑model training and fine‑tuning on multi‑GPU clusters.

GPU · Large Language Models · Megatron-LM
18 min read
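The tensor-parallelism idea that article reviews can be shown in a toy single-process sketch (mine, not the QLM framework's code): a linear layer's weight is split column-wise across two "ranks", each rank computes a partial output, and the shards are concatenated—the same scheme Megatron-LM applies across GPUs, with an all-gather in place of `torch.cat`:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)       # activations, replicated on every rank
W = torch.randn(8, 16)      # full weight of a Linear(8, 16)

W0, W1 = W.chunk(2, dim=1)  # each rank holds half of the output columns
partial0 = x @ W0           # computed on "rank 0"
partial1 = x @ W1           # computed on "rank 1"
y_parallel = torch.cat([partial0, partial1], dim=1)  # the "all-gather"

y_full = x @ W              # single-device reference
assert torch.allclose(y_parallel, y_full)
```

Splitting columns keeps the forward pass communication-free until the gather; a row-wise split instead requires an all-reduce of partial sums, and real frameworks alternate the two to minimize traffic.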