Tagged articles

DeepSpeed

24 articles · Page 1 of 1

Jun 18, 2026 · Artificial Intelligence

How to Pick the Right Parallelism for 7B‑70B Models: DP, TP, PP, ZeRO & FSDP

This guide walks engineers through the memory, compute and bandwidth limits of training 7B‑70B models and compares data parallel (DP/DDP), tensor parallel (TP), pipeline parallel (PP), ZeRO stages and FSDP. It shows how to calculate GPU memory, estimate communication overhead, configure each strategy, and avoid common pitfalls, so you can decide which parallelism to use on multi‑GPU or multi‑node clusters.

DeepSpeed · FSDP · ZeRO

0 likes · 24 min read

Qborfy AI

Mar 24, 2026 · Artificial Intelligence

Why Full Fine‑Tuning Beats LoRA: When and How to Update Every Model Parameter

This article explains full fine‑tuning—updating all parameters of a pretrained model—to achieve the highest task performance, compares it with LoRA and prompt tuning, shows when it is appropriate, provides a step‑by‑step Hugging Face implementation, memory‑saving tricks, common pitfalls, and practical takeaways.

DeepSpeed · Full Fine-tuning · GPU memory

0 likes · 9 min read

AI2ML AI to Machine Learning

Nov 4, 2025 · Artificial Intelligence

Common Debugging Signals for Large Language Models

This article outlines the end‑to‑end workflow for large‑model training, highlights typical debugging challenges such as memory OOM, performance bottlenecks, and gradient issues, and provides concrete strategies, tools (DeepSpeed, Megatron, Torchtitan, veScale) and best‑practice checklists to help engineers diagnose and resolve problems efficiently.

DeepSpeed · LLM · Megatron

0 likes · 12 min read

Fun with Large Models

Aug 30, 2025 · Artificial Intelligence

How to Fine‑Tune Large Models on Multiple Nodes and GPUs – A Must‑Know Interview Answer

This article explains how to fine‑tune large models across multiple machines and GPUs by covering data, model, tensor, and pipeline parallelism, hybrid 3D parallel strategies, engineering details such as NCCL, PyTorch Distributed, DeepSpeed, fault‑tolerance, checkpointing, and the ZeRO optimizer stages that dramatically reduce memory usage.

Data Parallel · DeepSpeed · Megatron-LLM

0 likes · 8 min read

Network Intelligence Research Center (NIRC)

Jul 13, 2025 · Artificial Intelligence

Getting Started with Hugging Face Transformers Trainer

This guide walks through the Hugging Face Transformers Trainer library, explaining its core features such as configurable training loops, mixed‑precision and gradient‑accumulation support, seamless distributed training via Accelerate and DeepSpeed, and provides a step‑by‑step example of converting a simple PyTorch CNN model to use Trainer.

Accelerate · DeepSpeed · Hugging Face

0 likes · 7 min read

Sohu Tech Products

Jun 18, 2025 · Artificial Intelligence

Master LLaMA Factory Fine‑Tuning: Key Parameter Settings & Memory Optimization

This tutorial walks through LLaMA‑Factory fine‑tuning by explaining how to choose learning rate, epochs, batch size, cutoff length, LoRA rank, and validation split, and shows how to estimate and reduce GPU memory usage with techniques like gradient accumulation, liger_kernel, and DeepSpeed.

AI · DeepSpeed · LLaMA

0 likes · 25 min read

Python Programming Learning Circle

Apr 3, 2025 · Artificial Intelligence

Accelerating PyTorch Model Training: Techniques, Benchmarks, and Code

This article explains how to dramatically speed up PyTorch model training using code optimizations, mixed‑precision, torch.compile, distributed data parallelism, and DeepSpeed, presenting benchmark results that show up to 11.5× acceleration on multiple GPUs while maintaining high accuracy.

DeepSpeed · GPU · PyTorch

0 likes · 6 min read

DataFunSummit

Jan 6, 2025 · Artificial Intelligence

Efficient Large‑Model Training with LLaMA‑Factory: Overview, Techniques, and Applications

This article explains how to train large language models efficiently using LLaMA‑Factory, covering low‑resource training challenges, memory‑saving optimizations for parameters, gradients and activations, framework features, quick‑start guidance, performance tuning, real‑world case studies, and a detailed Q&A.

AI · DeepSpeed · LLaMA-Factory

0 likes · 10 min read

Baobao Algorithm Notes

Nov 4, 2024 · Artificial Intelligence

How DeepSpeed Ulysses Cuts Communication Overhead Compared to Megatron

This article provides a detailed technical analysis of DeepSpeed Ulysses, explaining its sequence‑parallel workflow, comparing its communication volume with Megatron, and examining how All2All operations and ZeRO‑3 integration affect scalability and efficiency.

All2All · DeepSpeed · Megatron

0 likes · 15 min read

Baobao Algorithm Notes

Sep 28, 2024 · Artificial Intelligence

Master Distributed Training for Massive AI Models on Multi‑GPU Clusters

This guide walks you through the fundamentals of distributed training for large AI models, explaining data, model, and pipeline parallelism, GPU communication primitives, and advanced techniques like Megatron 3‑D parallelism and DeepSpeed ZeRO stages, with practical examples and visual illustrations to help you design efficient multi‑GPU training pipelines.

DeepSpeed · GPU communication · Megatron

0 likes · 27 min read

DataFunTalk

Jul 8, 2024 · Artificial Intelligence

Challenges and Techniques for Distributed Training of Large Language Models

This article discusses the historical background of large language models, major challenges such as massive compute and memory demands, and the technical ecosystem that enables efficient distributed training, including data parallelism, pipeline parallelism, and optimization strategies like DeepSpeed and 1F1B.

AI Infrastructure · DeepSpeed · pipeline parallelism

0 likes · 22 min read

360 Tech Engineering

Apr 15, 2024 · Artificial Intelligence

Fine‑Tuning Large Language Models: A Practical Guide Using Qwen‑14B on the 360AI Platform

This article explains the concept, motivations, and step‑by‑step workflow for fine‑tuning large language models—specifically Qwen‑14B—covering data preparation, training commands with DeepSpeed, hyper‑parameter settings, evaluation, and deployment via FastChat, all illustrated with code snippets and configuration details.

AI · DeepSpeed · FastChat

0 likes · 10 min read

360 Smart Cloud

Apr 15, 2024 · Artificial Intelligence

Fine‑Tuning Qwen‑14B Large Language Model: A Complete Guide

This article provides a comprehensive tutorial on fine‑tuning the Qwen‑14B large language model, covering the motivation, fine‑tuning concepts, step‑by‑step workflow, required code, DeepSpeed training parameters, testing scripts, and deployment using FastChat and the 360AI platform.

AI model deployment · DeepSpeed · FastChat

0 likes · 9 min read

DataFunSummit

Mar 31, 2024 · Artificial Intelligence

Challenges and Techniques in Distributed Training of Large Language Models

This article reviews the rapid development of large language models since 2019, outlines the historical background, identifies key challenges such as massive compute demand, memory constraints, and system complexity, and then details distributed training technologies—including data parallelism, pipeline parallelism, and advanced optimization strategies—while also discussing future research directions and answering common questions.

AI Infrastructure · DeepSpeed · data parallelism

0 likes · 23 min read

OPPO Kernel Craftsman

Mar 22, 2024 · Artificial Intelligence

InternLM Model Fine-Tuning Tutorial with XTuner: Chat Format and Practical Implementation Guide

This tutorial walks through fine‑tuning Shanghai AI Lab’s open‑source InternLM models with XTuner, explaining chat‑format conventions, loading and inference (including multimodal InternLM‑XComposer), dataset preparation, configuration sections, DeepSpeed acceleration, and memory‑efficient QLoRA details for 7B‑parameter chat models.

Chat Format · DeepSpeed · HuggingFace

0 likes · 22 min read

Alibaba Cloud Big Data AI Platform

Jan 12, 2024 · Artificial Intelligence

How to Fine‑Tune and Deploy Mixtral 8x7B MOE Model on Alibaba Cloud PAI

This guide walks AI developers through downloading the Mixtral 8x7B MoE large language model, fine‑tuning it with Swift or DeepSpeed on Alibaba Cloud PAI‑DSW, testing inference with Transformers, and finally deploying the tuned model as an online service using PAI‑EAS.

Alibaba Cloud · DeepSpeed · Mixtral

0 likes · 13 min read

Alimama Tech

Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open‑source high‑performance training framework for LLaMA models, offering tensor, pipeline, and sequence parallelism, an overlapped optimizer, and near‑linear scalability, achieving up to 176% speedup on 32 GPUs and robust performance even with limited network bandwidth.

DeepSpeed · GPU Optimization · LLaMA

0 likes · 10 min read

Alibaba Cloud Native

Jun 25, 2023 · Artificial Intelligence

Accelerate Large‑Scale LLM Training on Alibaba Cloud ACK with DeepSpeed and Arena

This guide explains how to leverage Alibaba Cloud Container Service ACK's AI suite and DeepSpeed to efficiently run distributed large‑language‑model training on Kubernetes, covering prerequisites, configuration, command‑line deployment, monitoring with TensorBoard, and performance‑optimizing techniques.

AI · Alibaba Cloud · Arena

0 likes · 11 min read

Alibaba Cloud Native

Jun 24, 2023 · Cloud Native

How to Deploy Distributed LLM Inference with DeepSpeed on Alibaba Cloud ACK

This guide walks through deploying a Bloom 7B1 large language model for distributed inference on Alibaba Cloud Container Service (ACK) using DeepSpeed, Arena, and Kubernetes, covering environment setup, model configuration, service launch, verification, and Ingress exposure.

ACK · Arena · Cloud Native

0 likes · 14 min read

IT Architects Alliance

Apr 17, 2023 · Artificial Intelligence

DeepSpeed Chat: An Open‑Source Framework for Scalable RLHF Training of ChatGPT‑Style Models

DeepSpeed Chat provides a fast, affordable, and scalable system for end‑to‑end RLHF training of ChatGPT‑style large language models, offering one‑click scripts, detailed performance benchmarks across GPU configurations, support for many model families, and a flexible API for custom RLHF pipelines.

ChatGPT · DeepSpeed · GPU training

0 likes · 14 min read

Programmer DD

Apr 14, 2023 · Artificial Intelligence

How DeepSpeed-Chat Accelerates ChatGPT‑Style Model Training by 15×

Microsoft open‑sourced DeepSpeed‑Chat, a toolkit that streamlines the end‑to‑end training and inference of ChatGPT‑like large language models using RLHF, delivering up to fifteen‑fold speedups and dramatically lower costs, even on a single GPU.

ChatGPT · DeepSpeed · Efficient Training

0 likes · 8 min read

21CTO

Apr 13, 2023 · Artificial Intelligence

How Microsoft’s Open‑Source DeepSpeed‑Chat Accelerates LLM Training by 15×

Microsoft has open‑sourced DeepSpeed‑Chat, a DeepSpeed‑based framework that simplifies end‑to‑end training and inference of ChatGPT‑style large language models, offering RLHF support, up to 15× speed‑up, massive cost reductions, and scalable performance on Azure for models ranging from billions to hundreds of billions of parameters.

AI · DeepSpeed · LLM training

0 likes · 7 min read

Architects' Tech Alliance

Aug 31, 2022 · Artificial Intelligence

Performance Evaluation of Transformer Models on the Inspur NF5488A5 GPU Server

This article presents a detailed benchmark of four Transformer models of varying sizes trained on the high‑end Inspur NF5488A5 GPU server, compares its NVSwitch‑based interconnect with a PCIe‑based system, and analyzes the impact of model scale, tensor parallelism, and hardware bandwidth on training efficiency.

DeepSpeed · GPU server · Megatron-LM

0 likes · 12 min read

DataFunSummit

Apr 19, 2022 · Artificial Intelligence

DeepSpeed‑MoE: End‑to‑End Training and Inference Solutions for Mixture‑of‑Experts Models

This article reviews DeepSpeed‑MoE, an end‑to‑end system that introduces new MoE architectures, model‑compression techniques, and highly optimized inference pipelines, detailing its motivation, design of PR‑MoE (Pyramid‑MoE and Residual‑MoE), distributed parallel strategies, communication and kernel optimizations, and performance gains over dense baselines.

AI · DeepSpeed · Inference Optimization

0 likes · 11 min read
