Tagged articles
38 articles
Old Zhang's AI Learning
May 7, 2026 · Artificial Intelligence

How Unsloth and NVIDIA Boost Consumer‑GPU LLM Training by ~25% with Three Simple Optimizations

Unsloth and NVIDIA identified three low‑level bottlenecks in LLM fine‑tuning on consumer GPUs—repeated packed‑sequence metadata construction, serialized copy‑and‑compute during gradient checkpointing, and per‑expert routing overhead in MoE—and applied targeted patches that together deliver roughly a 25% speedup without changing hardware, code, or frameworks.

GPU Optimization · LLM training · Mixture of Experts
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
May 1, 2026 · Artificial Intelligence

What DeepSeek V4’s Multi‑Expert On‑Policy Distillation Reveals About Human Learning

The article analyzes DeepSeek V4’s post‑training pipeline, explains how multi‑expert on‑policy distillation (OPD) differs from traditional teacher‑forcing, compares reverse‑KL and forward‑KL objectives, and uses analogies to human learning to illustrate the benefits and limits of OPD.

DeepSeek-V4 · LLM training · Multi-Expert Models
0 likes · 11 min read
Old Zhang's AI Learning
Apr 23, 2026 · Artificial Intelligence

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

DeepSeek has released TileKernels, a GPU kernel library written in the TileLang DSL. It targets H100/H200/B200 GPUs and claims to approach hardware limits in compute intensity and memory bandwidth, and it offers MoE routing, FP8/FP4 quantization, and dual‑language PyTorch references for deep‑learning engineers.

FP8 quantization · GPU Optimization · LLM training
0 likes · 9 min read
Wu Shixiong's Large Model Academy
Apr 20, 2026 · Artificial Intelligence

How to Build Multi‑Step Reasoning Training Data for Deep Research Agents

Standard QA datasets fall short for deep research tasks because they lack the multi‑step, dynamic reasoning required. This article explains why, outlines four data‑construction techniques (SailorFog‑QA, WebFrontier, WebShaper, and E2HQA), and covers trajectory sampling, filtering, scale considerations, and interview‑ready explanations.

AI Agents · LLM training · Multi-step Reasoning
0 likes · 16 min read
Machine Learning Algorithms & Natural Language Processing
Mar 9, 2026 · Artificial Intelligence

Can Self‑Iterating AI Agents Run on a Single GPU? Karpathy’s Autoresearch Demo

Karpathy’s open‑source “autoresearch” project demonstrates how a compact LLM training environment on a single GPU can let an AI agent autonomously modify code, run five‑minute training experiments, evaluate improvements, and iteratively produce better models, illustrating a new research paradigm where AI conducts experiments while humans design the system.

AI research automation · AutoResearch · Karpathy
0 likes · 6 min read
Data Party THU
Jan 7, 2026 · Artificial Intelligence

Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It

A recent study reveals that the widely used KL regularization in LLM reinforcement learning (RLVR) is mathematically biased, leading to unstable training and poorer generalization, and shows that moving the KL term back to the reward with a simple K1 estimator can boost out‑of‑domain performance by up to 20%.

AI research · KL regularization · LLM training
0 likes · 10 min read
AI Insight Log
Jan 1, 2026 · Artificial Intelligence

Can DeepSeek’s mHC Architecture Break ResNet’s Decade-Long Dominance?

DeepSeek’s new paper “mHC: Manifold‑Constrained Hyper‑Connections” proposes an architecture that replaces traditional residual connections with mathematically constrained hyper‑connections. On a 27B model it shows a modest 6.7% training‑time increase alongside significant stability gains and superior performance on the BBH, DROP, and GSM8K benchmarks.

DeepSeek · LLM training · ResNet
0 likes · 8 min read
Wu Shixiong's Large Model Academy
Nov 15, 2025 · Artificial Intelligence

How to Build Robust Function Call Training Data for LLM Agents

This article explains why function call capabilities in large language model agents require dedicated training, outlines the four core abilities to teach, describes the structure and sources of effective training data, and compares lightweight LoRA fine‑tuning with full supervised fine‑tuning approaches.

Agent Systems · Data Generation · Fine-tuning
0 likes · 11 min read
AI2ML AI to Machine Learning
Nov 3, 2025 · Artificial Intelligence

Smol Training Playbook: Secrets to Building World-Class LLMs

The article details the SmolLM3 3B‑parameter model, its architecture, dual‑mode inference, a three‑stage data‑curation strategy, rigorous ablation methods, preference optimisation (APO/DPO), model merging, and practical training‑stability tricks, offering a comprehensive guide for building high‑performing large language models.

APO · LLM training · context scaling
0 likes · 13 min read
Sohu Tech Products
Sep 10, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

This article explains the GRPO algorithm, an improvement over PPO for large language model training that eliminates the value network, uses group‑relative advantage estimation, and offers flexible supervision, resulting in higher efficiency, stability, and performance on tasks such as mathematical reasoning.

AI Optimization · GRPO · LLM training
0 likes · 16 min read
Ops Development & AI Practice
Jul 29, 2025 · Artificial Intelligence

How Ray Transforms Distributed Training for Large Language Models

In the era of data‑driven AI, Ray offers an open‑source unified compute framework that abstracts distributed system complexity, enabling developers to seamlessly scale Python code from a laptop to large GPU clusters, and provides the Ray AI Runtime (AIR) with libraries such as Ray Data, Train, Tune, and Serve to accelerate LLM training, hyper‑parameter tuning, and model serving.

AI Runtime · LLM training · Python
0 likes · 10 min read
Baobao Algorithm Notes
Jul 17, 2025 · Artificial Intelligence

How QK-Clip Tames MaxLogit Explosions in Trillion‑Parameter LLMs

The article introduces QK-Clip, a lightweight per‑head weight‑clipping technique that uses the MaxLogit signal to prevent uncontrolled logit growth in massive LLMs, explains its design, compares it with prior methods, and shows that it stabilizes training without harming model performance.

Attention stability · LLM training · MaxLogit
0 likes · 15 min read
Architect
Mar 16, 2025 · Artificial Intelligence

Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

This article walks through the complete lifecycle of building a small large‑language model, covering token‑level inference, pre‑training, post‑training steps such as supervised fine‑tuning, reward‑model creation, and reinforcement‑learning methods like DPO, PPO and GRPO, culminating in a practical 0.5B model fine‑tuned for chain‑of‑thought reasoning.

GRPO · LLM training · Reward Modeling
0 likes · 22 min read
Architect
Feb 25, 2025 · Artificial Intelligence

DeepSeek R1: Multi‑Stage Reinforcement Learning, Reward Modeling, and Distillation for a High‑Performance LLM

DeepSeek R1 builds on the DeepSeek V3 base model using a multi‑stage reinforcement learning pipeline—including GRPO optimization, rule‑based reward modeling, supervised fine‑tuning, language‑consistency rewards, rejection sampling, and distillation—to produce a high‑performing, aligned LLM capable of accurate reasoning.

DeepSeek · LLM training · Reward Modeling
0 likes · 24 min read
Architect
Feb 24, 2025 · Artificial Intelligence

Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

The article details the development, architectural evolution, and practical challenges of MoBA—a sparse attention framework inspired by Mixture‑of‑Experts that scales LLM context length to 10 M tokens, supports seamless switching between full and sparse attention, and is now released as a minimal open‑source solution.

AI Architecture · Context Parallel · LLM training
0 likes · 13 min read
AI Algorithm Path
Feb 18, 2025 · Artificial Intelligence

Build DeepSeek‑R1 from Scratch: Complete Training Process with Code Walkthrough

This article provides a step‑by‑step, code‑first guide to reproducing DeepSeek‑R1 from the ground up, covering model selection, dataset preparation, custom reward functions, GRPO reinforcement‑learning training, supervised fine‑tuning, reasoning‑oriented RL, rejection sampling, and model distillation.

DeepSeek-R1 · LLM training · Python
0 likes · 48 min read
Alibaba Cloud Big Data AI Platform
Jan 17, 2025 · Artificial Intelligence

How BladeDISC++ Cuts Memory Peaks for Dynamic‑Shape Deep Learning Models

This article explains the challenges of dynamic‑shape deep learning workloads and introduces BladeDISC++, an AI compiler that uses symbolic shape graphs, operation scheduling, and just‑in‑time auto‑rematerialization to dramatically reduce GPU memory peaks while maintaining training throughput.

AI compiler · BladeDISC++ · LLM training
0 likes · 16 min read
Linux Kernel Journey
Dec 22, 2024 · Artificial Intelligence

Understanding GPU Monitoring: Utilization Metrics and Failure Scenarios

This article systematically reviews GPU monitoring for large‑scale AI training, covering MFU/HFU definitions, key DCGM metrics, NVLink bandwidth, common failure codes such as Xid and SXid, experimental insights on T4 and H100 GPUs, and practical case studies for diagnosing and mitigating performance drops.

DCGM · GPU failures · GPU monitoring
0 likes · 26 min read
Baobao Algorithm Notes
Sep 28, 2024 · Artificial Intelligence

Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization

This article provides a thorough, yet concise, overview of Llama 3’s training pipeline, data handling, model architecture, scaling laws, post‑training techniques like SFT and DPO, and inference optimizations such as KV‑Cache, GQA, PagedAttention, and FP8 quantization, highlighting practical insights and benchmark results.

DPO · Inference · KV cache
0 likes · 32 min read
Alibaba Cloud Big Data AI Platform
Sep 12, 2024 · Artificial Intelligence

How Pai‑Megatron‑Patch Boosts LLM Training with Offloading, FlashAttention‑3, and Communication Overlap

This article introduces Pai‑Megatron‑Patch, a suite of tools built on NVIDIA Megatron‑LM that accelerates large language model training through dense and MoE model support, high‑precision HuggingFace↔MCore weight conversion, CPU offloading for optimizers and activations, FlashAttention‑3, and communication‑compute overlapping. It also provides detailed experimental results and command‑line usage examples.

CPU offloading · Communication Overlap · Distributed Optimizer
0 likes · 22 min read
NewBeeNLP
Jul 31, 2024 · Artificial Intelligence

Training 7B–13B LLMs: Practical Tips, Hyperparameters, and Scaling Challenges

The article shares hands‑on experience training 7‑ and 13‑billion‑parameter language models, covering essential hyper‑parameters, hardware requirements, data quality considerations, open dataset resources, and the systemic difficulties that arise when scaling to trillion‑parameter models.

LLM training · hyperparameters · large language models
0 likes · 8 min read
NewBeeNLP
Jun 12, 2024 · Artificial Intelligence

Beyond Cosine Decay: Fixed LR + Quick Decay Beats Traditional Schedules in LLM Training

The article analyzes why the traditional cosine decay learning‑rate schedule hinders continued training of large language models and shows that fixed‑learning‑rate strategies such as Warmup‑Stable‑Decay, Cooldown, SWA, and Schedule‑Free Optimizer can match or surpass cosine performance while being more friendly to fine‑tuning.

LLM training · SFO · SWA
0 likes · 7 min read
NewBeeNLP
Apr 11, 2024 · Artificial Intelligence

How Karpathy Built a 1,000‑Line C LLM Trainer Without Any Deep‑Learning Framework

Andrej Karpathy released llm.c, a pure C/CUDA implementation that trains GPT‑2‑style models in about 1,000 lines of code. The article covers the manual forward/backward passes, memory allocation tricks, SIMD CPU acceleration, CUDA porting, and migration tutorials, compares the project to PyTorch, and discusses broader LLM OS implications.

C programming · CUDA · GPT
0 likes · 6 min read
Architects' Tech Alliance
Apr 6, 2024 · Artificial Intelligence

How ByteDance Scaled LLM Training to Over 10,000 GPUs: Inside the MegaScale System

The article analyzes ByteDance and Peking University's MegaScale system that enables efficient, stable training of large language models on clusters exceeding ten thousand GPUs, detailing algorithmic tweaks, 3D parallel communication overlap, operator optimizations, data‑pipeline improvements, network tuning, and fault‑tolerance mechanisms that together achieve a 55.2% MFU on a 175B model.

Distributed Systems · GPU clusters · LLM training
0 likes · 15 min read
OPPO Kernel Craftsman
Mar 22, 2024 · Artificial Intelligence

InternLM Model Fine-Tuning Tutorial with XTuner: Chat Format and Practical Implementation Guide

This tutorial walks through fine‑tuning Shanghai AI Lab’s open‑source InternLM models with XTuner, explaining chat‑format conventions, loading and inference (including multimodal InternLM‑XComposer), dataset preparation, configuration sections, DeepSpeed acceleration, and memory‑efficient QLoRA details for 7‑B‑parameter chat models.

Chat Format · DeepSpeed · Fine-tuning
0 likes · 22 min read
Baobao Algorithm Notes
Mar 21, 2024 · Artificial Intelligence

Can the CaR Method Achieve Better LLM Performance with Only 1.4% of Training Data?

This article explains how the CaR (Clustering and Ranking) approach evaluates data quality with a scoring model and selects diverse samples via PCA‑reduced sentence embeddings and K‑Means clustering, achieving comparable or superior large‑model performance while using just 1.96% of the original dataset.

CaR method · Data Quality · LLM training
0 likes · 8 min read
Alibaba Cloud Big Data AI Platform
Feb 28, 2024 · Artificial Intelligence

How PAI‑TorchAcc Supercharges OLMo LLM Training with Up to 1.64× Speedup

PAI‑TorchAcc, Alibaba Cloud’s PyTorch accelerator, integrates the open‑source OLMo large language model and delivers up to 1.64× faster training on OLMo‑1B and 1.52× on OLMo‑7B by leveraging graph capture, distributed, compute, communication, and memory optimizations, with detailed usage steps and performance analysis.

LLM training · OLMo · PAI‑TorchAcc
0 likes · 7 min read
Architects' Tech Alliance
Dec 24, 2023 · Artificial Intelligence

Overview of Popular GPU/TPU Cluster Networking Technologies for LLM Training

This article examines the main GPU/TPU cluster networking options—including NVLink, InfiniBand, RoCE Ethernet Fabric, and DDC full‑schedule networks—explaining their latency, loss‑less transmission, congestion control, cost, scalability, and suitability for large‑scale LLM training workloads.

GPU networking · InfiniBand · LLM training
0 likes · 18 min read
Alibaba Cloud Big Data AI Platform
Sep 13, 2023 · Artificial Intelligence

How Pai‑Megatron‑Patch Accelerates Large Language Model Training on Alibaba Cloud

This article introduces Pai‑Megatron‑Patch, an open‑source tool from Alibaba Cloud that streamlines large language model (LLM) training, weight conversion, FP8 mixed‑precision acceleration, and reinforcement‑learning workflows, providing detailed architecture, key features, code examples, and step‑by‑step usage instructions.

FP8 · LLM training · Megatron
0 likes · 19 min read
Baobao Algorithm Notes
Aug 21, 2023 · Artificial Intelligence

Mastering LLM Training: From Tokenizer Design to Instruction Tuning

This article provides a comprehensive, step‑by‑step guide to building large language models, covering tokenizer creation, vocabulary expansion, pre‑training strategies, dataset cleaning, instruction‑tuning techniques, and evaluation metrics such as C‑Eval and GPT‑4 based scoring.

LLM training
0 likes · 20 min read
21CTO
Apr 13, 2023 · Artificial Intelligence

How Microsoft’s Open‑Source DeepSpeed‑Chat Accelerates LLM Training by 15×

Microsoft has open‑sourced DeepSpeed‑Chat, a DeepSpeed‑based framework that simplifies end‑to‑end training and inference of ChatGPT‑style large language models, offering RLHF support, up to 15× speed‑up, massive cost reductions, and scalable performance on Azure for models ranging from billions to hundreds of billions of parameters.

AI · DeepSpeed · LLM training
0 likes · 7 min read