Tagged articles

post-training

24 articles

Machine Learning Algorithms & Natural Language Processing

Jun 19, 2026 · Artificial Intelligence

Can Post‑Training Close the Gap to Mythos‑Level AI? Musk Says 9 Months, Tang Says Faster

The article analyzes whether post‑training on GLM‑5.1/5.2 can bridge the gap to the banned Mythos model, citing Musk’s nine‑month claim, Tang’s rebuttal, Mind Lab’s benchmark gains, architectural adaptations, and the high barriers that make post‑training a critical yet scarce capability in China.

Benchmark · GLM-5.2 · IndexCache

0 likes · 9 min read

Machine Heart

Jun 19, 2026 · Artificial Intelligence

Who Is Quietly Building China’s Mythos‑Level AI? Musk Says 9 Months, Tang Says It’s Not That Fast

The article analyzes China’s race to achieve Mythos‑level intelligence, contrasting Musk’s nine‑month claim with Tang’s skepticism, and highlights Mind Lab’s unique post‑training work on GLM‑5.1/5.2 that has already delivered significant benchmark gains, while outlining the technical hurdles and timeline uncertainties.

AI development in China · Benchmark · GLM-5.2

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

Jun 18, 2026 · Artificial Intelligence

Can a 3B Model Rival Claude Opus 4.5? Benchmark Gaps or Aggressive Post‑Training?

VibeThinker‑3B, a 3‑billion‑parameter language model built on Qwen2.5‑Coder‑3B, scores within the range of 671B‑parameter models on benchmarks such as LiveCodeBench, AIME26, IMO‑AnswerBench and GPQA. The article attributes this to two‑stage SFT, multi‑domain reinforcement learning, offline self‑distillation and a claim‑reliability (CLR) evaluator that together push its reasoning ability to the frontier.

Large Language Models · Parameter Efficiency · VibeThinker-3B

0 likes · 9 min read

Network Intelligence Research Center (NIRC)

May 25, 2026 · Artificial Intelligence

What Does On-Policy Distillation Really Teach Large Language Models?

On-Policy Distillation (OPD) trains large language models by letting the student generate its own inference paths while the teacher supplies token‑level guidance, offering denser signals than RL but sometimes failing when teacher and student reasoning diverge, as detailed by THUNLP’s recent study.

Distillation Metrics · Large Language Models · On‑Policy Distillation

0 likes · 8 min read

Baobao Algorithm Notes

May 22, 2026 · Artificial Intelligence

How LiteScale Cuts Wait Times in Large‑Model Post‑Training with Gradient Accumulation

The article examines the bottleneck of synchronous rollout in large‑model post‑training, proposes an asynchronous design using gradient accumulation and a global micro‑batch count to preserve loss equivalence, and introduces LogitsExpress for efficient top‑K knowledge‑distillation communication, all implemented in the lightweight LiteScale framework.

Knowledge Distillation · Large Language Models · asynchronous rollout

0 likes · 16 min read

Machine Heart

May 8, 2026 · Artificial Intelligence

How an 8B Video‑Language Model Beats GPT‑5 and Gemini‑3.1‑Pro at Cinematic Understanding

The CHAI framework introduced by CMU and Harvard defines a structured video‑language annotation scheme, scalable human‑AI oversight, and a post‑training pipeline that enables an 8B open‑source model to outperform closed‑source GPT‑5 and Gemini‑3.1‑Pro on professional cinematic techniques.

Annotation · Multimodal AI · Qwen3-VL

0 likes · 11 min read

Xiaomi Tech

Apr 27, 2026 · Artificial Intelligence

Xiaomi‑Robotics‑0: 20‑Hour Post‑Training Enables Seamless Earphone‑Box Assembly (Open‑Source)

The article details how Xiaomi‑Robotics‑0 achieves precise earphone‑to‑case insertion after only 20 hours of post‑training, outlines the sub‑millimetre precision challenges, presents a three‑part strategy (asynchronous execution, adaptive loss re‑weighting, and a Λ‑shape attention mask with random masking) to avoid the "lazy effect", and releases the full pipeline and code as open source for the robotics community.

Asynchronous Execution · Embodied AI · Xiaomi Robotics

0 likes · 6 min read

Architect

Apr 25, 2026 · Artificial Intelligence

DeepSeek V4: 1M‑Token Context’s Impact on Model, Inference, Cache & Agents

The DeepSeek V4 technical report shows how a 1 million‑token context forces a redesign of attention, KV‑cache, optimizer, quantization and inference budgeting, turning long‑context capability from a costly showcase into a production‑ready feature for agents, search and Chinese professional tasks.

1M context · Agentic Search · Attention optimization

0 likes · 28 min read

AIWalker

Apr 20, 2026 · Artificial Intelligence

How VA‑π Bridges Tokenizers and Autoregressive Generators for Pixel‑Perfect Images

VA‑π introduces a lightweight post‑training framework that uses variational inference and reinforcement learning to align tokenizers with visual autoregressive generators, achieving dramatic quality gains, extreme training efficiency, and robust pixel‑level reconstruction across diverse image generation tasks.

Autoregressive Models · Pixel Alignment · post-training

0 likes · 14 min read

Machine Heart

Apr 19, 2026 · Artificial Intelligence

World Engine: How Post‑Training Is Launching a New Era of Physical AGI

World Engine introduces a post‑training pipeline that combines high‑fidelity 3DGS simulation, hard‑case mining with diffusion generation, and reinforcement‑learning optimization to give autonomous‑driving models true decision‑making ability, surpassing data‑scaling limits and achieving significant safety gains in both industrial simulations and real‑world tests.

Simulation · autonomous driving · hard case mining

0 likes · 11 min read

Data Party THU

Apr 12, 2026 · Artificial Intelligence

What’s Driving the Next Wave of LLM Post‑Training? A Deep Dive into SFT, RLHF, GRPO and Emerging Trends

This article systematically reviews the core post‑training techniques for large language models—including supervised fine‑tuning, RLHF, PPO, GRPO, DPO, RLVR and Agentic RL—explains their evolution, compares their trade‑offs, and highlights the most promising research directions for 2025‑2026.

AI alignment · GRPO · LLM

0 likes · 20 min read

Machine Learning Algorithms & Natural Language Processing

Mar 28, 2026 · Artificial Intelligence

A Comprehensive Guide to LLM Post‑Training: From RLHF and GRPO to Agentic RL

This article systematically explains the post‑training pipeline for large language models, covering supervised fine‑tuning, RLHF, PPO, GRPO, RLVR, DPO and emerging Agentic RL, while illustrating each method with analogies, detailed workflows, tables, and recent research findings.

Agentic RL · DPO · GRPO

0 likes · 24 min read

Machine Learning Algorithms & Natural Language Processing

Mar 15, 2026 · Artificial Intelligence

Is RL Dead in LLM Post-Training? MIT’s RandOpt Challenges Traditional Methods

The MIT‑CSAIL paper introduces RandOpt, a single‑step, gradient‑free, fully parallel post‑training algorithm that adds Gaussian noise to pretrained LLM weights and ensembles the results, matching or surpassing PPO/GRPO performance by exploiting dense "neural thickets" that emerge as model scale grows.

Ensemble · LLM · RandOpt

0 likes · 12 min read

Baobao Algorithm Notes

Mar 3, 2026 · Artificial Intelligence

Boosting LLM Post-Training with RL: Tips for Efficiency and Stability

This article shares practical insights and pitfalls from six months of applying reinforcement learning to fine‑tune large language models, covering exploration efficiency, training stability, model selection, and special considerations for thinking‑oriented agents.

AI · Efficiency · LLM

0 likes · 12 min read

PaperAgent

Jan 8, 2026 · Artificial Intelligence

How SOP Enables Scalable Online Post-Training for Real‑World Robots

The SOP (Scalable Online Post‑training) framework redesigns VLA post‑training from offline, single‑machine, sequential processing to a distributed, parallel online system, allowing robot fleets to continuously learn, share experiences, and scale intelligence while maintaining stability and generalization in complex real‑world environments.

SOP · VLA · distributed training

0 likes · 11 min read

Baobao Algorithm Notes

Nov 11, 2025 · Artificial Intelligence

Why Redesign the Training Stack? Inside Olmo‑Thinking’s Open‑Source RL Journey

This article provides a detailed technical analysis of the Olmo‑Thinking project, covering why a new open‑source LLM was built, the challenges of reinforcement learning at scale, data‑mix optimization, architectural bottlenecks such as missing GQA and QK‑Norm, and the post‑training techniques used to improve reasoning and long‑context capabilities.

RLVR · data selection · open-source models

0 likes · 20 min read

Alibaba Cloud Developer

Jul 31, 2025 · Artificial Intelligence

Why Post‑Training Matters: Scaling Laws, Fine‑Tuning, and RL Strategies for LLMs

This article explores the importance of post‑training for large language models, explains scaling laws for pre‑ and post‑training, details common fine‑tuning methods (full, PEFT, LoRA), outlines alignment techniques such as RLHF, DPO and PPO, and presents practical workflows using Llama 3 and DeepSeek‑R1, while also discussing test‑time reasoning optimizations.

LLM · RLHF · alignment

0 likes · 19 min read

Alibaba Cloud Big Data AI Platform

Jul 16, 2025 · Artificial Intelligence

Master Post-Training: Fine-Tune LLMs with SFT, DPO, and GRPO on Alibaba PAI

This article explains post‑training concepts, compares SFT, DPO, and GRPO fine‑tuning methods, and provides step‑by‑step guidance for using Alibaba Cloud's PAI platform—including Model Gallery and DSW—to fine‑tune large language models with code examples and practical tips.

DPO · GRPO · LLM

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Jun 25, 2025 · Artificial Intelligence

Boost Post‑Training Efficiency with Cosmos‑RL, Ray, and VeRL on Alibaba PAI

This article introduces Alibaba Cloud's PAI platform and demonstrates how open‑source reinforcement‑learning frameworks such as Cosmos‑RL, Ray, and VeRL accelerate post‑training for large language models, offering higher throughput, fault‑tolerance, and seamless integration for AI developers.

AI platform · Open Source Frameworks · distributed training

0 likes · 9 min read

Baobao Algorithm Notes

Mar 21, 2025 · Artificial Intelligence

Unlocking LLM Reasoning: A Deep Dive into Post‑Training Techniques

This article provides a comprehensive technical overview of large language model post‑training, covering fine‑tuning methods (full, parameter‑efficient, LoRA families, prompt tuning), domain‑adaptive tuning, reinforcement‑learning reward modeling, process vs. outcome rewards, inference‑enhancement strategies, dynamic compute allocation, verifier‑augmented reasoning, current challenges, and emerging research directions such as meta‑cognition, physical reasoning, and swarm intelligence.

LLM · meta-cognition · post-training

0 likes · 21 min read

Code Mala Tang

Feb 19, 2025 · Artificial Intelligence

Compute Power’s Role in the AI Race: Insights from Grok 3, DeepSeek & the Post‑Training Era

The article analyzes how massive compute resources drive AI breakthroughs, highlighting Grok 3's top‑tier performance, DeepSeek's efficient engineering under constraints, and the emerging post‑training paradigm that reshapes competition among major AI players.

AI scaling · DeepSeek · Grok 3

0 likes · 7 min read

Baobao Algorithm Notes

Jan 8, 2025 · Artificial Intelligence

Inside Llama 3.1, DeepSeek‑V3, TÜLU 3 & Qwen 2.5: A Deep Dive into Post‑Training Techniques

This article compiles and analyzes the post‑training pipelines of Llama 3.1, DeepSeek‑V3, TÜLU 3 and Qwen 2.5, detailing their data compositions, SFT, reward modeling, DPO, GRPO, RLVR methods, hyper‑parameters, and practical tricks for large‑language‑model alignment.

DPO · DeepSeek-V3 · Llama3.1

0 likes · 22 min read

Baobao Algorithm Notes

Sep 29, 2024 · Artificial Intelligence

Decoding OpenAI o1: Test‑Time Scaling, PRM Search & Inference Strategies

This article analyses the training tricks behind OpenAI's o1 model, explaining test/inference‑time scaling laws, post‑training techniques, process‑supervised reward models (PRM), various inference‑time search methods, data‑collection pipelines, and the trade‑offs between allocating compute to pre‑training versus inference.

LLM Inference · OpenAI o1 · Reward Model

0 likes · 34 min read

NewBeeNLP

Sep 23, 2024 · Artificial Intelligence

Why Post‑Training Is Redefining LLMs: DPO vs PPO, Synthetic Data, and Scaling Strategies

This article analyzes recent post‑training trends in large language models, comparing DPO and PPO, examining the scarcity of open‑source preference data, the iterative training process, the rise of synthetic data pipelines, and emerging methods for improving math and reasoning capabilities.

DPO · LLM · PPO

0 likes · 12 min read
