Tagged articles

658 articles

Jun 25, 2025 · Artificial Intelligence

Introducing ROLL: A Scalable, User‑Friendly RL Framework for Large‑Scale LLM Training

ROLL is an open‑source reinforcement‑learning framework designed for large language model post‑training that combines multi‑task RL, agentic support, flexible algorithm configuration, elastic resource scheduling, and rich observability, delivering significant accuracy gains across benchmarks while remaining easy to use for researchers, product developers, and infrastructure engineers.

AI Framework · RLHF · Scalable Training

0 likes · 11 min read

DataFunTalk

Jun 21, 2025 · Artificial Intelligence

Why AI Gets Overconfident: Bias, Hallucinations, and Reinforcement Learning Solutions

This talk explores how large AI models become overconfident, leading to bias and hallucinations, examines adversarial examples in vision and language, explains why data and algorithms cause these issues, and shows how reinforcement learning can teach models to admit uncertainty and align with human values.

AI Alignment · AI Safety · Bias

0 likes · 19 min read

Kuaishou Large Model

Jun 20, 2025 · Artificial Intelligence

How OneRec Revolutionizes Short-Video Recommendations with End-to-End Generative AI

OneRec, an end-to-end generative recommendation system from Kuaishou, uses an encoder-decoder architecture, reward-based preference alignment, and reinforcement learning to dramatically improve video recommendation efficiency, boosting user engagement and reducing operational costs while exhibiting scaling-law behavior similar to that of large language models.

Kuaishou · efficiency · generative AI

0 likes · 18 min read

Kuaishou Tech

Jun 20, 2025 · Artificial Intelligence

How OneRec Redefines Recommendation with End‑to‑End Generative Modeling and RL Alignment

The OneRec system from Kuaishou replaces traditional cascade recommendation pipelines with an encoder‑decoder architecture, leverages reward‑based preference alignment via reinforcement learning, achieves ten‑fold FLOPs gains, cuts operational costs by 90%, and delivers significant user‑engagement improvements across short‑video and local‑service scenarios.

Generative Modeling · Kuaishou · OneRec

0 likes · 17 min read

Xiaohongshu Tech REDtech

Jun 19, 2025 · Artificial Intelligence

Can Adaptive Chain‑of‑Thought Learning Halve LLM Thinking Time?

The article introduces the Think When You Need (TWYN) method, a reinforcement‑learning approach that dynamically adapts chain‑of‑thought length, dramatically cuts redundant token generation in large language models, and maintains or improves accuracy across diverse reasoning benchmarks.

adaptive inference · chain-of-thought · efficiency

0 likes · 9 min read

DataFunTalk

Jun 17, 2025 · Artificial Intelligence

Kimi-Dev-72B Sets New Open‑Source SOTA on SWE‑bench Verified (60.4% Score)

Kimi-Dev-72B, an open-source 72-billion-parameter code model from Moonshot AI, achieved a record 60.4% on the SWE-bench Verified benchmark, surpassing larger models. It combines BugFixer and TestWriter dual roles, extensive mid-stage training on billion-scale GitHub data, and reinforcement-learning-driven self-play; the code is available on Hugging Face and GitHub.

AI · SWE-bench · Software Engineering

0 likes · 7 min read

Fighter's World

Jun 14, 2025 · Artificial Intelligence

How Can LLMs Learn to “Think” in Complex Industry Scenarios?

The article analyzes how large language models can acquire true reasoning abilities for hard-to-score industry tasks by combining Chain-of-Thought prompting with reinforcement learning, addressing vague reward signals, reward hacking, and faithfulness, and proposing a toolbox of reward engineering, synthetic data, hierarchical RL and multi-agent collaboration.

LLM · Reward Modeling · chain-of-thought

0 likes · 22 min read

Fun with Large Models

Jun 12, 2025 · Artificial Intelligence

Implement GRPO to Give LLMs Reasoning Ability with Qwen2.5‑0.5B

This article explains the GRPO reinforcement‑learning algorithm, shows its core idea of internal group competition without a separate evaluator model, and provides a complete, step‑by‑step code walkthrough—including environment setup, dataset preparation, reward‑function design, training configuration, and evaluation—using the Qwen2.5‑0.5B‑Instruct model on the GSM8K math dataset.

GRPO · GSM8K · Qwen2.5

0 likes · 23 min read
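
As a rough illustration of the "internal group competition" idea in the summary above, the sketch below scores each sampled answer against the mean and standard deviation of its own group, so no separate evaluator (critic) model is needed. The function name, the example rewards, and the choice of a population standard deviation are assumptions made for illustration; this is not code from the article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, which is what lets the group itself act as the baseline.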

Kuaishou Tech

Jun 4, 2025 · Artificial Intelligence

KwaiCoder-AutoThink-preview: An Automatic‑Thinking Large Model Enhanced with Step‑SRPO Reinforcement Learning

The KwaiPilot team released the KwaiCoder‑AutoThink‑preview model, which introduces a novel automatic‑thinking training paradigm and a process‑supervised reinforcement‑learning method called Step‑SRPO, enabling the model to dynamically switch between thinking and non‑thinking modes, reduce inference cost, and achieve up to 20‑point gains on code and math benchmarks while handling large‑scale codebases.

AI research · Code Generation · Model Optimization

0 likes · 12 min read

Baobao Algorithm Notes

Jun 3, 2025 · Artificial Intelligence

Can 1K Fine‑Tuning Replace 100K RL Steps? Insights from Re‑distillation Research

An extensive analysis shows that a 1K‑sample fine‑tuning stage can replicate the generalization gains of thousands of reinforcement‑learning steps, explains the compressibility of RL, introduces a sample‑effect theory, and demonstrates that re‑distillation and small‑scale SFT dramatically improve LLM performance.

Re-distillation · Sample Effect · large language models

0 likes · 23 min read

AI Frontier Lectures

May 31, 2025 · Artificial Intelligence

Why Embodied Intelligence Is Exploding and What It Means for the Future

The article analyzes the recent surge in embodied intelligence, examines why physical agents matter despite advances in large language models, outlines common failure modes, discusses key research decisions such as 2D versus 3D perception and tactile sensing, and explores the roles of imitation learning, VLA, and reinforcement learning in shaping the field.

Robotics · VLA · Vision

0 likes · 24 min read

AI Frontier Lectures

May 30, 2025 · Artificial Intelligence

Can Diffusion Chains Unlock More Creative Reasoning in Large Language Models?

Recent work from Westlake University's MAPLE Lab introduces a diffusion‑based “Divergent Thought Chain” that treats each intermediate denoising step of a diffusion language model as a reasoning step, using outcome‑based reinforcement learning to optimize non‑linear token generation and achieving state‑of‑the‑art performance on math and code tasks.

Code Generation · chain-of-thought · diffusion language models

0 likes · 14 min read

Alibaba Cloud Developer

May 28, 2025 · Artificial Intelligence

Unlocking LLM Fine‑Tuning: From Architecture to LoRA, DPO and Deployment

This article provides a comprehensive guide to large language model fine‑tuning, covering model architecture, parameter and memory calculations, prompt engineering, data construction, LoRA and PEFT techniques, reinforcement learning methods such as DPO, and practical deployment workflows on internal platforms.

Fine‑Tuning · LLM · LoRA

0 likes · 21 min read

JD Cloud Developers

May 27, 2025 · Artificial Intelligence

How JD’s Young AI Engineers Tackle Real-World Model Challenges

Young JD algorithm engineers share how they solve tough AI problems—from optimizing large‑model training and reward‑model design for ad image generation, to building LLM‑based query expansion, agent evaluation, and model pruning with FFT and RDP—illustrating practical breakthroughs and personal growth in cutting‑edge AI research.

AI · Model Pruning · Reward Modeling

0 likes · 15 min read

AI Algorithm Path

May 27, 2025 · Artificial Intelligence

Reinforcement Learning Tutorial 8: Building State Feature Representations for Objective Optimization

This tutorial explains how to construct state feature vectors for reinforcement‑learning value‑function approximation, covering linear, polynomial, Fourier, and radial‑basis representations, as well as state aggregation techniques such as coarse coding and tile coding, and discusses non‑parametric approaches like kernel methods.

feature engineering · fourier basis · function approximation

0 likes · 16 min read
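
To make one of the representations named in the summary above concrete, here is a minimal sketch of a Fourier cosine basis for a state whose components are scaled to [0, 1]. The function name and the example state are hypothetical, and the tutorial itself may build its features differently.

```python
import math
from itertools import product


def fourier_features(state, order):
    """Return cos(pi * c . s) for every integer vector c in {0..order}^d.

    Assumes each component of `state` has already been scaled to [0, 1].
    """
    coefficient_vectors = product(range(order + 1), repeat=len(state))
    return [
        math.cos(math.pi * sum(c * s for c, s in zip(coeffs, state)))
        for coeffs in coefficient_vectors
    ]


# A 2-D state with order 2 yields (2 + 1) ** 2 = 9 features
features = fourier_features([0.5, 0.25], order=2)
print(len(features))
```

A linear value estimate is then the dot product of a weight vector with these features; the first feature is always 1 and acts as a bias term.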

AIWalker

May 26, 2025 · Artificial Intelligence

VisionReasoner: RL‑Unified Model Beats YOLO‑World Detection, Segmentation, Counting

VisionReasoner presents a reinforcement‑learning‑driven unified framework that simultaneously tackles detection, segmentation, and counting tasks, employing a novel multi‑target cognition strategy and efficient Hungarian‑based matching, and demonstrates substantial gains—29.1% on COCO detection, 22.1% on ReasonSeg, and 15.3% on CountBench—using only 7,000 training samples.

Segmentation · VisionReasoner · Visual-Language Models

0 likes · 20 min read

JD Tech

May 26, 2025 · Artificial Intelligence

Solving Technical Challenges at JD Retail: Multi‑Reward Models, LLM‑Based Query Expansion, Model Pruning, and Reinforcement Learning

This article details how JD Retail's young algorithm engineers tackled a series of AI engineering problems—including advertising image quality assessment with multi‑reward models, large‑language‑model‑driven query expansion, FFT‑and‑RDP‑based model pruning, and agent‑centric reinforcement learning—while sharing practical growth insights and code snippets.

AI · Computer Vision · Model Optimization

0 likes · 15 min read

Alibaba Cloud Developer

May 26, 2025 · Artificial Intelligence

How Multi‑Agent Planning Boosts Copilot 3.0 with DeepSeek R1 GRPO Training

This article examines Copilot 3.0’s planning module, explains how DeepSeek R1’s GRPO reinforcement‑learning pipeline enables flexible multi‑agent orchestration, addresses the limitations of Copilot 2.0, and presents experimental results that show a 61% reduction in reasoning length and a 9% relative gain in accuracy.

AI · Model Training · Multi-Agent

0 likes · 14 min read

AI Algorithm Path

May 25, 2025 · Artificial Intelligence

Reinforcement Learning Tutorial 7: Introducing Value Function Approximation Methods

This article explains why tabular reinforcement‑learning methods scale poorly, introduces supervised‑learning‑based value‑function approximation using a parameterized vector w, discusses loss design, stochastic‑gradient updates, bootstrapping, semi‑gradient techniques, and linear function approximation, and summarizes practical implications.

gradient Monte Carlo · linear function approximation · reinforcement learning

0 likes · 13 min read

IT Services Circle

May 25, 2025 · Artificial Intelligence

DeepSeek Core Technologies and Model Innovations: DeepSeek‑V3 and DeepSeek‑R1 Technical Overview

The article provides a detailed technical overview of DeepSeek's flagship large language models, DeepSeek‑V3 and DeepSeek‑R1, describing their MoE architecture, training frameworks, reinforcement‑learning based fine‑tuning, inference optimizations, and the broader impact of these innovations on the AI landscape while also promoting related books and resources.

AI · DeepSeek · Mixture of Experts

0 likes · 10 min read

AI Algorithm Path

May 24, 2025 · Artificial Intelligence

How N-step Temporal-Difference Methods Extend TD Learning in Reinforcement AI

This tutorial explains how n-step temporal‑difference (TD) algorithms generalize the one‑step TD and Monte‑Carlo methods, presents the n‑step return update rule, walks through a three‑step TD example, shows how Sarsa and Q‑learning can be extended, and discusses how to choose the optimal n value for a given problem.

Monte Carlo · Q-Learning · algorithm analysis

0 likes · 9 min read
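
The n-step return mentioned in the summary above can be sketched in a few lines: sum the first n discounted rewards, then bootstrap from the value estimate of the state reached after n steps. The function names and numbers below are illustrative assumptions rather than the tutorial's own code.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """G = R_1 + gamma*R_2 + ... + gamma^(n-1)*R_n + gamma^n * V(S_n)."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total + (gamma ** len(rewards)) * bootstrap_value


def n_step_td_update(value, rewards, bootstrap_value, alpha, gamma):
    """Move the current estimate a fraction alpha toward the n-step target."""
    target = n_step_return(rewards, bootstrap_value, gamma)
    return value + alpha * (target - value)


# Three-step example: rewards 1, 1, 1, then bootstrap from V(S_3) = 10
print(n_step_return([1, 1, 1], bootstrap_value=10.0, gamma=0.5))
```

With a single reward this reduces to the one-step TD target, and with the full episode and a bootstrap value of zero it becomes the Monte Carlo return.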

AI Algorithm Path

May 23, 2025 · Artificial Intelligence

Understanding Temporal‑Difference Algorithms in Reinforcement Learning

This tutorial explains temporal‑difference (TD) learning, compares it with dynamic programming and Monte‑Carlo methods, walks through concrete soccer‑match examples, shows one‑step TD versus constant‑α Monte‑Carlo updates, discusses convergence, bias, and introduces popular TD variants such as Sarsa, Q‑learning, Expected Sarsa and double learning.

Monte Carlo · Q-Learning · TD learning

0 likes · 18 min read

AI Algorithm Path

May 22, 2025 · Artificial Intelligence

Monte Carlo Policy Improvement in RL: Epsilon‑Greedy, On‑Policy vs Off‑Policy, and Incremental Updates

This tutorial explains how Monte Carlo methods are enhanced in reinforcement learning through epsilon‑greedy and epsilon‑soft policies, Monte Carlo control, a Blackjack Q‑function example, the distinction between on‑policy and off‑policy learning, importance sampling, and efficient incremental update techniques.

Epsilon-Greedy · Importance Sampling · Monte Carlo

0 likes · 14 min read
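
For readers unfamiliar with the epsilon-greedy policy named in the summary above, the following sketch shows the action probabilities it assigns: every action receives epsilon / n, and the greedy action receives the remaining 1 - epsilon on top. The Q-values and function names are hypothetical.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Probability of each action under an epsilon-greedy policy."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n       # exploration mass, shared by all actions
    probs[greedy] += 1.0 - epsilon  # exploitation mass for the best action
    return probs


def sample_action(q_values, epsilon, rng=random):
    """Draw one action index according to the epsilon-greedy probabilities."""
    weights = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=weights)[0]


print(epsilon_greedy_probs([0.1, 0.9, 0.3], epsilon=0.3))
```

Because every action keeps a probability of at least epsilon / n, this policy is also epsilon-soft, which is what keeps all state-action pairs reachable during Monte Carlo control.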

AIWalker

May 22, 2025 · Artificial Intelligence

VisionReasoner: RL‑Unified System Beats YOLO‑World on Detection, Segmentation, Counting

VisionReasoner introduces a reinforcement‑learning‑driven unified framework that simultaneously handles detection, segmentation, and counting tasks within a single model, achieving 29.1% higher COCO detection AP, 22.1% better ReasonSeg segmentation, and 15.3% improvement on CountBench, while requiring only 7,000 training samples and offering efficient multi‑target matching via batch computation and the Hungarian algorithm.

LVLM · Object Counting · VisionReasoner

0 likes · 19 min read

JD Tech Talk

May 22, 2025 · Artificial Intelligence

From Academic Research to Industrial Anti‑Fraud: Leveraging LLMs, Reinforcement Learning, and Model Distillation for Advertising Risk Detection

The article recounts Xiaoting’s journey from a PhD research background to leading JD.com’s ad‑fraud detection, detailing how large language models, reinforcement learning, and model distillation were applied to identify hidden address codes, reduce false‑positive rates to 0.3%, and balance accuracy with real‑time performance in a high‑traffic e‑commerce environment.

AI · Ad Fraud · Advertising

0 likes · 11 min read

JD Retail Technology

May 22, 2025 · Industry Insights

Cracking Hidden Ad Fraud: JD’s AI‑Driven Anti‑Cheat System Explained

This article recounts the journey of a JD PhD trainee who transformed academic research on anomaly detection into a production‑grade, LLM‑enhanced anti‑fraud system that identifies concealed address codes in CPS ads, detailing model design, LoRA fine‑tuning, reinforcement learning, distillation, cost‑aware deployment, and lessons learned for scalable ad risk management.

ad fraud detection · industry AI · large language model

0 likes · 12 min read

AI Algorithm Path

May 21, 2025 · Artificial Intelligence

Understanding Monte Carlo Algorithms for Reinforcement Learning with a Blackjack Case Study

This article explains Monte Carlo methods for reinforcement learning, compares model‑free and model‑based approaches, details V‑ and Q‑function estimation using a Blackjack example, and discusses exploration‑exploitation trade‑offs and practical advantages of MC algorithms.

Blackjack · Model-free · Monte Carlo

0 likes · 13 min read

AI Algorithm Path

May 19, 2025 · Artificial Intelligence

Understanding Policy Evaluation and Improvement in Reinforcement Learning

This article explains how to solve Bellman equations, use iterative policy‑evaluation methods, apply the policy‑improvement theorem, and combine both steps in policy iteration, value iteration, and asynchronous variants, illustrated with a 5‑state example and a 4×4 gridworld.

Bellman equation · GridWorld · generalized policy iteration

0 likes · 15 min read
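
The iterative policy evaluation mentioned in the summary above can be sketched on a 4×4 gridworld. The setup here is an assumption, not something the summary confirms: two terminal corner states, a reward of -1 per step, an equiprobable random policy, and moves into a wall leaving the agent in place.

```python
def evaluate_random_policy(size=4, gamma=1.0, theta=1e-8):
    """Sweep the Bellman expectation update until values change by < theta."""
    terminals = {(0, 0), (size - 1, size - 1)}
    values = {(r, c): 0.0 for r in range(size) for c in range(size)}
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    while True:
        delta = 0.0
        for state in values:
            if state in terminals:
                continue  # terminal values stay at zero
            new_value = 0.0
            for dr, dc in moves:
                nr, nc = state[0] + dr, state[1] + dc
                if not (0 <= nr < size and 0 <= nc < size):
                    nr, nc = state  # bumping a wall leaves the agent in place
                # each of the four actions has probability 0.25, reward -1
                new_value += 0.25 * (-1.0 + gamma * values[(nr, nc)])
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value  # in-place update within the sweep
        if delta < theta:
            return values


state_values = evaluate_random_policy()
print(round(state_values[(0, 1)], 2))
```

Acting greedily with respect to the resulting values is the policy-improvement step; alternating evaluation and improvement gives policy iteration.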

Amap Tech

May 19, 2025 · Artificial Intelligence

Group Policy Gradient: Direct Objective Optimization for Faster Reinforcement Learning

The article introduces Group Policy Gradient (GPG), a reinforcement‑learning framework that eliminates surrogate loss functions and critic models, directly optimizes the original objective, reduces bias and variance, and achieves state‑of‑the‑art performance on both single‑modal and multimodal tasks.

AI research · LLM fine-tuning · bias reduction

0 likes · 7 min read

AI Algorithm Path

May 18, 2025 · Artificial Intelligence

Reinforcement Learning Tutorial Part 1: Core Concepts Explained

This article introduces the fundamental concepts of reinforcement learning, covering the agent‑environment interaction, key terminology, reward structures, task types, policies, value functions, the Bellman equations, and how optimal strategies are derived and approximated in practice.

Bellman equation · Markov Decision Process · Optimal Policy

0 likes · 13 min read

Kuaishou Tech

May 14, 2025 · Artificial Intelligence

StableReinforce and R1-Reward: Enhancing Multimodal Reward Models with Reinforcement Learning

This article presents StableReinforce and the R1-Reward model, demonstrating how reinforcement learning techniques can stabilize training and significantly improve the performance of multimodal reward models for large language models across several benchmarks.

AI · LLM · R1-Reward

0 likes · 15 min read

Kuaishou Tech

May 13, 2025 · Artificial Intelligence

How KuaiMod Uses Multimodal AI to Revolutionize Short‑Video Content Quality

This article analyzes KuaiMod, a multimodal large‑model solution developed by Kuaishou for short‑video content quality assessment, detailing its benchmark dataset, chain‑of‑thought data construction, offline SFT + DPO training, online reinforcement‑learning updates, evaluation results, and large‑scale deployment impact.

Benchmark · KuaiMod · Multimodal AI

0 likes · 19 min read

AI Frontier Lectures

May 13, 2025 · Artificial Intelligence

How T2I‑R1 Boosts Text‑to‑Image Generation with Dual‑Level CoT Reasoning

Recent large language models have shown strong reasoning abilities, and this work extends chain‑of‑thought reasoning to autoregressive image generation by introducing T2I‑R1, a dual‑level (Semantic‑CoT and Token‑CoT) framework trained with reinforcement learning that unifies high‑level planning and low‑level token generation, achieving state‑of‑the‑art results.

generative AI · reinforcement learning · semantic planning

0 likes · 7 min read

AI Frontier Lectures

May 13, 2025 · Artificial Intelligence

How Diffusion Policy is Transforming Vision‑Based Robot Motion Learning

This article provides a comprehensive, step‑by‑step analysis of Diffusion Policy for robot visuomotor control, covering its motivation, task characteristics, model design, dataset preparation, training pipeline, inference procedure, experimental results, and open research questions.

Robotics · diffusion models · machine learning

0 likes · 63 min read

Tencent Technical Engineering

May 12, 2025 · Artificial Intelligence

Comprehensive Summary and Expansion of Andrej Karpathy’s 7‑Hour LLM Lecture

This article provides a detailed Chinese‑to‑English summary of Andrej Karpathy’s 7‑hour LLM tutorial, covering chat process analysis, tokenization, pre‑training data pipelines, model architecture, training strategies, post‑training fine‑tuning, reinforcement learning, chain‑of‑thought reasoning, and current industry applications.

AI · LLM · Model architecture

0 likes · 25 min read

JD Retail Technology

May 7, 2025 · Artificial Intelligence

Solving Technical Challenges with Large AI Models at JD Retail: Reward Modeling, Query Expansion, and Model Pruning

JD Retail’s engineering team tackles hard AI problems by replacing a monolithic reward model with specialized small models for ad‑image generation, deploying an LLM‑driven query‑expansion pipeline that lifts conversion rates, and pruning text‑to‑image transformers using FFT and RDP to boost throughput 40% without loss, while building comprehensive evaluation tools and a semantic smart‑assistant.

AI · Model Pruning · Reward Modeling

0 likes · 14 min read

AIWalker

May 6, 2025 · Artificial Intelligence

SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL

SimpleAR demonstrates that a vanilla autoregressive model with only 0.5 B parameters can generate high‑fidelity 1024×1024 images, covering pretraining, supervised fine‑tuning, and reinforcement learning, achieving competitive GenEval (0.59) and DPG‑Bench (79.66) scores while reducing inference time to about 14 seconds with vLLM and KV‑cache optimizations.

Benchmark · Supervised Fine‑Tuning · autoregressive

0 likes · 14 min read

DevOps

May 5, 2025 · Artificial Intelligence

DeepSeek Releases Math‑Specialized Large Model V2 and ProverBench Evaluation Suite

DeepSeek has quietly open‑sourced a new mathematics‑focused large language model, DeepSeek‑Prover‑V2 (available in 671B and 7B variants), achieving 88.9% on MiniF2F and strong results on PutnamBench, alongside the high‑quality ProverBench dataset and a novel recursive theorem‑proving pipeline.

AI · DeepSeek · Mathematical Reasoning

0 likes · 4 min read

Architect

May 5, 2025 · Artificial Intelligence

How Agentic RAG‑R1 Turns Retrieval‑Augmented Generation into an Autonomous AI Agent

Agentic RAG‑R1, an open‑source project from Peking University, combines Retrieval‑Augmented Generation with an agentic AI loop, introduces the GRPO reinforcement‑learning optimizer, supports LoRA‑based fine‑tuning, quantization and multimodal tool calls, and demonstrates significant accuracy gains on the MedQA benchmark across both Chinese and English test sets.

Agentic AI · LLM Tool Use · Retrieval Augmented Generation

0 likes · 8 min read

AI Frontier Lectures

May 5, 2025 · Industry Insights

What Will Large Language Models Look Like in the Next Five Years? A Deep Dive into Trends and Challenges

The article reviews five years of AI model evolution, analyzes current scaling and reinforcement‑learning trends, and forecasts architectural, mathematical, and infrastructure directions for large language models through 2030, highlighting potential breakthroughs and the risks of over‑reliance on benchmarks.

AI trends · Industry analysis · Model Scaling

0 likes · 22 min read

AI Algorithm Path

May 3, 2025 · Artificial Intelligence

DeepSeek Prover V2: Pioneering the Next Era of AI‑Driven Formal Math Reasoning

DeepSeek‑Prover‑V2, an open‑source LLM specialized for Lean 4, bridges intuitive high‑level reasoning and strict formal verification through sub‑goal decomposition, dual operation modes, and a novel cold‑start data pipeline, achieving state‑of‑the‑art results on MiniF2F, PutnamBench and CombiBench while highlighting trade‑offs in inference cost and model scalability.

AI mathematics · DeepSeek Prover V2 · LLM

0 likes · 18 min read

Baobao Algorithm Notes

May 2, 2025 · Artificial Intelligence

Do Reinforcement Learning Techniques Really Boost LLM Reasoning? A Deep Dive into Recent Models

This article analyzes whether reinforcement learning enhances large language model reasoning, compares findings from DeepSeek-Math, a Tsinghua‑Shanghai Jiao‑Tong paper, and Qwen3, and outlines practical training pipelines—including Seed‑Thinking‑v1.5, DeepSeek‑R1, Kimi‑K1.5, and Qwen3—that aim to endow LLMs with robust reasoning capabilities.

LLM · Model Training · artificial intelligence

0 likes · 12 min read

Mafengwo Technology

Apr 30, 2025 · Artificial Intelligence

How MaFengWo’s mfw-32B Travel LLM Outperforms DeepSeek‑R1 in Speed and Accuracy

The article details the development, training, and evaluation of MaFengWo's 32‑billion‑parameter travel large language model (mfw‑32B), highlighting its superior itinerary planning, personalized demand capture, budget management, and resource efficiency compared to DeepSeek‑R1, and describing the SFT and reinforcement‑learning stages that enabled these gains.

AI Optimization · LoRA · Model Evaluation

0 likes · 14 min read

AIWalker

Apr 28, 2025 · Artificial Intelligence

SimpleAR: Autoregressive Visual Generation at 1024×1024 Using Only 0.5B Parameters

SimpleAR is a minimalist autoregressive visual generation framework that, with only 0.5 B parameters, achieves competitive 1024×1024 image synthesis through a three‑stage pipeline of large‑scale pretraining, supervised fine‑tuning, and GRPO‑based reinforcement learning, and demonstrates significant inference speedups using KV‑cache, vLLM, and speculative decoding.

Benchmark · Inference Acceleration · autoregressive generation

0 likes · 14 min read

DataFunTalk

Apr 25, 2025 · Artificial Intelligence

Does Reinforcement Learning Really Expand Reasoning Capacity in Large Language Models? Insights from Recent Empirical Study

Recent empirical research by Tsinghua’s LeapLab and Shanghai Jiao Tong University reveals that reinforcement learning with verifiable rewards (RLVR) improves sampling efficiency but does not extend the fundamental reasoning abilities of large language models beyond their base capabilities, as demonstrated across mathematics, code, and visual reasoning benchmarks.

AI research · RLVR · large language models

0 likes · 12 min read

AntTech

Apr 24, 2025 · Artificial Intelligence

Key Takeaways from Ant Group and Tsinghua’s Presentations on the AReaL Reinforcement Learning Framework and AWorld Multi‑Agent Framework at ICLR 2025

At ICLR 2025 in Singapore, Ant Group and Tsinghua University showcased the open‑source reinforcement‑learning platform AReaL and the multi‑agent system AWorld, highlighting their recent breakthroughs, system design challenges, performance results on the GAIA benchmark, and upcoming development plans.

AI frameworks · ICLR2025 · multi-agent systems

0 likes · 7 min read

Kuaishou Tech

Apr 24, 2025 · Artificial Intelligence

Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning

The article introduces SRPO, a two‑stage history‑resampling reinforcement‑learning framework that systematically tackles common GRPO training issues and achieves state‑of‑the‑art performance on both math and code benchmarks with far fewer training steps, while also revealing emergent self‑reflection behaviors in large language models.

LLM optimization · SRPO · cross-domain training

0 likes · 12 min read

AI Frontier Lectures

Apr 24, 2025 · Artificial Intelligence

How d1 Boosts Reasoning in Diffusion LLMs with Reinforcement Learning

Researchers from UCLA and Meta AI introduce d1, a two‑stage post‑training framework that combines supervised fine‑tuning and a novel diffu‑GRPO reinforcement‑learning algorithm to enable efficient reasoning in masked diffusion large language models, achieving state‑of‑the‑art performance on multiple math and logic benchmarks.

AI · d1 · diffu-GRPO

0 likes · 9 min read

AI Frontier Lectures

Apr 24, 2025 · Artificial Intelligence

Why AI’s Second Half Is About Products, Not Just Models – A Deep Dive

The article argues that AI is entering a new phase where defining real‑world tasks and robust evaluation outweigh pure model improvements, highlighting the rise of reasoning‑augmented reinforcement learning, the need for product‑oriented thinking, and the shortcomings of current i.i.d. benchmark practices.

AI trends · industry insight · product focus

0 likes · 9 min read

AntTech

Apr 21, 2025 · Artificial Intelligence

InclusionAI Community to Present AReaL Reinforcement Learning Framework and AWorld Multi‑Agent Framework at ICLR 2025

The InclusionAI open‑source community, initiated by Ant Group, will showcase the latest advances of its reinforcement‑learning framework AReaL and multi‑agent framework AWorld at the ICLR 2025 conference in Singapore, highlighting performance breakthroughs, open‑source contributions, and industry‑focused AI research.

AReaL · AWorld · Ant Group

0 likes · 5 min read

DataFunTalk

Apr 21, 2025 · Artificial Intelligence

Mechanize: A Controversial AI Startup Aiming to Fully Automate All Work and the Global Economy

Mechanize, a new AI startup founded by Epoch AI co‑founder Tamay Besiroglu, aims to fully automate all white‑collar work and the global economy, targeting a $60 trillion labor market, but faces technical hurdles, investor scrutiny, and widespread criticism over its radical vision.

AI automation · AI startups · Economic Impact

0 likes · 6 min read

AI Algorithm Path

Apr 20, 2025 · Artificial Intelligence

Boosting Visual Reasoning in VLMs with Reinforcement Learning

The article analyzes how reinforcement learning, which transformed LLM reasoning in DeepSeek, can be applied to visual‑language models to overcome the limitations of traditional chain‑of‑thought prompting and supervised fine‑tuning, presenting concrete reward designs, training pipelines, and a critical assessment of their strengths and weaknesses.

LLM · RL training · Visual-Language Models

0 likes · 10 min read

Fighter's World

Apr 18, 2025 · Artificial Intelligence

Rethinking the AGI Roadmap: From Data Imitation to Experience‑Driven Superiority

The article analyzes the emerging "Era of Experience" in AI, arguing that reliance on static human data limits progress and that reinforcement learning‑based experiential learning—exemplified by AlphaZero—offers a path toward surpassing human knowledge, while outlining the technical, safety, and ethical challenges ahead.

AGI · AlphaZero · Experience Era

0 likes · 19 min read

AI Frontier Lectures

Apr 18, 2025 · Artificial Intelligence

From RL’s Early Days to Its Future: A Four‑Stage Evolution of Reinforcement Learning

This reflective essay traces reinforcement learning’s decade‑long evolution through four stages—early algorithmic foundations, application‑driven growth, problem‑construction focus, and speculative future—while critiquing the expanding definition and its impact on research and industry.

AI research · RL evolution · RLHF

0 likes · 9 min read

AI Frontier Lectures

Apr 17, 2025 · Artificial Intelligence

Why Reinforcement Learning Fails to Boost Small LLM Reasoning: A Deep Dive

This article analyzes a recent study on language‑model reasoning, revealing that reinforcement learning often brings little or no improvement, while evaluation variance caused by seeds, hardware, and decoding settings can dramatically affect benchmark results, and supervised fine‑tuning emerges as a more reliable path.

LLM · Reproducibility · reinforcement learning

0 likes · 12 min read

Data Thinking Notes

Apr 15, 2025 · Artificial Intelligence

Understanding AI Agents: From Reinforcement Learning to LLM-Powered Planning

Professor Li Hongyi’s lecture provides a comprehensive, step‑by‑step exploration of AI agents, covering their definitions, reinforcement‑learning roots, LLM integration, memory mechanisms, tool usage, planning strategies, benchmarks, and practical examples, offering a valuable resource for anyone studying modern artificial intelligence.

AI agents · Benchmark · Memory

0 likes · 67 min read

Volcano Engine Developer Services

Apr 14, 2025 · Artificial Intelligence

Introducing Multi‑SWE‑bench: The First Multilingual Code‑Fix Benchmark for LLMs

ByteDance’s Doubao model team has open‑sourced Multi‑SWE‑bench, a multilingual benchmark covering seven major programming languages with 1,632 real‑world bug‑fix tasks, complete Docker environments, difficulty grading, and strict human validation, aiming to evaluate and advance large‑language‑model code‑repair capabilities beyond Python.

Dataset · LLM Benchmark · Software Engineering

0 likes · 11 min read

AI Algorithm Path

Apr 13, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization for LLM Training

The article explains GRPO, a reinforcement‑learning algorithm that extends PPO with group sampling, no value network, dual penalties and KL regularisation, showing how it improves efficiency and stability when fine‑tuning large language models such as DeepSeek‑Math and DeepSeek‑R1.

DeepSeek · GRPO · PPO

0 likes · 6 min read

Network Intelligence Research Center (NIRC)

Apr 9, 2025 · Artificial Intelligence

Why Scaling Laws Fail for Video MLLMs: Uncovering the Temporal Hacking Problem

The article analyzes the anti‑scaling phenomenon in video large‑language models, identifies a “temporal hacking” shortcut where models focus on a few key frames, formalizes it via reward‑hacking theory, introduces the Temporal Perplexity (TPL) metric, and proposes an Unhackable Temporal Rewarding (UTR) framework to mitigate the issue.

Temporal Perplexity · UTR · reinforcement learning

0 likes · 14 min read

AI Algorithm Path

Apr 2, 2025 · Artificial Intelligence

Vision‑Reasoning Model: Enabling LLMs to See and Think

The article analyzes the limitations of current visual language models and large reasoning models, proposes a combined Vision‑Reasoning Model (VRM), details its architecture using LLaVA, describes end‑to‑end fine‑tuning and reinforcement‑learning reward design, and argues that such models will become the next breakthrough in AI.

DeepSeek · LLaVA · Vision-Language Model

0 likes · 9 min read

Data Thinking Notes

Mar 30, 2025 · Artificial Intelligence

How DeepSeek‑R1 and Kimi‑K1.5 Push the Boundaries of Strong Reasoning Models

This comprehensive analysis by the Peking University AI Alignment team dissects the technical innovations behind DeepSeek‑R1, DeepSeek‑R1‑Zero, and Kimi‑K1.5, covering reinforcement‑learning‑based post‑training, rule‑based rewards, GRPO optimization, scaling laws, multimodal extensions, safety challenges, and future research directions.

AI Alignment · DeepSeek · Kimi

0 likes · 57 min read

Fighter's World

Mar 29, 2025 · Industry Insights

A Year in AI: Key Insights from the Unsupervised Learning & Latent Space Podcast

The podcast recap dissects a year of rapid AI change, highlighting surprise‑fast open‑source model releases, shifting foundation‑model dynamics, the rise of GPT wrappers, over‑hyped agents, undervalued memory, product‑market fit debates, infrastructure opportunities, and lingering mysteries like RL in non‑verifiable domains.

AI Infrastructure · AI trends · GPT wrappers

0 likes · 22 min read

JavaEdge

Mar 27, 2025 · Artificial Intelligence

Can a Single LLM Both See and Reason? Exploring Visual Reasoning Models (VRM)

This article examines the limitations of current vision‑language and reasoning models, proposes a visual reasoning model (VRM) that can process images and perform deep logical inference, and discusses architecture, training methods, reinforcement‑learning reward designs, and practical challenges.

Deep Learning · LLM · Vision-Language Model

0 likes · 8 min read

JD Tech

Mar 26, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models (CAIG)

The JD advertising team proposes a CTR‑driven advertising image generation framework (CAIG) that leverages multimodal large language models, a novel reward model, and product‑centric preference optimization to produce ad images with superior click‑through performance, validated by extensive offline and online experiments.

CTR optimization · Reward model · advertising image generation

0 likes · 10 min read

AI Frontier Lectures

Mar 24, 2025 · Artificial Intelligence

What Can AI Agents Learn from the Latest AIR 2025 Research?

The article compiles insights from the AIR 2025 conference and related talks, covering the evolution of agents from reinforcement‑learning to LLM‑driven systems, novel agent architectures like AIDE, GUI agents, natural‑language reinforcement learning, and scaling advances in large language models such as Qwen, while highlighting key algorithms, benchmarks, and open research questions.

AI agents · Agent Architecture · GUI agents

0 likes · 27 min read

JD Tech Talk

Mar 24, 2025 · Artificial Intelligence

MaRCA: Multi‑Agent Reinforcement Learning Computation Allocation for Full‑Chain Ad Serving

This article presents MaRCA, a multi‑agent reinforcement learning framework that allocates computation resources across the full ad‑serving chain by modeling user value, compute consumption, and action rewards, enabling fine‑grained shifting of compute toward high‑quality traffic and achieving significant business gains under strict latency constraints.

AI Optimization · ad serving · computation allocation

0 likes · 16 min read

Architect

Mar 23, 2025 · Artificial Intelligence

The Future of AI Agents: From Prompt‑Driven Workflows to Model‑as‑Product and Reinforcement‑Learning‑Powered Agents

The article argues that the next wave of AI agents will shift from brittle, prompt‑driven workflows like Manus to truly autonomous, model‑centric agents trained with reinforcement learning and reasoning, exemplified by OpenAI's Deep Research and Anthropic's Claude 3.7 Sonnet, while the API‑driven market model collapses.

AI agents · Claude · DeepResearch

0 likes · 28 min read

Baobao Algorithm Notes

Mar 20, 2025 · Artificial Intelligence

Unlocking Large‑Scale Deep Reinforcement Learning: PPO, GAE, and PPG Deep Dive

This comprehensive guide examines large‑scale deep reinforcement learning, detailing policy‑gradient fundamentals, the mathematics of PPO and GAE, practical implementation tricks, reward and observation normalization, network initialization, and the newer Phasic Policy Gradient method, all supported by code snippets and key research references.

Algorithm Optimization · Deep RL · GAE

0 likes · 19 min read

Baobao Algorithm Notes

Mar 19, 2025 · Artificial Intelligence

Why Does GRPO Loss Start at Zero and Grow During OpenR1 Training?

The article explains why the GRPO loss in OpenR1 and trl starts at zero and then rises, detailing the underlying KL‑divergence formulation, the single‑step update mechanism, and how gradients are preserved despite a zero scalar loss, with code examples from the trl implementation.

GRPO · Loss Initialization · OpenR1

0 likes · 5 min read

JD Retail Technology

Mar 18, 2025 · Artificial Intelligence

Multi‑Agent Reinforcement Learning Based Full‑Chain Computation Allocation (MaRCA) for Advertising Systems

MaRCA, a multi‑agent reinforcement‑learning framework, allocates compute across JD’s ad‑serving chain by jointly estimating user value, resource consumption, and action outcomes while dynamically adjusting to real‑time load, achieving roughly 15 % higher ad revenue without extra compute resources.

Advertising · Compute Scheduling · Deep Learning

0 likes · 18 min read

Architect

Mar 17, 2025 · Artificial Intelligence

Can a 7B Language Model Solve Sudoku with Reinforcement Learning? Findings and Lessons

This article details a reinforcement‑learning experiment that teaches 7B‑ and 3B‑parameter language models to solve Sudoku, covering data preparation, GRPO‑based reward design, training configurations, performance comparisons, key insights, and future research directions.

GRPO · Model Scaling · language models

0 likes · 15 min read

Data Thinking Notes

Mar 16, 2025 · Artificial Intelligence

Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives

DeepSeek‑R1 replaces the traditional PPO‑based RLHF approach with GRPO, reducing reliance on human‑labeled data by using pure reinforcement learning environments and carefully designed reward mechanisms; the article explains reinforcement learning fundamentals, compares PPO, DPO and GRPO, and offers practical application recommendations.

AI Alignment · DPO · GRPO

0 likes · 14 min read

Architect

Mar 16, 2025 · Artificial Intelligence

Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

This article walks through the complete lifecycle of building a small LLM, covering token‑level inference, pre‑training, post‑training steps such as supervised fine‑tuning, reward‑model creation, and reinforcement‑learning methods like DPO, PPO and GRPO, culminating in a practical 0.5B model fine‑tuned for chain‑of‑thought reasoning.

GRPO · LLM training · Reward Modeling

0 likes · 22 min read

AI Algorithm Path

Mar 14, 2025 · Artificial Intelligence

Understanding Different Types of AI Agents: From Simple Reflex to Multi‑Agent Systems

This article introduces the main categories of AI agents—including simple reflex, model‑based, goal‑based, utility‑based, learning, hierarchical, and multi‑agent systems—explaining their operating principles, typical use cases, advantages, limitations, and providing concrete Python code examples for each.

AI agents · Agent Types · Python

0 likes · 19 min read

JD Retail Technology

Mar 14, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models

The paper presents CAIG, a CTR‑driven advertising image generation pipeline that pre‑trains a multimodal LLM on e‑commerce data, trains a reward model on CTR‑labeled image pairs, and fine‑tunes generation via product‑centric preference optimization, achieving state‑of‑the‑art online and offline performance.

AI · CTR · ad image generation

0 likes · 11 min read

JD Tech Talk

Mar 13, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation with Multimodal Large Language Models

This paper proposes CAIG, a novel method for generating high-CTR advertising images using multimodal large language models, combining reinforcement learning and preference optimization to align generated content with product features.

CTR prediction · advertising image generation · multimodal large language models

0 likes · 10 min read

NewBeeNLP

Mar 11, 2025 · Artificial Intelligence

How DeepSeek’s New Architecture Redefines LLM Efficiency and Performance

This article analyzes DeepSeek’s recent breakthroughs—including the Multi‑Head Latent Attention (MLA), Group Relative Policy Optimization (GRPO), and a refined Mixture‑of‑Experts design—along with its three‑stage training pipeline, RL‑only R1‑Zero variant, and benchmark comparisons against GPT‑4o‑Mini and Llama 3.1, highlighting both gains and remaining challenges.

DeepSeek · LLM · Mixture of Experts

0 likes · 18 min read

Architect

Mar 9, 2025 · Artificial Intelligence

Experiments with Reinforcement Learning Fine‑Tuning of a 0.5B Qwen Model on the KK Dataset

The author reports a series of reinforcement‑learning‑based fine‑tuning experiments on a 0.5‑billion‑parameter Qwen instruct model using the KK dataset, detailing reward design adjustments, curriculum‑style data scaling, observed convergence issues, and hypotheses about why small models fail to develop long reasoning chains.

LLM fine-tuning · curriculum learning · reinforcement learning

0 likes · 11 min read

Top Architect

Mar 9, 2025 · Artificial Intelligence

Alibaba Unveils Qwen QwQ-32B: A Compact Open‑Source LLM Rivaling DeepSeek

Alibaba has released the open‑source Qwen QwQ‑32B model, a 32‑billion‑parameter LLM that matches DeepSeek‑R1's performance while being deployable on consumer‑grade GPUs, and the announcement is accompanied by extensive promotional offers for AI‑related products and services.

AI Benchmark · Alibaba · Qwen

0 likes · 7 min read

DataFunTalk

Mar 7, 2025 · Artificial Intelligence

DeepSeek R1 Technical Report: Insights into Reasoning Models and Their Impact

This presentation reviews the development, technical details, and societal impact of DeepSeek's R1 model, explaining its reasoning capabilities, training pipeline, comparisons with other models, and future directions for AI research and product applications.

AI research · DeepSeek · R1

0 likes · 53 min read

Baobao Algorithm Notes

Mar 6, 2025 · Artificial Intelligence

Alibaba Unveils QwQ-32B: A 32‑Billion‑Parameter Reasoning Model with Agent Capabilities

Alibaba has open‑sourced its new QwQ‑32B reasoning model, a 32.5‑billion‑parameter transformer that rivals top models like DeepSeek‑R1 and o1‑mini, features integrated agent abilities for tool use and critical thinking, and has a low deployment barrier; the article includes technical specifications and RL‑based training details.

Alibaba · Transformer · agent capabilities

0 likes · 4 min read

21CTO

Mar 5, 2025 · Artificial Intelligence

Why Barto and Sutton Won the 2024 Turing Award: The Rise of Reinforcement Learning

The ACM awarded Andrew Barto and Richard Sutton the 2024 Turing Award for pioneering reinforcement learning; this overview details their seminal contributions, academic biographies, and urgent warnings about AI safety.

Andrew Barto · Richard Sutton · Turing Award

0 likes · 7 min read

Ma Wei Says

Mar 4, 2025 · Artificial Intelligence

Microsoft’s Open‑Source Multimodal AI Agent Model Magma: Capabilities and Innovations

On February 25, 2025, Microsoft open‑sourced its first multimodal AI agent foundation model, Magma, which extends multimodal processing to images, video, and text, introduces Set‑of‑Mark and Trace‑of‑Mark techniques for spatial‑temporal reasoning, optimizes modular inference for edge devices, and integrates reinforcement learning for adaptive task execution.

Edge Computing · Magma · Multimodal AI

0 likes · 6 min read

Architect

Mar 3, 2025 · Artificial Intelligence

Unlocking Reasoning LLMs: Methods, DeepSeek R1 Insights, and Cost‑Effective Strategies

This article examines how to build and improve reasoning‑capable large language models, explains the definition and use‑cases of reasoning models, details DeepSeek‑R1’s training pipeline, compares four key enhancement methods—including inference‑time scaling, pure RL, SFT + RL, and distillation—and offers budget‑friendly advice.

AI research · DeepSeek · Inference Scaling

0 likes · 27 min read

Architect

Feb 27, 2025 · Artificial Intelligence

Understanding Reasoning Large Language Models: DeepSeek‑R1 and the Rise of Test‑Time Computation

This article explains how reasoning‑oriented large language models such as DeepSeek‑R1 and OpenAI o1‑mini shift AI research from training‑time scaling to test‑time computation, detailing the underlying principles, new scaling laws, verification techniques, reinforcement‑learning pipelines, and practical methods for distilling reasoning capabilities into smaller models.

DeepSeek-R1 · Inference · large language models

0 likes · 18 min read

ShiZhen AI

Feb 27, 2025 · Artificial Intelligence

Step‑by‑Step Guide: Build Your Own Lerobot SO‑ARM100 Robotic Arm from Scratch

This article walks you through the entire process of assembling a low‑cost Lerobot SO‑ARM100 6‑DOF robotic arm, configuring its Feetech servos, calibrating motion, adding dual cameras for teleoperation, collecting a dataset, and training a reinforcement‑learning policy locally or on cloud GPUs, with detailed troubleshooting tips and code examples.

Lerobot · Python · SO-ARM100

0 likes · 16 min read

Tencent Technical Engineering

Feb 26, 2025 · Artificial Intelligence

Engineers' Perspectives on DeepSeek: Technical Innovations and Implications

Thirteen engineers praise DeepSeek’s open‑source, reinforcement‑learning‑driven architecture—using FP8 storage and SFT‑free training—to deliver GPT‑4‑level reasoning at one‑twentieth the cost, enabling single‑GPU deployment, lowering barriers for academia and startups, and prompting notable market reactions that could democratize advanced AI.

AI cost reduction · DeepSeek · FP8

0 likes · 9 min read

DevOps

Feb 23, 2025 · Artificial Intelligence

Understanding Reinforcement Learning, RLHF, PPO and GRPO for AI Applications

This article explains how DeepSeek‑R1‑Zero uses group‑relative policy optimization (GRPO) to enhance reasoning without labeled data, introduces reinforcement learning from human feedback (RLHF) and its components, and compares the PPO and GRPO algorithms, highlighting their suitable engineering scenarios and practical implications for AI applications.

AI model training · Deep Learning · GRPO

0 likes · 15 min read

Architect

Feb 22, 2025 · Artificial Intelligence

How Open‑Source Projects Reproduced DeepSeek‑R1 and Pushed LLM Limits

This article reviews the most notable open‑source reproductions of DeepSeek‑R1—including Open R1, OpenThoughts, LIMO and DeepScaleR—detailing their data pipelines, training steps, reinforcement‑learning strategies, dataset constructions, and benchmark results that demonstrate how small, high‑quality data can rival massive‑scale models.

AI research · DeepSeek-R1 · Model Scaling

0 likes · 26 min read

Python Programming Learning Circle

Feb 20, 2025 · Artificial Intelligence

Building a StarCraft II AI Bot with DeepMind's pysc2 in Python

This article provides a step‑by‑step guide, complete with Python code examples, for creating a Protoss AI bot using DeepMind's pysc2 library to mine resources, construct buildings, train units, implement scouting, and execute attack strategies against increasingly difficult computer opponents in StarCraft II.

AI bot · DeepMind · StarCraft II

0 likes · 28 min read

Architect's Alchemy Furnace

Feb 19, 2025 · Artificial Intelligence

DeepSeek’s Self‑Correction: Transforming AI Reliability and Safety

The article explores DeepSeek’s innovative self‑correction system—combining a Mixture‑of‑Experts architecture with reinforcement‑learning feedback—to achieve real‑time error detection, dynamic knowledge‑graph updates, and enhanced safety in high‑risk fields like autonomous driving and medical diagnostics.

AI Safety · DeepSeek · Mixture of Experts

0 likes · 9 min read

Tencent Technical Engineering

Feb 19, 2025 · Artificial Intelligence

Reproduction and Analysis of DeepSeek R1/R1‑zero Reinforcement Learning Experiments

This note surveys four open‑source reproductions of DeepSeek R1/R1‑zero reinforcement‑learning pipelines, re‑implements their training on math and logic datasets using Qwen‑based models, shows that format‑plus‑accuracy rewards improve long‑chain reasoning though stability and scaling remain challenges, and outlines future directions for large‑scale RL and business deployment.

DeepSeek-R1 · large language model · long chain of thought

0 likes · 39 min read

AI Algorithm Path

Feb 18, 2025 · Artificial Intelligence

Build DeepSeek‑R1 from Scratch: Complete Training Process with Code Walkthrough

This article provides a step‑by‑step, code‑first guide to reproducing DeepSeek‑R1 from the ground up, covering model selection, dataset preparation, custom reward functions, GRPO reinforcement‑learning training, supervised fine‑tuning, reasoning‑oriented RL, rejection sampling, and model distillation.

DeepSeek-R1 · LLM training · Python

0 likes · 48 min read

DataFunTalk

Feb 16, 2025 · Artificial Intelligence

Understanding Reasoning LLMs: DeepSeek R1 Variants, Inference‑Time Scaling, and Training Strategies

This article explains what reasoning language models are, outlines their strengths and weaknesses, details DeepSeek R1's three variants and their training pipelines—including pure reinforcement learning, SFT + RL, and distillation—while also discussing inference‑time scaling techniques and related research such as Sky‑T1 and TinyZero.

DeepSeek · Inference Scaling · model distillation

0 likes · 16 min read

JD Cloud Developers

Feb 13, 2025 · Artificial Intelligence

Unlocking DeepSeek R1: Concepts, Training Secrets, and Real-World Experiments

This article demystifies DeepSeek R1 by explaining key concepts such as online search integration and the R1 model, detailing its two‑phase training pipeline, core techniques like iterative data enhancement, and showcases practical reproductions, benchmark tests, and deployment examples for AI developers.

DeepSeek · Model Training · knowledge distillation

0 likes · 12 min read

AI Algorithm Path

Feb 12, 2025 · Artificial Intelligence

Essential DeepSeek‑R1 Reading List: Papers Behind the 2025 Hottest LLM

This article compiles a curated reading list of foundational and recent research papers—from the original Transformer to chain‑of‑thought, mixture‑of‑experts, and reinforcement‑learning studies—that together explain the breakthroughs behind DeepSeek‑R1 and guide readers through the technical evolution of modern large language models.

DeepSeek · Mixture of Experts · Research Papers

0 likes · 15 min read

Baobao Algorithm Notes

Feb 12, 2025 · Artificial Intelligence

How X‑R1 Triggers Aha Moments in Low‑Cost RL Training of 0.5B LLMs

The X‑R1 open‑source framework demonstrates that a 0.5B language model can achieve rapid reasoning improvements and observable "Aha Moments" using reinforcement learning on a modest 4‑GPU setup, detailing its design, performance metrics, installation steps, and future roadmap.

AI · LLM · Training Framework

0 likes · 6 min read

Architects' Tech Alliance

Feb 10, 2025 · Industry Insights

What Makes DeepSeek’s New V3 Model Rival GPT‑4o? A Deep Dive into Large‑Scale AI

This article explains what defines a large AI model, compares parameter scales of GPT‑3, GPT‑4 and M6, and analyzes DeepSeek’s recent releases—V3, R1, and Janus‑Pro—highlighting their benchmark performance, reinforcement‑learning techniques, and cost efficiency versus leading proprietary models.

AI Benchmark · DeepSeek · Model Scaling

0 likes · 5 min read

Big Data Technology Architecture

Feb 9, 2025 · Artificial Intelligence

Reproducing DeepSeek R1 Reasoning Ability with GRPO on Qwen2.5‑7B in Colab

This article explains how to replicate DeepSeek R1's slow‑thinking reasoning using the GRPO reinforcement‑learning algorithm on the Qwen2.5‑7B model in a free Colab notebook, covering the underlying chain‑of‑thought (CoT) concept, reward‑function design, data preparation, training configuration, and observed results.

Fine-tuning · GRPO · LLM

0 likes · 14 min read

Top Architect

Feb 9, 2025 · Artificial Intelligence

DeepSeek‑R1: Training Pipeline, Reinforcement‑Learning Techniques, and Experimental Results

The article reviews DeepSeek‑R1’s training methodology—including cold‑start data collection, multi‑stage RL fine‑tuning, SFT data generation, and model distillation—highlights its performance comparable to OpenAI‑o1‑1217, and discusses key contributions, reward design, successful experiments, and failed attempts.

AI research · DeepSeek · LLM

0 likes · 12 min read
