Tagged articles
43 articles
Machine Heart
May 13, 2026 · Artificial Intelligence

Why Bigger Teachers Don’t Teach Better: Tsinghua’s On‑Policy Distillation Study

Recent research by Tsinghua and collaborators dissects On‑Policy Distillation for large language models and finds that higher‑scoring teachers often fail to improve students unless the two models' thinking patterns align. The study details token‑level overlap dynamics and failure cases, and offers two practical remedies to rescue ineffective distillation.

Model Scaling, On-Policy Distillation, RL Post-Training
0 likes · 9 min read
IT Services Circle
May 1, 2026 · Artificial Intelligence

GPT’s Father Sends AI Back to 1930: An AI That Writes Python Without Seeing Code

Alec Radford’s team released Talkie, a 13‑billion‑parameter LLM trained exclusively on pre‑1931 texts (260 billion tokens). Surprisingly, it can generate correct Python programs via few‑shot learning, which the article presents as evidence of genuine reasoning rather than mere memorisation. The article also details the experiments, data‑quality challenges, comparative performance, and an ambitious scaling roadmap.

Model Scaling, OCR data quality, few‑shot programming
0 likes · 8 min read
SuanNi
Mar 17, 2026 · Artificial Intelligence

How Attention Residuals Boost Transformer Efficiency and Scale

The article presents the Attention Residuals architecture, explains how it replaces uniform residual addition with learned attention‑based aggregation, details full and block variants, engineering tricks for distributed training, and shows extensive scaling‑law experiments where the new design consistently improves validation loss and training efficiency across model sizes.

Attention Residuals, Deep Learning, Model Scaling
0 likes · 13 min read
Machine Learning Algorithms & Natural Language Processing
Mar 17, 2026 · Artificial Intelligence

MIT Study Shows Adding Noise to Large Models Can Replace GRPO/PPO Tuning

A new MIT paper reveals that pretrained large models already contain many hidden expert submodels, and that a simple one‑step Gaussian perturbation (RandOpt) can locate and ensemble these experts to achieve performance comparable to or better than traditional GRPO/PPO tuning, especially as model size grows.

GRPO, Model Scaling, PPO
0 likes · 9 min read
AI Engineer Programming
Mar 13, 2026 · Artificial Intelligence

Big Model vs. Big Harness: Who Really Powers AI Agents?

The article examines whether the success of AI agents stems from ever‑stronger large language models or from the surrounding harness—context management, tool orchestration, and reliability engineering—by comparing viewpoints, empirical evaluations, and practical guidance for developers.

AI Agent, Harness Engineering, LLM
0 likes · 11 min read
Baobao Algorithm Notes
Jan 26, 2026 · Artificial Intelligence

From Search Ads to Foundation Models: My Journey Building the EvoCUA GUI Agent

The author explains why he transitioned from search advertising algorithms to foundation model research, outlines the four typical activities of base‑model teams, and shares detailed technical insights, experimental practices, and scaling strategies that led the EvoCUA GUI Agent to achieve open‑source SOTA on OSWorld.

AI research, GUI agents, Model Scaling
0 likes · 17 min read
Data Party THU
Oct 11, 2025 · Artificial Intelligence

From Transformers to LLaMA 4: A Journey Through the Biggest LLMs

This article surveys the most influential large language models released since 2017, detailing the core innovations of Transformer, BERT, the GPT series, T5, Retrieval‑Augmented Generation, and Meta's latest LLaMA models, while highlighting their architectures, training paradigms, and impact on NLP research.

LLM, Model Scaling, large language models
0 likes · 21 min read
Baobao Algorithm Notes
Sep 28, 2025 · Artificial Intelligence

How Much GPU Memory Do LLMs Really Need? A Deep Dive into Training & Inference

This article breaks down the GPU memory requirements of large language models during training and inference, detailing the contributions of model weights, optimizer states, activations, KV cache, and activation recomputation, and provides concrete formulas, examples, and scaling insights for models like Qwen3 and DeepSeek V3.

GPU Memory, KV cache, LLM
0 likes · 18 min read
Aikesheng Open Source Community
Aug 5, 2025 · Artificial Intelligence

How Model Size Shapes SQL Mastery: Insights from 1.5B‑32B LLMs

This article examines how the parameter count of large language models influences their ability to generate and understand SQL, comparing small (1.5B), medium (7B), and large (32B) models through a complex query case study, and highlights the trade‑offs between accuracy, reasoning depth, and resource consumption.

Model Scaling, SQL, artificial intelligence
0 likes · 14 min read
DaTaobao Tech
Jul 16, 2025 · Artificial Intelligence

From GPT‑4 to Agentic AI: How LLM Architecture Evolved (2023‑2025)

Since GPT‑4’s 2023 debut, large language models have shifted from sheer scale toward efficiency‑driven designs such as MoE, MLA, and new attention mechanisms, advanced chain‑of‑thought reasoning, and agentic tool use. The article traces how these changes are reshaping benchmarks, commercial strategies, and the future of AI.

Agentic AI, LLM, Model Scaling
0 likes · 24 min read
Open Source Linux
Jun 12, 2025 · Artificial Intelligence

From Transformers to DeepSeek‑R1: The Evolution of Large Language Models (2017‑2025)

This article chronicles the rapid development of large language models from the 2017 Transformer breakthrough through the rise of BERT, GPT‑3, multimodal models, alignment techniques like RLHF, and finally the cost‑efficient DeepSeek‑R1 in 2025, highlighting key innovations, scaling trends, and real‑world impacts.

AI Alignment, Deep Learning, Model Scaling
0 likes · 26 min read
Architect
May 18, 2025 · Artificial Intelligence

How Much GPU Memory Can One Model Use? A Deep Dive into Transformer Memory Accounting

This article breaks down GPU memory consumption for large Transformer models, explains how to estimate each component—parameters, optimizer state, activations, gradients—and shows how parallelism, mixed precision, and recomputation strategies can dramatically reduce the footprint.

AI training, GPU Memory, Memory Optimization
0 likes · 14 min read
AI Frontier Lectures
May 5, 2025 · Industry Insights

What Will Large Language Models Look Like in the Next Five Years? A Deep Dive into Trends and Challenges

The article reviews five years of AI model evolution, analyzes current scaling and reinforcement‑learning trends, and forecasts architectural, mathematical, and infrastructure directions for large language models through 2030, highlighting potential breakthroughs and the risks of over‑reliance on benchmarks.

AI trends, Industry analysis, Model Scaling
0 likes · 22 min read
AIWalker
Apr 17, 2025 · Artificial Intelligence

Unveiling DeepSeek’s Janus Series: Decoupled Visual Encoding for Unified Multimodal Understanding and Generation

This article provides an in‑depth analysis of DeepSeek’s Janus and Janus‑Pro models, explaining how decoupling visual encoding resolves the conflict between multimodal understanding and generation, detailing training stages, data scaling, architectural choices, and presenting extensive benchmark results that demonstrate significant performance gains.

Benchmark, DeepSeek, Janus
0 likes · 23 min read
AI Frontier Lectures
Apr 13, 2025 · Artificial Intelligence

How SICOG Enables Self‑Evolving Multimodal Models with Zero‑Label Data

The paper introduces SICOG, a three‑stage collaborative framework that combines post‑training enhancement, inference optimization, and re‑pretraining with a self‑generated data loop, allowing large multimodal models to continuously improve without massive human‑annotated datasets, and demonstrates consistent gains across dozens of benchmarks.

Lifelong Learning, Model Scaling, chain‑of‑thought
0 likes · 12 min read
Data Thinking Notes
Apr 6, 2025 · Artificial Intelligence

Why Mixture of Experts (MoE) is Revolutionizing Large AI Models

Mixture of Experts (MoE) leverages dynamic conditional computation and specialized expert networks to overcome the parameter explosion and inefficiency of dense models, offering scalable capacity, multi‑task adaptability, and improved efficiency, while addressing challenges such as training stability, communication overhead, and load balancing.

Deep Learning, Mixture of Experts, Model Scaling
0 likes · 7 min read
AIWalker
Mar 15, 2025 · Artificial Intelligence

How SANA 1.5 Lets Small Models Reach New Text‑to‑Image SOTA

SANA 1.5 introduces an efficient model‑growth pipeline, depth‑pruning, and inference‑time scaling that reuse a 1.6 B‑parameter foundation to train a 4.8 B model with 8× lower memory, 60 % less training time, and GenEval scores that rival or surpass much larger diffusion models.

Inference Scaling, Model Scaling, diffusion
0 likes · 17 min read
Architect
Feb 22, 2025 · Artificial Intelligence

How Open‑Source Projects Reproduced DeepSeek‑R1 and Pushed LLM Limits

This article reviews the most notable open‑source reproductions of DeepSeek‑R1—including Open R1, OpenThoughts, LIMO and DeepScaleR—detailing their data pipelines, training steps, reinforcement‑learning strategies, dataset constructions, and benchmark results that demonstrate how small, high‑quality data can rival massive‑scale models.

AI research, DeepSeek-R1, Model Scaling
0 likes · 26 min read
Architects' Tech Alliance
Feb 10, 2025 · Industry Insights

What Makes DeepSeek’s New V3 Model Rival GPT‑4o? A Deep Dive into Large‑Scale AI

This article explains what defines a large AI model, compares parameter scales of GPT‑3, GPT‑4 and M6, and analyzes DeepSeek’s recent releases—V3, R1, and Janus‑Pro—highlighting their benchmark performance, reinforcement‑learning techniques, and cost efficiency versus leading proprietary models.

AI Benchmark, DeepSeek, Model Scaling
0 likes · 5 min read
Open Source Linux
Feb 10, 2025 · Artificial Intelligence

How DeepSeek R1 Uses Large‑Scale Reinforcement Learning to Replicate OpenAI o1

This article examines DeepSeek R1’s large‑scale reinforcement‑learning approach, its training pipeline that combines rule‑based scaling and deep‑reasoning SFT data, and why its open‑source, low‑cost replication of OpenAI o1 marks a pivotal step toward more efficient, democratized AI models.

AI efficiency, DeepSeek, Model Scaling
0 likes · 18 min read
Architects' Tech Alliance
Feb 9, 2025 · Artificial Intelligence

How DeepSeek R1 Replicates OpenAI o1 Using Large‑Scale Reinforcement Learning

The article provides an in‑depth technical analysis of DeepSeek R1, explaining how it reproduces OpenAI o1's reasoning abilities through rule‑based large‑scale reinforcement learning, mixed SFT data, and efficient scaling, while discussing its broader impact on AI model development and capability density trends.

AI industry, Capability Density, DeepSeek
0 likes · 19 min read
DataFunSummit
Feb 5, 2025 · Artificial Intelligence

Exploration and Practice of Large‑Model Data Construction

This presentation details engineering‑focused approaches to building, mixing, and filtering data for large language models, covering data preparation, pre‑training mix strategies such as DoReMi, DoGE and online sampling, post‑training data quality selection methods, and practical Q&A on scaling laws and PDF processing.

AI, Data Mixing, Model Scaling
0 likes · 15 min read
Architect
Jan 29, 2025 · Artificial Intelligence

How Janus‑Pro Redefines Multimodal AI with Bigger Models and New Training Strategies

DeepSeek’s newly released Janus‑Pro series (1B and 7B) advances multimodal AI by decoupling visual understanding and generation, employing optimized three‑stage training, massive data expansion, and larger LLM backbones, achieving performance that matches or exceeds leading models from Meta, Google, OpenAI, and Stability AI.

DeepSeek, Janus-Pro, Model Scaling
0 likes · 6 min read
JavaEdge
Nov 20, 2024 · Artificial Intelligence

7 Proven Strategies to Simplify Large Language Model Deployment

The article explains why deploying large language models is challenging and presents seven practical techniques—including defining deployment boundaries, model quantization, inference optimization, infrastructure consolidation, model replacement planning, GPU utilization, and using smaller models—to make LLM deployment more efficient and cost‑effective.

GPU Optimization, LLM deployment, Model Scaling
0 likes · 24 min read
Baobao Algorithm Notes
Oct 17, 2024 · Artificial Intelligence

How Meta’s Movie Gen Pushes Text‑to‑Video Generation to New Heights

Meta’s newly released 92‑page Movie Gen paper introduces a multimodal LLM that unifies text‑to‑image, text‑to‑video, personalized video, precise video editing, and audio generation, detailing its dual‑model architecture, training pipeline, temporal auto‑encoder design, scaling strategies, evaluation benchmark, and ablation studies.

Deep Learning, Model Scaling, Video Generation
0 likes · 34 min read
Architect
May 5, 2024 · Artificial Intelligence

The Rise of Small Language Models (SLM) and Their Impact on AI Development

As the performance gap between large and small language models narrows, researchers highlight the efficiency, adaptability, and specialized advantages of small language models (SLM), while also discussing the high costs, hallucinations, and security concerns that still challenge large‑scale LLMs.

AI efficiency, Edge Computing, LLM
0 likes · 9 min read
Huawei Cloud Developer Alliance
Nov 3, 2023 · Artificial Intelligence

Can LLMs Master Lifelong Learning? Exploring MoE and Continuous Adaptation

This article explains how large language models can achieve continual lifelong learning, outlines the key properties required, reviews mixture‑of‑experts (MoE) techniques—including sparse MoE, GShard, Switch Transformer, GLaM and PanGu‑Sigma—and discusses the remaining challenges such as model complexity, expert balancing and distributed communication overhead.

LLM, Lifelong Learning, Mixture of Experts
0 likes · 9 min read
DaTaobao Tech
Sep 11, 2023 · Artificial Intelligence

Large Language Model Upgrade Paths and Architecture Selection

This article analyzes upgrade paths of major LLMs—ChatGLM, LLaMA, Baichuan—detailing performance, context length, and architectural changes, then examines essential capabilities, data cleaning, tokenizer and attention design, and offers practical guidance for balanced scaling and efficient model construction.

Baichuan, ChatGLM, LLM architecture
0 likes · 32 min read
Volcano Engine Developer Services
Sep 8, 2023 · Artificial Intelligence

Why Volcano Engine Says Multi‑Model Strategy Is the Future of AI

In this interview, Volcano Engine’s president Tan Dai explains how the rise of large models reshapes the AI landscape, why training thresholds now favor a pyramid of ultra‑large, medium‑size, and vertical models, and how a cloud‑first, multi‑model approach can address cost, security, and scalability challenges for enterprises.

AI strategy, Cost Optimization, Model Scaling
0 likes · 10 min read
Rare Earth Juejin Tech Community
Jul 24, 2023 · Artificial Intelligence

Comprehensive Survey of Large Language Models: History, Key Technologies, Resources, and Future Directions

This article provides a detailed overview of large language models (LLMs), tracing their evolution from statistical and neural language models to modern pre‑trained transformers, discussing scaling, training, adaptation, utilization, evaluation methods, available resources, and outlining current challenges and future research directions.

Model Scaling, Pre‑training, Prompt engineering
0 likes · 26 min read
DataFunTalk
May 31, 2023 · Artificial Intelligence

Why GPT Can Exhibit Intelligence Through Next‑Token Prediction: A Comprehensive Exploration of Compression, Knowledge Circuits, and Model Scaling

This article examines the debate over whether large language models truly possess intelligence, arguing that next‑token prediction functions as a form of lossless data compression whose efficiency reflects intelligence, and it surveys research on knowledge extraction, neuron semantics, circuit competition, scaling effects, and the broader philosophical implications of GPT as a mirror of the world’s parameters.

GPT, Model Scaling, Next Token Prediction
0 likes · 59 min read
Architect
Apr 19, 2023 · Artificial Intelligence

Emergence in Large Language Models: Phenomena, Explanations, and Implications

This article reviews the emergence phenomena observed in large language models, explains how model scale, in‑context learning and chain‑of‑thought prompting contribute to sudden performance gains, discusses small‑model alternatives, and explores the relationship between emergence and the training‑time Grokking effect.

AI research, Emergence, In-Context Learning
0 likes · 13 min read
Architect
Apr 14, 2023 · Artificial Intelligence

Overview of Prominent Large Language Models and Instruction Fine‑Tuning Techniques

The article surveys major large language models—including GPT‑3, T5, LaMDA, Jurassic‑1, MT‑NLG, Gopher, Chinchilla, PaLM, U‑PaLM, OPT, LLaMA, BLOOM, GLM‑130B, and ERNIE 3.0 Titan—explains their architectures, scaling trade‑offs, and then details instruction‑fine‑tuned variants such as T0, FLAN, GPT‑3.5, ChatGPT, GPT‑4, Alpaca and ChatGLM, providing references for further study.

AI, ChatGPT, GPT-3
0 likes · 27 min read
Architect
Feb 18, 2023 · Artificial Intelligence

Paradigm Shifts in Large Language Models: From Pre‑training to AGI and Future Research Directions

The article reviews the evolution of large language models, highlighting two major paradigm shifts after GPT‑3, the role of scaling laws, knowledge acquisition, prompting techniques, reasoning abilities, and outlines future research priorities for building more capable and efficient AI systems.

AI reasoning, In-Context Learning, Model Scaling
0 likes · 71 min read
DataFunTalk
Nov 22, 2022 · Artificial Intelligence

NVIDIA's Advances in Multi‑Role Generative Dialogue Modeling and Synthetic Data‑Driven QA

This article reviews NVIDIA's recent work on multi‑role generative dialogue modeling using GPT‑2‑based architectures and on enhancing question‑answering systems with synthetic data pipelines, covering model design, data preparation from Reddit, extensive experiments, scaling effects, and practical Q&A insights.

GPT-2, Generative Dialogue, Model Scaling
0 likes · 17 min read
Alibaba Cloud Big Data AI Platform
Nov 4, 2022 · Artificial Intelligence

How AI Platforms Turn Dreams into Reality: Scaling, Efficiency, and Usability

In this talk from the 2022 Yunqi Conference, Jia Yangqing explains how Alibaba's AI platform addresses efficiency, scale, and usability challenges by moving the Damo Academy to the cloud, open‑sourcing ModelScope, and delivering large‑model training, deployment, and inference services at massive scale.

AI Engineering, Model Scaling, efficiency
0 likes · 10 min read