Tagged articles

43 articles

May 13, 2026 · Artificial Intelligence

Why Bigger Teachers Don’t Teach Better: Tsinghua’s On‑Policy Distillation Study

Recent research by Tsinghua and collaborators dissects On‑Policy Distillation for large language models, revealing that higher‑scoring teachers often fail to improve students unless their thinking patterns align, detailing token‑level overlap dynamics, failure cases, and two practical remedies to rescue ineffective distillation.

Model Scaling · On-Policy Distillation · RL Post-Training

0 likes · 9 min read

IT Services Circle

May 1, 2026 · Artificial Intelligence

GPT’s Father Sends AI Back to 1930: An AI That Writes Python Without Seeing Code

Alec Radford’s team released Talkie, a 13‑billion‑parameter LLM trained exclusively on pre‑1931 texts (2600 billion tokens). Surprisingly, it can generate correct Python programs via few‑shot learning, which the article presents as evidence of genuine reasoning rather than mere memorisation. The article also details the experiments, data‑quality challenges, comparative performance, and an ambitious scaling roadmap.

Model Scaling · OCR data quality · few‑shot programming

0 likes · 8 min read

SuanNi

Mar 17, 2026 · Artificial Intelligence

How Attention Residuals Boost Transformer Efficiency and Scale

The article presents the Attention Residuals architecture, explains how it replaces uniform residual addition with learned attention‑based aggregation, details full and block variants, engineering tricks for distributed training, and shows extensive scaling‑law experiments where the new design consistently improves validation loss and training efficiency across model sizes.

Attention Residuals · Deep Learning · Model Scaling

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Mar 17, 2026 · Artificial Intelligence

MIT Study Shows Adding Noise to Large Models Can Replace GRPO/PPO Tuning

A new MIT paper reveals that pretrained large models already contain many hidden expert submodels, and that a simple one‑step Gaussian perturbation (RandOpt) can locate and ensemble these experts to achieve performance comparable to or better than traditional GRPO/PPO tuning, especially as model size grows.

GRPO · Model Scaling · PPO

0 likes · 9 min read

AI Engineer Programming

Mar 13, 2026 · Artificial Intelligence

Big Model vs. Big Harness: Who Really Powers AI Agents?

The article examines whether the success of AI agents stems from ever‑stronger large language models or from the surrounding harness—context management, tool orchestration, and reliability engineering—by comparing viewpoints, empirical evaluations, and practical guidance for developers.

AI Agent · Harness Engineering · LLM

0 likes · 11 min read

Baobao Algorithm Notes

Jan 26, 2026 · Artificial Intelligence

From Search Ads to Foundation Models: My Journey Building the EvoCUA GUI Agent

The author explains why he transitioned from search advertising algorithms to foundation model research, outlines the four typical activities of base‑model teams, and shares detailed technical insights, experimental practices, and scaling strategies that led the EvoCUA GUI Agent to achieve open‑source SOTA on OSWorld.

AI research · GUI agents · Model Scaling

0 likes · 17 min read

AI Insight Log

Jan 13, 2026 · Artificial Intelligence

Why Bigger LLMs Still Forget Facts – DeepSeek’s Engram Memory Module Explained

This article analyzes DeepSeek’s new Engram module, showing how conditional memory reduces reliance on the compute‑only approach of large language models and improves knowledge retrieval, reasoning, long‑context handling, and system efficiency while maintaining strict parameter and FLOP budgets.

AI Architecture · DeepSeek · Engram

0 likes · 15 min read

Data Party THU

Oct 11, 2025 · Artificial Intelligence

From Transformers to LLaMA 4: A Journey Through the Biggest LLMs

This article surveys the most influential large language models released since 2017, detailing the core innovations of Transformer, BERT, GPT series, T5, Retrieval‑Augmented Generation, and the latest LLaMA and Meta models, while highlighting their architectures, training paradigms, and impact on NLP research.

LLM · Model Scaling · large language models

0 likes · 21 min read

Baobao Algorithm Notes

Sep 28, 2025 · Artificial Intelligence

How Much GPU Memory Do LLMs Really Need? A Deep Dive into Training & Inference

This article breaks down the GPU memory requirements of large language models during training and inference, detailing the contributions of model weights, optimizer states, activations, KV cache, and activation recomputation, and provides concrete formulas, examples, and scaling insights for models like Qwen3 and DeepSeek V3.

GPU Memory · KV cache · LLM

0 likes · 18 min read

Aikesheng Open Source Community

Aug 5, 2025 · Artificial Intelligence

How Model Size Shapes SQL Mastery: Insights from 1.5B‑32B LLMs

This article examines how the parameter count of large language models influences their ability to generate and understand SQL, comparing small (1.5B), medium (7B), and large (32B) models through a complex query case study, and highlights the trade‑offs between accuracy, reasoning depth, and resource consumption.

Model Scaling · SQL · artificial intelligence

0 likes · 14 min read

DaTaobao Tech

Jul 16, 2025 · Artificial Intelligence

From GPT‑4 to Agentic AI: How LLM Architecture Evolved (2023‑2025)

Since GPT‑4’s 2023 debut, large language models have shifted from sheer scale to efficiency‑driven designs, advanced reasoning with chain‑of‑thought, and agentic tool use, as illustrated by MoE, MLA, and new attention mechanisms, reshaping benchmarks, commercial strategies, and the future of AI.

Agentic AI · LLM · Model Scaling

0 likes · 24 min read

Open Source Linux

Jun 12, 2025 · Artificial Intelligence

From Transformers to DeepSeek‑R1: The Evolution of Large Language Models (2017‑2025)

This article chronicles the rapid development of large language models from the 2017 Transformer breakthrough through the rise of BERT, GPT‑3, multimodal models, alignment techniques like RLHF, and finally the cost‑efficient DeepSeek‑R1 in 2025, highlighting key innovations, scaling trends, and real‑world impacts.

AI Alignment · Deep Learning · Model Scaling

0 likes · 26 min read

Architect

May 18, 2025 · Artificial Intelligence

How Much GPU Memory Can One Model Use? A Deep Dive into Transformer Memory Accounting

This article breaks down GPU memory consumption for large Transformer models, explains how to estimate each component—parameters, optimizer state, activations, gradients—and shows how parallelism, mixed precision, and recomputation strategies can dramatically reduce the footprint.

AI training · GPU Memory · Memory Optimization

0 likes · 14 min read

AI Frontier Lectures

May 5, 2025 · Industry Insights

What Will Large Language Models Look Like in the Next Five Years? A Deep Dive into Trends and Challenges

The article reviews five years of AI model evolution, analyzes current scaling and reinforcement‑learning trends, and forecasts architectural, mathematical, and infrastructure directions for large language models through 2030, highlighting potential breakthroughs and the risks of over‑reliance on benchmarks.

AI trends · Industry analysis · Model Scaling

0 likes · 22 min read

AIWalker

Apr 17, 2025 · Artificial Intelligence

Unveiling DeepSeek’s Janus Series: Decoupled Visual Encoding for Unified Multimodal Understanding and Generation

This article provides an in‑depth analysis of DeepSeek’s Janus and Janus‑Pro models, explaining how decoupling visual encoding resolves the conflict between multimodal understanding and generation, detailing training stages, data scaling, architectural choices, and presenting extensive benchmark results that demonstrate significant performance gains.

Benchmark · DeepSeek · Janus

0 likes · 23 min read

AI Frontier Lectures

Apr 13, 2025 · Artificial Intelligence

How SICOG Enables Self‑Evolving Multimodal Models with Zero‑Label Data

The paper introduces SICOG, a three‑stage collaborative framework that combines post‑training enhancement, inference optimization, and re‑pretraining with a self‑generated data loop, allowing large multimodal models to continuously improve without massive human‑annotated datasets, and demonstrates consistent gains across dozens of benchmarks.

Lifelong Learning · Model Scaling · chain‑of‑thought

0 likes · 12 min read

AIWalker

Apr 7, 2025 · Artificial Intelligence

Is CLIP Obsolete? LeCun and Xie's New Multimodal Model Beats Language Supervision

A recent study by LeCun, Xie, and collaborators shows that large‑scale visual self‑supervised learning (Web‑SSL) can match or surpass CLIP on diverse VQA tasks, even without any language supervision, by scaling model size and data volume.

CLIP · Model Scaling · VQA

0 likes · 13 min read

Data Thinking Notes

Apr 6, 2025 · Artificial Intelligence

Why Mixture of Experts (MoE) is Revolutionizing Large AI Models

Mixture of Experts (MoE) leverages dynamic conditional computation and specialized expert networks to overcome the parameter explosion and inefficiency of dense models, offering scalable capacity, multi‑task adaptability, and improved efficiency, while addressing challenges such as training stability, communication overhead, and load balancing.

Deep Learning · Mixture of Experts · Model Scaling

0 likes · 7 min read

Architect

Mar 17, 2025 · Artificial Intelligence

Can a 7B Language Model Solve Sudoku with Reinforcement Learning? Findings and Lessons

This article details a reinforcement‑learning experiment that teaches 7B‑ and 3B‑parameter language models to solve Sudoku, covering data preparation, GRPO‑based reward design, training configurations, performance comparisons, key insights, and future research directions.

GRPO · Model Scaling · language models

0 likes · 15 min read

AIWalker

Mar 15, 2025 · Artificial Intelligence

How SANA 1.5 Lets Small Models Reach New Text‑to‑Image SOTA

SANA 1.5 introduces an efficient model‑growth pipeline, depth‑pruning, and inference‑time scaling that reuse a 1.6 B‑parameter foundation to train a 4.8 B model with 8× lower memory, 60 % less training time, and GenEval scores that rival or surpass much larger diffusion models.

Inference Scaling · Model Scaling · diffusion

0 likes · 17 min read

Architect

Feb 22, 2025 · Artificial Intelligence

How Open‑Source Projects Reproduced DeepSeek‑R1 and Pushed LLM Limits

This article reviews the most notable open‑source reproductions of DeepSeek‑R1—including Open R1, OpenThoughts, LIMO and DeepScaleR—detailing their data pipelines, training steps, reinforcement‑learning strategies, dataset constructions, and benchmark results that demonstrate how small, high‑quality data can rival massive‑scale models.

AI research · DeepSeek-R1 · Model Scaling

0 likes · 26 min read

AIWalker

Feb 15, 2025 · Artificial Intelligence

Janus-Pro Unveiled: A Unified Architecture for Multimodal Understanding and Generation

Janus-Pro, the open‑source successor to Janus, introduces a decoupled visual encoder and scaled training data to boost both multimodal understanding and text‑to‑image generation, achieving state‑of‑the‑art results on benchmarks such as GQA, GenEval and DPG‑Bench.

Janus-Pro · Model Scaling · Multimodal AI

0 likes · 13 min read

Architects' Tech Alliance

Feb 10, 2025 · Industry Insights

What Makes DeepSeek’s New V3 Model Rival GPT‑4o? A Deep Dive into Large‑Scale AI

This article explains what defines a large AI model, compares parameter scales of GPT‑3, GPT‑4 and M6, and analyzes DeepSeek’s recent releases—V3, R1, and Janus‑Pro—highlighting their benchmark performance, reinforcement‑learning techniques, and cost efficiency versus leading proprietary models.

AI Benchmark · DeepSeek · Model Scaling

0 likes · 5 min read

Open Source Linux

Feb 10, 2025 · Artificial Intelligence

How DeepSeek R1 Uses Large‑Scale Reinforcement Learning to Replicate OpenAI o1

This article examines DeepSeek R1’s large‑scale reinforcement‑learning approach, its training pipeline that combines rule‑based scaling and deep‑reasoning SFT data, and why its open‑source, low‑cost replication of OpenAI o1 marks a pivotal step toward more efficient, democratized AI models.

AI efficiency · DeepSeek · Model Scaling

0 likes · 18 min read

Architects' Tech Alliance

Feb 9, 2025 · Artificial Intelligence

How DeepSeek R1 Replicates OpenAI o1 Using Large‑Scale Reinforcement Learning

The article provides an in‑depth technical analysis of DeepSeek R1, explaining how it reproduces OpenAI o1's reasoning abilities through rule‑based large‑scale reinforcement learning, mixed SFT data, and efficient scaling, while discussing its broader impact on AI model development and capability density trends.

AI industry · Capability Density · DeepSeek

0 likes · 19 min read

DataFunSummit

Feb 5, 2025 · Artificial Intelligence

Exploration and Practice of Large‑Model Data Construction

This presentation details engineering‑focused approaches to building, mixing, and filtering data for large language models, covering data preparation, pre‑training mix strategies such as DoReMi, DoGE and online sampling, post‑training data quality selection methods, and practical Q&A on scaling laws and PDF processing.

AI · Data Mixing · Model Scaling

0 likes · 15 min read

Architect

Jan 29, 2025 · Artificial Intelligence

How Janus‑Pro Redefines Multimodal AI with Bigger Models and New Training Strategies

DeepSeek’s newly released Janus‑Pro series (1B and 7B) advances multimodal AI by decoupling visual understanding and generation, employing optimized three‑stage training, massive data expansion, and larger LLM backbones, achieving performance that matches or exceeds leading models from Meta, Google, OpenAI, and Stability AI.

DeepSeek · Janus-Pro · Model Scaling

0 likes · 6 min read

JavaEdge

Nov 20, 2024 · Artificial Intelligence

7 Proven Strategies to Simplify Large Language Model Deployment

The article explains why deploying large language models is challenging and presents seven practical techniques—including defining deployment boundaries, model quantization, inference optimization, infrastructure consolidation, model replacement planning, GPU utilization, and using smaller models—to make LLM deployment more efficient and cost‑effective.

GPU Optimization · LLM deployment · Model Scaling

0 likes · 24 min read

Baobao Algorithm Notes

Oct 17, 2024 · Artificial Intelligence

How Meta’s Movie Gen Pushes Text‑to‑Video Generation to New Heights

Meta’s newly released 92‑page Movie Gen paper introduces a multimodal LLM that unifies text‑to‑image, text‑to‑video, personalized video, precise video editing, and audio generation, detailing its dual‑model architecture, training pipeline, temporal auto‑encoder design, scaling strategies, evaluation benchmark, and ablation studies.

Deep Learning · Model Scaling · Video Generation

0 likes · 34 min read

Architect

May 5, 2024 · Artificial Intelligence

The Rise of Small Language Models (SLM) and Their Impact on AI Development

Amidst a growing trend that narrows performance gaps between large and small language models, researchers highlight the efficiency, adaptability, and specialized advantages of small language models (SLM), while also discussing the high costs, hallucinations, and security concerns that still challenge large‑scale LLMs.

AI efficiency · Edge Computing · LLM

0 likes · 9 min read

Baobao Algorithm Notes

Apr 27, 2024 · Artificial Intelligence

Qwen1.5-110B vs Llama‑3‑70B: Performance Insights of Alibaba’s 110B Model

Alibaba unveiled the 110‑billion‑parameter Qwen1.5‑110B model, featuring GQA, 32k context and multilingual support, and benchmark results show it matches or surpasses Llama‑3‑70B and Mixtral‑8x22B on a range of tasks, with notable gains in chat evaluations.

AI · LLM · Model Scaling

0 likes · 7 min read

Huawei Cloud Developer Alliance

Nov 3, 2023 · Artificial Intelligence

Can LLMs Master Lifelong Learning? Exploring MoE and Continuous Adaptation

This article explains how large language models can achieve continual lifelong learning, outlines the key properties required, reviews mixture‑of‑experts (MoE) techniques—including sparse MoE, GShard, Switch Transformer, GLaM and PanGu‑Sigma—and discusses the remaining challenges such as model complexity, expert balancing and distributed communication overhead.

LLM · Lifelong Learning · Mixture of Experts

0 likes · 9 min read

DaTaobao Tech

Sep 11, 2023 · Artificial Intelligence

Large Language Model Upgrade Paths and Architecture Selection

This article analyzes upgrade paths of major LLMs—ChatGLM, LLaMA, Baichuan—detailing performance, context length, and architectural changes, then examines essential capabilities, data cleaning, tokenizer and attention design, and offers practical guidance for balanced scaling and efficient model construction.

Baichuan · ChatGLM · LLM architecture

0 likes · 32 min read

Volcano Engine Developer Services

Sep 8, 2023 · Artificial Intelligence

Why Volcano Engine Says Multi‑Model Strategy Is the Future of AI

In this interview, Volcano Engine’s president Tan Dai explains how the rise of large models reshapes the AI landscape, why training thresholds now favor a pyramid of ultra‑large, medium‑size, and vertical models, and how a cloud‑first, multi‑model approach can address cost, security, and scalability challenges for enterprises.

AI strategy · Cost Optimization · Model Scaling

0 likes · 10 min read

Rare Earth Juejin Tech Community

Jul 24, 2023 · Artificial Intelligence

Comprehensive Survey of Large Language Models: History, Key Technologies, Resources, and Future Directions

This article provides a detailed overview of large language models (LLMs), tracing their evolution from statistical and neural language models to modern pre‑trained transformers, discussing scaling, training, adaptation, utilization, evaluation methods, available resources, and outlining current challenges and future research directions.

Model Scaling · Pre‑training · Prompt engineering

0 likes · 26 min read

DataFunTalk

May 31, 2023 · Artificial Intelligence

Why GPT Can Exhibit Intelligence Through Next‑Token Prediction: A Comprehensive Exploration of Compression, Knowledge Circuits, and Model Scaling

This article examines the debate over whether large language models truly possess intelligence, arguing that next‑token prediction functions as a form of lossless data compression whose efficiency reflects intelligence, and it surveys research on knowledge extraction, neuron semantics, circuit competition, scaling effects, and the broader philosophical implications of GPT as a mirror of the world’s parameters.

GPT · Model Scaling · Next Token Prediction

0 likes · 59 min read

Architect

Apr 19, 2023 · Artificial Intelligence

Emergence in Large Language Models: Phenomena, Explanations, and Implications

This article reviews the emergence phenomena observed in large language models, explains how model scale, in‑context learning and chain‑of‑thought prompting contribute to sudden performance gains, discusses small‑model alternatives, and explores the relationship between emergence and the training‑time Grokking effect.

AI research · Emergence · In-Context Learning

0 likes · 13 min read

Architect

Apr 14, 2023 · Artificial Intelligence

Overview of Prominent Large Language Models and Instruction Fine‑Tuning Techniques

The article surveys major large language models—including GPT‑3, T5, LaMDA, Jurassic‑1, MT‑NLG, Gopher, Chinchilla, PaLM, U‑PaLM, OPT, LLaMA, BLOOM, GLM‑130B, and ERNIE 3.0 Titan—explains their architectures, scaling trade‑offs, and then details instruction‑fine‑tuned variants such as T0, FLAN, GPT‑3.5, ChatGPT, GPT‑4, Alpaca and ChatGLM, providing references for further study.

AI · ChatGPT · GPT-3

0 likes · 27 min read

Tencent Cloud Developer

Mar 16, 2023 · Artificial Intelligence

What Makes GPT‑4 a Game‑Changer? 10 Expert Insights on Its Capabilities and Impact

This article provides a detailed analysis of GPT‑4, covering its multimodal abilities, performance gains, training innovations, safety improvements, new application scenarios, impact on developers, and future trends in large language models.

AI Safety · GPT-4 · LLM trends

0 likes · 16 min read

Architect

Feb 18, 2023 · Artificial Intelligence

Paradigm Shifts in Large Language Models: From Pre‑training to AGI and Future Research Directions

The article reviews the evolution of large language models, highlighting two major paradigm shifts after GPT‑3, the role of scaling laws, knowledge acquisition, prompting techniques, reasoning abilities, and outlines future research priorities for building more capable and efficient AI systems.

AI reasoning · In-Context Learning · Model Scaling

0 likes · 71 min read

DataFunTalk

Feb 10, 2023 · Artificial Intelligence

ChatGPT: A Revolutionary Breakthrough, Its Core Capabilities, and Impact on Investment Research

This article analyzes why ChatGPT represents a revolutionary advance in AI, explores its emergent abilities and code‑training advantages, evaluates its practical value for investment research through real‑world comparisons with experts, and discusses future trends and challenges for large language models.

AI · ChatGPT · Code Training

0 likes · 16 min read

DataFunTalk

Nov 22, 2022 · Artificial Intelligence

NVIDIA's Advances in Multi‑Role Generative Dialogue Modeling and Synthetic Data‑Driven QA

This article reviews NVIDIA's recent work on multi‑role generative dialogue modeling using GPT‑2‑based architectures and on enhancing question‑answering systems with synthetic data pipelines, covering model design, data preparation from Reddit, extensive experiments, scaling effects, and practical Q&A insights.

GPT-2 · Generative Dialogue · Model Scaling

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

Nov 4, 2022 · Artificial Intelligence

How AI Platforms Turn Dreams into Reality: Scaling, Efficiency, and Usability

In this talk from the 2022 Yunqi Conference, Jia Yangqing explains how Alibaba's AI platform addresses efficiency, scale, and usability challenges by moving the Damo Academy to the cloud, open‑sourcing ModelScope, and delivering large‑model training, deployment, and inference services at massive scale.

AI Engineering · Model Scaling · efficiency

0 likes · 10 min read
