Tagged articles

Model Efficiency

32 articles · Page 1 of 1
Machine Learning Algorithms & Natural Language Processing
Jun 21, 2026 · Artificial Intelligence

xOPD Evolution: Mapping Recent OPD Improvements – Rephrased Same Problems vs. New Modules

This article surveys the latest on‑policy distillation (OPD) research, categorizing each work as either a reinterpretation of an existing problem or a modification of a different module, and highlights the experimental findings, design choices, and trade‑offs reported across the papers.

LLM · Model Efficiency · OPD
0 likes · 31 min read
SuanNi
May 28, 2026 · Artificial Intelligence

How a 3.8B Model Beats 6B+ Models Using Just 20% of the Compute – Inside Microsoft Lens

Microsoft’s Lens team shows that a 3.8B‑parameter image‑generation model can match or surpass 6B‑plus models while consuming only about 19% of the GPU compute, thanks to aggressive model compression, dense captioning, mixed‑resolution training, optimized VAE and language encoders, and targeted RL fine‑tuning.

Benchmarking · Model Efficiency · dense captioning
0 likes · 14 min read
Machine Learning Algorithms & Natural Language Processing
May 25, 2026 · Artificial Intelligence

Next-ToBE: Enabling Overconfident LLMs to See Further and Reason More Accurately

The ICLR 2026 paper introduces Next‑ToBE, a training‑objective modification that replaces the one‑hot next‑token label with a soft distribution over a future token window, unlocking latent foresight in LLMs, improving future‑token hit rate and downstream reasoning performance, and reducing training memory and time.

Future Token Prediction · Large Language Models · Model Efficiency
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
May 21, 2026 · Artificial Intelligence

Can a New Training Objective Make LLMs See Further and Reason Better?

The paper introduces Next‑ToBE, a training‑objective modification that replaces the one‑hot next‑token label with a soft distribution covering a future token window, thereby activating latent anticipatory capacity in large language models and yielding significant gains in token‑hit rates, reasoning accuracy, and training efficiency.

Anticipatory Capacity · Large Language Models · Model Efficiency
0 likes · 11 min read
Machine Heart
May 17, 2026 · Artificial Intelligence

The Hidden Token Bill of AI Coding Agents: Why More Tokens Don’t Guarantee Better Results

An analysis of eight frontier coding agents shows that token consumption in agentic coding tasks is highly variable, often orders of magnitude higher than simple code reasoning, and that spending more tokens does not reliably improve accuracy, with significant differences across models and limited predictability of costs.

AI agents · Model Efficiency · Token consumption
0 likes · 11 min read
Architects' Tech Alliance
May 8, 2026 · Artificial Intelligence

Token Fundamentals: A Technical Panorama of AI Language Units

Tokens are the smallest language building blocks that AI models process, representing characters, words, subwords, punctuation or emojis; they determine context window size and generation speed, so tokenization directly impacts model understanding accuracy and efficiency, as explained in the 2026 Token Report.

AI Fundamentals · Language Models · Model Efficiency
0 likes · 4 min read
Machine Heart
Apr 27, 2026 · Artificial Intelligence

ACL 2026: Unveiling a Predictive Scaling Law for Reinforcement Learning Fine‑Tuning of Large Models

The paper presents a systematic empirical study that derives a power‑law scaling formula for reinforcement‑learning post‑training of large language models, demonstrating accurate inter‑ and intra‑model performance prediction, learning‑efficiency saturation, data‑reuse benefits, and cross‑architecture validity.

Data Reuse · Large Language Models · Llama 3
0 likes · 11 min read
Data Party THU
Apr 3, 2026 · Artificial Intelligence

Can Attention Replace Residuals? Inside the New Attention Residuals Breakthrough

The article reviews the Kimi team's Attention Residuals approach, which substitutes traditional ResNet additive shortcuts with learned attention‑based weighting, explains the theoretical motivation linking depth to time, details full‑attention and block‑wise implementations, and presents experimental results showing up to 1.25× compute efficiency and improved performance on reasoning and knowledge tasks.

Attention Mechanism · Deep Learning · Model Efficiency
0 likes · 11 min read
Data Party THU
Mar 31, 2026 · Artificial Intelligence

Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture

The STEM architecture replaces the Transformer feed‑forward network with a static token‑indexed embedding table, enabling lookup‑based memory that decouples capacity from compute, improves training stability, expands addressable memory, and delivers consistent performance gains on long‑context and knowledge‑intensive tasks.

Lookup Memory · Model Efficiency · STEM Architecture
0 likes · 8 min read
AntTech
Mar 4, 2026 · Artificial Intelligence

Zooming Without Zooming: One‑Pass Fine‑Grained Vision for Multimodal LLMs

A new Region‑to‑Image Distillation (R2I) approach lets multimodal large language models perceive tiny visual details in a single forward pass, eliminating costly tool calls while achieving state‑of‑the‑art accuracy on the ZoomBench fine‑grained benchmark.

Large Language Models · Model Efficiency · Multimodal AI
0 likes · 11 min read
SuanNi
Mar 2, 2026 · Artificial Intelligence

Can Small Language Models Match Big AI with the Skills Framework?

A recent study from top universities examines how the Skills framework enables small language models to reduce memory usage, improve accuracy, and handle complex industrial tasks, revealing performance gaps across model sizes, dataset challenges, and code‑specialized variants while highlighting cost‑effective deployment strategies.

AI · Model Efficiency · Skills Framework
0 likes · 8 min read
SuanNi
Feb 26, 2026 · Artificial Intelligence

How BitDance’s 2.6B‑Parameter Model Beats 14B Counterparts with 8.7× Speedup

BitDance’s new multimodal AI model achieves an 8.7‑fold inference acceleration using only 2.6 billion parameters, surpasses 14‑billion‑parameter state‑of‑the‑art architectures in image generation quality, and introduces binary visual tokens, a binary diffusion head, and next‑block diffusion for efficient parallel autoregressive prediction.

AI · Binary Tokenization · Model Efficiency
0 likes · 11 min read
PaperAgent
Jan 22, 2026 · Artificial Intelligence

How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.

Embedding Lookup · Mixture of Experts · Model Efficiency
0 likes · 6 min read
PaperAgent
Nov 29, 2025 · Industry Insights

NeurIPS 2025 Insights: AI Agents, Reasoning, and the Shift to Real-World Systems

An analysis of the 5,984 papers accepted at NeurIPS 2025 shows a decisive move from ever‑larger models toward agents, reasoning‑focused LLMs, efficiency engineering, AI for Science, and trustworthy AI, signaling the transition from a research‑toy era to an engineering‑driven AI ecosystem.

AI for Science · AI trends · Agents
0 likes · 7 min read
AntTech
Nov 8, 2025 · Artificial Intelligence

Ant Group’s AntBaiLing Model: Pushing AI Scaling Limits with Trillion‑Parameter Efficiency

Ant Group’s President Luo Ji outlined how the AntBaiLing suite, featuring trillion‑parameter open‑source models, three efficiency breakthroughs, and a domestic compute cluster, is advancing AGI research and inclusive applications, especially in healthcare, while emphasizing ethical, trustworthy AI.

AGI · Large Language Models · Model Efficiency
0 likes · 5 min read
DataFunTalk
Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while reducing compute cost by up to tenfold, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

Knowledge Distillation · Model Efficiency · On‑Policy Distillation
0 likes · 15 min read
Data Party THU
Oct 21, 2025 · Artificial Intelligence

Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer

The paper presents a systematic scaling‑law study of the linear‑time xLSTM architecture versus quadratic‑time Transformers, evaluating parameter‑data loss surfaces, optimal model size under equal FLOP budgets, and inference latency components, and shows that xLSTM consistently offers better cost‑effectiveness across diverse contexts and budgets.

Inference Optimization · Linear Time Complexity · Model Efficiency
0 likes · 11 min read
Data Party THU
Oct 20, 2025 · Artificial Intelligence

How Agentic RL Enables a 14B LLM to Outperform Giant Models – Inside rStar2‑Agent

This article analyzes the rStar2‑Agent paper, revealing how Agentic Reinforcement Learning, the GRPO‑RoC algorithm, a high‑throughput code‑execution service, and a three‑stage training recipe let a modest 14‑billion‑parameter model surpass much larger LLMs on challenging math benchmarks.

AI research · Artificial Intelligence · LLM
0 likes · 18 min read
Kuaishou Tech
Jul 21, 2025 · Artificial Intelligence

Can AI Models Think on Demand? Inside KAT‑V1 AutoThink’s Dynamic Reasoning

The article introduces KAT‑V1 AutoThink, a dual‑mode large language model that automatically switches between thinking and non‑thinking modes based on problem difficulty, details its novel training paradigm, reinforcement‑learning enhancements, performance benchmarks against leading open‑source models, and provides open‑source resources for further research.

Knowledge Distillation · Large Language Model · Model Efficiency
0 likes · 14 min read
AIWalker
Jul 15, 2025 · Artificial Intelligence

Dynamic Vision Mamba: Re‑ordering Pruning and Adaptive Block Selection Cut FLOPs by 35.2%

This article presents Dynamic Vision Mamba (DyVM), a method that tackles token and block redundancy in Mamba‑based visual models through a novel re‑ordering pruning strategy and dynamic block selection, achieving a 35.2% FLOPs reduction with only a 1.7% accuracy loss while demonstrating strong generalization across tasks and architectures.

Dynamic Block Selection · FLOPs Reduction · Model Efficiency
0 likes · 22 min read
AIWalker
Apr 16, 2025 · Artificial Intelligence

Plug‑and‑Play Multi‑Scale Attention: A Seamless Boost for Model Performance

This article reviews recent multi‑scale attention breakthroughs—including EMA, MSDA, VWA, and related modules—showing how they improve accuracy, cut FLOPs by up to 70%, and can be inserted into existing models with minimal effort, backed by code and paper links.

Deep Learning · Model Efficiency · Plug-and-Play
0 likes · 10 min read
AIWalker
Mar 14, 2025 · Artificial Intelligence

Dynamic Tanh Lets He Kaiming and LeCun Drop Transformer Normalization in 9 Lines

Researchers He Kaiming, Yann LeCun and colleagues propose a 9‑line Dynamic Tanh (DyT) layer that replaces LayerNorm/RMSNorm in Transformers, showing comparable or superior accuracy across vision, language, speech and DNA tasks while also reducing inference latency on modern GPUs.

AI research · Deep Learning · Dynamic Tanh
0 likes · 18 min read
Data Thinking Notes
Mar 9, 2025 · Artificial Intelligence

How DeepSeek R1 Uses Large‑Scale Reinforcement Learning to Rival OpenAI o1

DeepSeek R1, an open‑source large language model, leverages rule‑based, large‑scale reinforcement learning and mixed supervised‑fine‑tuning data to achieve deep reasoning comparable to OpenAI o1, illustrating China’s rapid AI progress, the importance of efficiency, and the democratizing impact of open AI research.

DeepSeek · Model Efficiency · Open-source AI
0 likes · 11 min read
ZhongAn Tech Team
Feb 22, 2025 · Artificial Intelligence

How SkyReels, DeepSeek NSA, Grok‑3, and KG²RAG Are Shaping the Next AI Wave

This issue reviews China's first open‑source short‑film model SkyReels‑V1, DeepSeek's Native Sparse Attention breakthrough, xAI's massive Grok‑3 deployment on 200k H100 GPUs, and a knowledge‑graph‑guided RAG framework, highlighting their performance gains, architectural innovations, and industry impact.

AI · Industry Trends · Knowledge Graph
0 likes · 15 min read
Architect
Feb 19, 2025 · Artificial Intelligence

Does Scaling Law Still Hold for Grok 3? A Deep Dive into LLM Training Economics

The article critically examines whether the pre‑training Scaling Law still applies to Grok 3, compares its compute usage and model size with DeepSeek and OpenAI models, evaluates the cost‑effectiveness of pre‑training, RL and test‑time scaling, and explores how these insights shape future large‑language‑model development strategies.

Grok 3 · Large Language Models · Model Efficiency
0 likes · 11 min read
Baobao Algorithm Notes
Jan 7, 2025 · Artificial Intelligence

How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%

This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.

AI performance · DeepSeek · Large Language Model
0 likes · 8 min read
Baidu Tech Salon
Jun 14, 2024 · Artificial Intelligence

Why Large Models Signal the Dawn of General AI: Insights from Baidu’s CTO

In a keynote at the 2024 Beijing Zhiyuan Conference, Baidu’s CTO Wang Haifeng explained how large‑model universality and comprehensive capabilities are driving artificial general intelligence forward, highlighting scaling laws, multimodal advances, agent technologies, and the industrial‑scale production of AI.

AI industrialization · AI trends · Deep Learning
0 likes · 7 min read
DataFunSummit
Mar 22, 2024 · Artificial Intelligence

Multi‑Layer Efficiency Challenges and Emerging Paradigms for Large Language Models

The article discusses how large AI models are moving toward a unified architecture that reduces task‑algorithm coupling, outlines the multi‑layer efficiency challenges—from model sparsity and quantization to software and infrastructure optimization—and highlights recent NVIDIA GTC 2024 and China AI Day events with registration details.

China AI Day · Model Efficiency · NVIDIA GTC
0 likes · 12 min read
Alimama Tech
Dec 21, 2022 · Artificial Intelligence

Adaptive Parameter Generation Network for Click-Through Rate Prediction

Adaptive Parameter Generation Network (APG) dynamically creates sample‑specific model parameters for click‑through‑rate prediction using low‑rank factorization, parameter sharing, and over‑parameterization, achieving up to 0.2% AUC improvement, 3% CTR lift, and up to 96.6% storage reduction with faster inference.

CTR Prediction · Deep Learning · Model Efficiency
0 likes · 14 min read
JD Cloud Developers
Aug 15, 2022 · Artificial Intelligence

How FCA Doubles BERT’s Inference Speed with Less Than 1% Accuracy Loss

This article explains how the Fine‑ and Coarse‑Granularity Hybrid Self‑Attention (FCA) mechanism reduces BERT’s computational cost by over 50% while keeping accuracy loss under 1%, detailing the method, experimental results, and its significance for efficient large‑scale language models.

BERT · Deep Learning · FCA
0 likes · 8 min read