Tagged articles
28 articles
Page 1 of 1
Machine Heart
Machine Heart
May 17, 2026 · Artificial Intelligence

The Hidden Token Bill of AI Coding Agents: Why More Tokens Don’t Guarantee Better Results

An analysis of eight frontier coding agents shows that token consumption in agentic coding tasks is highly variable, often orders of magnitude higher than simple code reasoning, and that spending more tokens does not reliably improve accuracy, with significant differences across models and limited predictability of costs.

AI agentscoding agentscost analysis
0 likes · 11 min read
The Hidden Token Bill of AI Coding Agents: Why More Tokens Don’t Guarantee Better Results
Architects' Tech Alliance
Architects' Tech Alliance
May 8, 2026 · Artificial Intelligence

Token Fundamentals: A Technical Panorama of AI Language Units

Tokens are the smallest language building blocks that AI models process, representing characters, words, subwords, punctuation or emojis; they determine context window size and generation speed, so tokenization directly impacts model understanding accuracy and efficiency, as explained in the 2026 Token Report.

AI fundamentalsContext Windowlanguage models
0 likes · 4 min read
Token Fundamentals: A Technical Panorama of AI Language Units
Machine Heart
Machine Heart
Apr 27, 2026 · Artificial Intelligence

ACL 2026: Unveiling a Predictive Scaling Law for Reinforcement Learning Fine‑Tuning of Large Models

The paper presents a systematic empirical study that derives a power‑law scaling formula for reinforcement‑learning‑after‑training of large language models, demonstrating accurate inter‑ and intra‑model performance prediction, learning‑efficiency saturation, data‑reuse benefits, and cross‑architecture validity.

Data ReuseLlama 3Qwen2.5
0 likes · 11 min read
ACL 2026: Unveiling a Predictive Scaling Law for Reinforcement Learning Fine‑Tuning of Large Models
Data Party THU
Data Party THU
Apr 3, 2026 · Artificial Intelligence

Can Attention Replace Residuals? Inside the New Attention Residuals Breakthrough

The article reviews the Kimi team's Attention Residuals approach, which substitutes traditional ResNet additive shortcuts with learned attention‑based weighting, explains the theoretical motivation linking depth to time, details full‑attention and block‑wise implementations, presents experimental results showing up to 1.25× compute efficiency and improved performance on reasoning and knowledge tasks.

Attention MechanismDeep LearningResidual Networks
0 likes · 11 min read
Can Attention Replace Residuals? Inside the New Attention Residuals Breakthrough
Data Party THU
Data Party THU
Mar 31, 2026 · Artificial Intelligence

Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture

The STEM architecture replaces the Transformer feed‑forward network with a static token‑indexed embedding table, enabling lookup‑based memory that decouples capacity from compute, improves training stability, expands addressable memory, and delivers consistent performance gains on long‑context and knowledge‑intensive tasks.

Lookup MemorySTEM ArchitectureTransformer
0 likes · 8 min read
Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture
AntTech
AntTech
Mar 4, 2026 · Artificial Intelligence

Zooming Without Zooming: One‑Pass Fine‑Grained Vision for Multimodal LLMs

A new Region‑to‑Image Distillation (R2I) approach lets multimodal large language models perceive tiny visual details in a single forward pass, eliminating costly tool calls while achieving state‑of‑the‑art accuracy on the ZoomBench fine‑grained benchmark.

Multimodal AIZoomBenchfine-grained perception
0 likes · 11 min read
Zooming Without Zooming: One‑Pass Fine‑Grained Vision for Multimodal LLMs
SuanNi
SuanNi
Mar 2, 2026 · Artificial Intelligence

Can Small Language Models Match Big AI with the Skills Framework?

A recent study from top universities examines how the Skills framework enables small language models to reduce memory usage, improve accuracy, and handle complex industrial tasks, revealing performance gaps across model sizes, dataset challenges, and code‑specialized variants while highlighting cost‑effective deployment strategies.

AIIndustrial AISkills Framework
0 likes · 8 min read
Can Small Language Models Match Big AI with the Skills Framework?
SuanNi
SuanNi
Feb 26, 2026 · Artificial Intelligence

How BitDance’s 2.6B‑Parameter Model Beats 14B Counterparts with 8.7× Speedup

BitDance’s new multimodal AI model achieves an 8.7‑fold inference acceleration using only 2.6 billion parameters, surpasses 14‑billion‑parameter state‑of‑the‑art architectures in image generation quality, and introduces binary visual tokens, a binary diffusion head, and next‑block diffusion for efficient parallel autoregressive prediction.

AIBinary TokenizationVision Transformers
0 likes · 11 min read
How BitDance’s 2.6B‑Parameter Model Beats 14B Counterparts with 8.7× Speedup
PaperAgent
PaperAgent
Jan 22, 2026 · Artificial Intelligence

How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.

Embedding LookupInterpretabilityMixture of Experts
0 likes · 6 min read
How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers
PaperAgent
PaperAgent
Nov 29, 2025 · Industry Insights

NeurIPS 2025 Insights: AI Agents, Reasoning, and the Shift to Real-World Systems

An analysis of the 5,984 papers accepted at NeurIPS 2025 shows a decisive move from ever‑larger models toward agents, reasoning‑focused LLMs, efficiency engineering, AI for Science, and trustworthy AI, signaling the transition from a research‑toy era to an engineering‑driven AI ecosystem.

AI for ScienceAI trendsLLM
0 likes · 7 min read
NeurIPS 2025 Insights: AI Agents, Reasoning, and the Shift to Real-World Systems
AntTech
AntTech
Nov 8, 2025 · Artificial Intelligence

Ant Group’s AntBaiLing Model: Pushing AI Scaling Limits with Trillion‑Parameter Efficiency

Ant Group’s President Luo Ji outlined how the AntBaiLing suite, featuring trillion‑parameter open‑source models, three efficiency breakthroughs, and a domestic compute cluster, is advancing AGI research and inclusive applications, especially in healthcare, while emphasizing ethical, trustworthy AI.

AGIlarge language modelsmodel efficiency
0 likes · 5 min read
Ant Group’s AntBaiLing Model: Pushing AI Scaling Limits with Trillion‑Parameter Efficiency
DataFunTalk
DataFunTalk
Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while reducing compute cost by up to tenfold, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

On-Policy Distillationknowledge distillationmodel efficiency
0 likes · 15 min read
How On-Policy Distillation Cuts LLM Training Cost by 90%
Data Party THU
Data Party THU
Oct 21, 2025 · Artificial Intelligence

Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer

The paper presents a systematic scaling‑law study of the linear‑time xLSTM architecture versus quadratic‑time Transformers, evaluating parameter‑data loss surfaces, optimal model size under equal FLOP budgets, and inference latency components, and shows that xLSTM consistently offers better cost‑effectiveness across diverse contexts and budgets.

Inference OptimizationLinear Time ComplexityTransformer
0 likes · 11 min read
Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer
Kuaishou Tech
Kuaishou Tech
Jul 21, 2025 · Artificial Intelligence

Can AI Models Think on Demand? Inside KAT‑V1 AutoThink’s Dynamic Reasoning

The article introduces KAT‑V1 AutoThink, a dual‑mode large language model that automatically switches between thinking and non‑thinking modes based on problem difficulty, details its novel training paradigm, reinforcement‑learning enhancements, performance benchmarks against leading open‑source models, and provides open‑source resources for further research.

auto-thinkknowledge distillationlarge language model
0 likes · 14 min read
Can AI Models Think on Demand? Inside KAT‑V1 AutoThink’s Dynamic Reasoning
AIWalker
AIWalker
Jul 15, 2025 · Artificial Intelligence

Dynamic Vision Mamba: Re‑ordering Pruning and Adaptive Block Selection Cut FLOPs by 35.2%

This article presents Dynamic Vision Mamba (DyVM), a method that tackles token and block redundancy in Mamba‑based visual models through a novel re‑ordering pruning strategy and dynamic block selection, achieving a 35.2% FLOPs reduction with only a 1.7% accuracy loss while demonstrating strong generalization across tasks and architectures.

Computer VisionDynamic Block SelectionFLOPs Reduction
0 likes · 22 min read
Dynamic Vision Mamba: Re‑ordering Pruning and Adaptive Block Selection Cut FLOPs by 35.2%
AIWalker
AIWalker
Apr 16, 2025 · Artificial Intelligence

Plug‑and‑Play Multi‑Scale Attention: A Seamless Boost for Model Performance

This article reviews recent multi‑scale attention breakthroughs—including EMA, MSDA, VWA, and related modules—showing how they improve accuracy, cut FLOPs by up to 70%, and can be inserted into existing models with minimal effort, backed by code and paper links.

Computer VisionDeep LearningPlug-and-Play
0 likes · 10 min read
Plug‑and‑Play Multi‑Scale Attention: A Seamless Boost for Model Performance
AIWalker
AIWalker
Mar 14, 2025 · Artificial Intelligence

Dynamic Tanh Lets He Kaiming and LeCun Drop Transformer Normalization in 9 Lines

Researchers He Kaiming, Yann LeCun and colleagues propose a 9‑line Dynamic Tanh (DyT) layer that replaces LayerNorm/RMSNorm in Transformers, showing comparable or superior accuracy across vision, language, speech and DNA tasks while also reducing inference latency on modern GPUs.

AI researchDeep LearningDynamic Tanh
0 likes · 18 min read
Dynamic Tanh Lets He Kaiming and LeCun Drop Transformer Normalization in 9 Lines
Data Thinking Notes
Data Thinking Notes
Mar 9, 2025 · Artificial Intelligence

How DeepSeek R1 Uses Large‑Scale Reinforcement Learning to Rival OpenAI o1

DeepSeek R1, an open‑source large language model, leverages rule‑based, large‑scale reinforcement learning and mixed supervised‑fine‑tuning data to achieve deep reasoning comparable to OpenAI o1, illustrating China’s rapid AI progress, the importance of efficiency, and the democratizing impact of open AI research.

DeepSeekmodel efficiencyopen-source AI
0 likes · 11 min read
How DeepSeek R1 Uses Large‑Scale Reinforcement Learning to Rival OpenAI o1
ZhongAn Tech Team
ZhongAn Tech Team
Feb 22, 2025 · Artificial Intelligence

How SkyReels, DeepSeek NSA, Grok‑3, and KG²RAG Are Shaping the Next AI Wave

This issue reviews China's first open‑source short‑film model SkyReels‑V1, DeepSeek's Native Sparse Attention breakthrough, xAI's massive Grok‑3 deployment on 200k H100 GPUs, and a knowledge‑graph‑guided RAG framework, highlighting their performance gains, architectural innovations, and industry impact.

AIKnowledge GraphRAG
0 likes · 15 min read
How SkyReels, DeepSeek NSA, Grok‑3, and KG²RAG Are Shaping the Next AI Wave
Architect
Architect
Feb 19, 2025 · Artificial Intelligence

Does Scaling Law Still Hold for Grok 3? A Deep Dive into LLM Training Economics

The article critically examines whether the pre‑training Scaling Law still applies to Grok 3, compares its compute usage and model size with DeepSeek and OpenAI models, evaluates the cost‑effectiveness of pre‑training, RL and test‑time scaling, and explores how these insights shape future large‑language‑model development strategies.

Grok-3Pre‑trainingRL scaling
0 likes · 11 min read
Does Scaling Law Still Hold for Grok 3? A Deep Dive into LLM Training Economics
Baobao Algorithm Notes
Baobao Algorithm Notes
Jan 7, 2025 · Artificial Intelligence

How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%

This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.

AI PerformanceDeepSeekMFU
0 likes · 8 min read
How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%
Baidu Tech Salon
Baidu Tech Salon
Jun 14, 2024 · Artificial Intelligence

Why Large Models Signal the Dawn of General AI: Insights from Baidu’s CTO

In a keynote at the 2024 Beijing Zhiyuan Conference, Baidu’s CTO Wang Haifeng explained how large‑model universality and comprehensive capabilities are driving artificial general intelligence forward, highlighting scale laws, multimodal advances, agent technologies, and the industrial‑scale production of AI.

AI industrializationAI trendsDeep Learning
0 likes · 7 min read
Why Large Models Signal the Dawn of General AI: Insights from Baidu’s CTO
DataFunSummit
DataFunSummit
Mar 22, 2024 · Artificial Intelligence

Multi‑Layer Efficiency Challenges and Emerging Paradigms for Large Language Models

The article discusses how large AI models are moving toward a unified architecture that reduces task‑algorithm coupling, outlines the multi‑layer efficiency challenges—from model sparsity and quantization to software and infrastructure optimization—and highlights recent NVIDIA GTC 2024 and China AI Day events with registration details.

China AI DayNVIDIA GTCmodel efficiency
0 likes · 12 min read
Multi‑Layer Efficiency Challenges and Emerging Paradigms for Large Language Models
Alimama Tech
Alimama Tech
Dec 21, 2022 · Artificial Intelligence

Adaptive Parameter Generation Network for Click-Through Rate Prediction

Adaptive Parameter Generation Network (APG) dynamically creates sample‑specific model parameters for click‑through‑rate prediction using low‑rank factorization, parameter sharing, and over‑parameterization, achieving up to 0.2% AUC improvement, 3% CTR lift, and up to 96.6% storage reduction with faster inference.

CTR predictionDeep Learningadaptive parameter generation
0 likes · 14 min read
Adaptive Parameter Generation Network for Click-Through Rate Prediction
JD Cloud Developers
JD Cloud Developers
Aug 15, 2022 · Artificial Intelligence

How FCA Doubles BERT’s Inference Speed with Less Than 1% Accuracy Loss

This article explains how the Fine‑ and Coarse‑Granularity Hybrid Self‑Attention (FCA) mechanism reduces BERT’s computational cost by over 50% while keeping accuracy loss under 1%, detailing the method, experimental results, and its significance for efficient large‑scale language models.

BERTDeep LearningFCA
0 likes · 8 min read
How FCA Doubles BERT’s Inference Speed with Less Than 1% Accuracy Loss