Tagged articles

28 articles

May 17, 2026 · Artificial Intelligence

The Hidden Token Bill of AI Coding Agents: Why More Tokens Don’t Guarantee Better Results

An analysis of eight frontier coding agents shows that token consumption in agentic coding tasks is highly variable, often orders of magnitude higher than simple code reasoning, and that spending more tokens does not reliably improve accuracy, with significant differences across models and limited predictability of costs.

AI agents · coding agents · cost analysis

0 likes · 11 min read

Architects' Tech Alliance

May 8, 2026 · Artificial Intelligence

Token Fundamentals: A Technical Panorama of AI Language Units

Tokens are the smallest units of text that AI models process, representing characters, words, subwords, punctuation or emojis; context window size and generation speed are both measured in tokens, so tokenization directly affects a model's accuracy and efficiency, as explained in the 2026 Token Report.

AI fundamentals · Context Window · language models

0 likes · 4 min read

Machine Heart

Apr 27, 2026 · Artificial Intelligence

ACL 2026: Unveiling a Predictive Scaling Law for Reinforcement Learning Fine‑Tuning of Large Models

The paper presents a systematic empirical study that derives a power‑law scaling formula for reinforcement‑learning post‑training of large language models, demonstrating accurate inter‑ and intra‑model performance prediction, learning‑efficiency saturation, data‑reuse benefits, and cross‑architecture validity.

Data Reuse · Llama 3 · Qwen2.5

0 likes · 11 min read

AI Explorer

Apr 24, 2026 · Artificial Intelligence

Google’s ‘Banana’ Model Redefines Visual Transformers with Dynamic Sparse Attention

Google’s newly unveiled “Banana” visual Transformer introduces dynamic sparse attention that cuts inference cost 3‑5×, reduces memory by 70%, and improves ImageNet accuracy, while demonstrating real‑world gains in autonomous driving, medical imaging, and satellite analysis.

Computer Vision · Dynamic Sparse Attention · Google

0 likes · 6 min read

Data Party THU

Apr 3, 2026 · Artificial Intelligence

Can Attention Replace Residuals? Inside the New Attention Residuals Breakthrough

The article reviews the Kimi team's Attention Residuals approach, which replaces traditional ResNet additive shortcuts with learned attention‑based weighting. It explains the theoretical motivation linking depth to time, details full‑attention and block‑wise implementations, and presents experimental results showing up to 1.25× compute efficiency and improved performance on reasoning and knowledge tasks.

Attention Mechanism · Deep Learning · Residual Networks

0 likes · 11 min read

Data Party THU

Mar 31, 2026 · Artificial Intelligence

Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture

The STEM architecture replaces the Transformer feed‑forward network with a static token‑indexed embedding table, enabling lookup‑based memory that decouples capacity from compute, improves training stability, expands addressable memory, and delivers consistent performance gains on long‑context and knowledge‑intensive tasks.

Lookup Memory · STEM Architecture · Transformer

0 likes · 8 min read

AntTech

Mar 4, 2026 · Artificial Intelligence

Zooming Without Zooming: One‑Pass Fine‑Grained Vision for Multimodal LLMs

A new Region‑to‑Image Distillation (R2I) approach lets multimodal large language models perceive tiny visual details in a single forward pass, eliminating costly tool calls while achieving state‑of‑the‑art accuracy on the ZoomBench fine‑grained benchmark.

Multimodal AI · ZoomBench · fine-grained perception

0 likes · 11 min read

AI Explorer

Mar 2, 2026 · Artificial Intelligence

How Alec Radford’s New Anthropic Model Could Redefine Large‑Scale AI Training

Alec Radford’s latest Anthropic model, backed by a $1 billion funding round, claims significant performance gains through more efficient algorithms, challenging OpenAI and Google while pushing the AI field toward safer, more controllable large‑scale models.

AI Safety · AI industry · Alec Radford

0 likes · 5 min read

SuanNi

Mar 2, 2026 · Artificial Intelligence

Can Small Language Models Match Big AI with the Skills Framework?

A recent study from top universities examines how the Skills framework enables small language models to reduce memory usage, improve accuracy, and handle complex industrial tasks, revealing performance gaps across model sizes, dataset challenges, and code‑specialized variants while highlighting cost‑effective deployment strategies.

AI · Industrial AI · Skills Framework

0 likes · 8 min read

SuanNi

Feb 26, 2026 · Artificial Intelligence

How BitDance’s 2.6B‑Parameter Model Beats 14B Counterparts with 8.7× Speedup

BitDance’s new multimodal AI model achieves an 8.7‑fold inference acceleration using only 2.6 billion parameters, surpasses 14‑billion‑parameter state‑of‑the‑art architectures in image generation quality, and introduces binary visual tokens, a binary diffusion head, and next‑block diffusion for efficient parallel autoregressive prediction.

AI · Binary Tokenization · Vision Transformers

0 likes · 11 min read

PaperAgent

Jan 22, 2026 · Artificial Intelligence

How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.

Embedding Lookup · Interpretability · Mixture of Experts

0 likes · 6 min read

PaperAgent

Nov 29, 2025 · Industry Insights

NeurIPS 2025 Insights: AI Agents, Reasoning, and the Shift to Real-World Systems

An analysis of the 5,984 papers accepted at NeurIPS 2025 shows a decisive move from ever‑larger models toward agents, reasoning‑focused LLMs, efficiency engineering, AI for Science, and trustworthy AI, signaling the transition from a research‑toy era to an engineering‑driven AI ecosystem.

AI for Science · AI trends · LLM

0 likes · 7 min read

AntTech

Nov 8, 2025 · Artificial Intelligence

Ant Group’s AntBaiLing Model: Pushing AI Scaling Limits with Trillion‑Parameter Efficiency

Ant Group’s President Luo Ji outlined how the AntBaiLing suite, featuring trillion‑parameter open‑source models, three efficiency breakthroughs, and a domestic compute cluster, is advancing AGI research and inclusive applications, especially in healthcare, while emphasizing ethical, trustworthy AI.

AGI · large language models · model efficiency

0 likes · 5 min read

DataFunTalk

Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while cutting compute cost by up to 10×, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

On-Policy Distillation · knowledge distillation · model efficiency

0 likes · 15 min read

Data Party THU

Oct 21, 2025 · Artificial Intelligence

Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer

The paper presents a systematic scaling‑law study of the linear‑time xLSTM architecture versus quadratic‑time Transformers, evaluating parameter‑data loss surfaces, optimal model size under equal FLOP budgets, and inference latency components, and shows that xLSTM consistently offers better cost‑effectiveness across diverse contexts and budgets.

Inference Optimization · Linear Time Complexity · Transformer

0 likes · 11 min read

Data Party THU

Oct 20, 2025 · Artificial Intelligence

How Agentic RL Enables a 14B LLM to Outperform Giant Models – Inside rStar2‑Agent

This article analyzes the rStar2‑Agent paper, revealing how Agentic Reinforcement Learning, the GRPO‑RoC algorithm, a high‑throughput code‑execution service, and a three‑stage training recipe let a modest 14‑billion‑parameter model surpass much larger LLMs on challenging math benchmarks.

AI research · LLM · RL Optimization

0 likes · 18 min read

Kuaishou Tech

Jul 21, 2025 · Artificial Intelligence

Can AI Models Think on Demand? Inside KAT‑V1 AutoThink’s Dynamic Reasoning

The article introduces KAT‑V1 AutoThink, a dual‑mode large language model that automatically switches between thinking and non‑thinking modes based on problem difficulty, details its novel training paradigm, reinforcement‑learning enhancements, performance benchmarks against leading open‑source models, and provides open‑source resources for further research.

auto-think · knowledge distillation · large language model

0 likes · 14 min read

AIWalker

Jul 15, 2025 · Artificial Intelligence

Dynamic Vision Mamba: Re‑ordering Pruning and Adaptive Block Selection Cut FLOPs by 35.2%

This article presents Dynamic Vision Mamba (DyVM), a method that tackles token and block redundancy in Mamba‑based visual models through a novel re‑ordering pruning strategy and dynamic block selection, achieving a 35.2% FLOPs reduction with only a 1.7% accuracy loss while demonstrating strong generalization across tasks and architectures.

Computer Vision · Dynamic Block Selection · FLOPs Reduction

0 likes · 22 min read

AIWalker

Apr 16, 2025 · Artificial Intelligence

Plug‑and‑Play Multi‑Scale Attention: A Seamless Boost for Model Performance

This article reviews recent multi‑scale attention breakthroughs—including EMA, MSDA, VWA, and related modules—showing how they improve accuracy, cut FLOPs by up to 70%, and can be inserted into existing models with minimal effort, backed by code and paper links.

Computer Vision · Deep Learning · Plug-and-Play

0 likes · 10 min read

AIWalker

Mar 14, 2025 · Artificial Intelligence

Dynamic Tanh Lets He Kaiming and LeCun Drop Transformer Normalization in 9 Lines

Researchers He Kaiming, Yann LeCun and colleagues propose a 9‑line Dynamic Tanh (DyT) layer that replaces LayerNorm/RMSNorm in Transformers, showing comparable or superior accuracy across vision, language, speech and DNA tasks while also reducing inference latency on modern GPUs.

AI research · Deep Learning · Dynamic Tanh

0 likes · 18 min read

Data Thinking Notes

Mar 9, 2025 · Artificial Intelligence

How DeepSeek R1 Uses Large‑Scale Reinforcement Learning to Rival OpenAI o1

DeepSeek R1, an open‑source large language model, leverages rule‑based, large‑scale reinforcement learning and mixed supervised‑fine‑tuning data to achieve deep reasoning comparable to OpenAI o1, illustrating China’s rapid AI progress, the importance of efficiency, and the democratizing impact of open AI research.

DeepSeek · model efficiency · open-source AI

0 likes · 11 min read

ZhongAn Tech Team

Feb 22, 2025 · Artificial Intelligence

How SkyReels, DeepSeek NSA, Grok‑3, and KG²RAG Are Shaping the Next AI Wave

This issue reviews China's first open‑source short‑film model SkyReels‑V1, DeepSeek's Native Sparse Attention breakthrough, xAI's massive Grok‑3 deployment on 200k H100 GPUs, and a knowledge‑graph‑guided RAG framework, highlighting their performance gains, architectural innovations, and industry impact.

AI · Knowledge Graph · RAG

0 likes · 15 min read

Architect

Feb 19, 2025 · Artificial Intelligence

Does Scaling Law Still Hold for Grok 3? A Deep Dive into LLM Training Economics

The article critically examines whether the pre‑training Scaling Law still applies to Grok 3, compares its compute usage and model size with DeepSeek and OpenAI models, evaluates the cost‑effectiveness of pre‑training, RL and test‑time scaling, and explores how these insights shape future large‑language‑model development strategies.

Grok-3 · Pre‑training · RL scaling

0 likes · 11 min read

Baobao Algorithm Notes

Jan 7, 2025 · Artificial Intelligence

How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%

This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.

AI Performance · DeepSeek · MFU

0 likes · 8 min read

Baidu Tech Salon

Jun 14, 2024 · Artificial Intelligence

Why Large Models Signal the Dawn of General AI: Insights from Baidu’s CTO

In a keynote at the 2024 Beijing Zhiyuan Conference, Baidu’s CTO Wang Haifeng explained how large‑model universality and comprehensive capabilities are driving artificial general intelligence forward, highlighting scaling laws, multimodal advances, agent technologies, and the industrial‑scale production of AI.

AI industrialization · AI trends · Deep Learning

0 likes · 7 min read

DataFunSummit

Mar 22, 2024 · Artificial Intelligence

Multi‑Layer Efficiency Challenges and Emerging Paradigms for Large Language Models

The article discusses how large AI models are moving toward a unified architecture that reduces task‑algorithm coupling, outlines the multi‑layer efficiency challenges—from model sparsity and quantization to software and infrastructure optimization—and highlights recent NVIDIA GTC 2024 and China AI Day events with registration details.

China AI Day · NVIDIA GTC · model efficiency

0 likes · 12 min read

Alimama Tech

Dec 21, 2022 · Artificial Intelligence

Adaptive Parameter Generation Network for Click-Through Rate Prediction

Adaptive Parameter Generation Network (APG) dynamically creates sample‑specific model parameters for click‑through‑rate prediction using low‑rank factorization, parameter sharing, and over‑parameterization, achieving up to 0.2% AUC improvement, 3% CTR lift, and up to 96.6% storage reduction with faster inference.

CTR prediction · Deep Learning · adaptive parameter generation

0 likes · 14 min read

JD Cloud Developers

Aug 15, 2022 · Artificial Intelligence

How FCA Doubles BERT’s Inference Speed with Less Than 1% Accuracy Loss

This article explains how the Fine‑ and Coarse‑Granularity Hybrid Self‑Attention (FCA) mechanism reduces BERT’s computational cost by over 50% while keeping accuracy loss under 1%, detailing the method, experimental results, and its significance for efficient large‑scale language models.

BERT · Deep Learning · FCA

0 likes · 8 min read
