Machine Heart
Apr 26, 2026 · Artificial Intelligence

Balanced Thinking: Boost LLM Accuracy by 10% While Cutting Inference Length 35%

The paper introduces ReBalance, a training‑free two‑stage inference control framework that uses model confidence signals to dynamically balance reasoning depth, achieving up to a 10‑point accuracy gain and a 35.4% reduction in token length across multiple LLM sizes and benchmarks.
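
The teaser doesn't spell out the framework, but its core mechanic, gating reasoning depth on a confidence signal, can be sketched. A minimal illustration in Python, where the confidence measure (mean token probability per step), the stopping threshold, and the step budget are all assumptions rather than the paper's actual values:

```python
import math

# Hypothetical sketch of confidence-gated reasoning depth control. In
# practice step texts and token log-probs would come from an LLM's
# incremental decoding API; here the model is a toy stub.

CONF_STOP = 0.85   # confidence above which reasoning stops early (assumed)
MAX_STEPS = 8      # hard budget on reasoning steps (assumed)

def step_confidence(token_logprobs):
    """Mean token probability of one reasoning step, in [0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def balanced_decode(generate_step):
    """Stage 1: keep reasoning while confidence is low.
    Stage 2: once confident (or out of budget), hand off to the answer."""
    steps = []
    for _ in range(MAX_STEPS):
        text, logprobs = generate_step(steps)    # one reasoning step
        steps.append(text)
        if step_confidence(logprobs) >= CONF_STOP:
            break                                # confident: cut the chain short
    return steps

# Toy stand-in for an LLM: confidence rises as steps accumulate.
def toy_step(steps):
    conf = min(0.95, 0.5 + 0.15 * len(steps))
    return f"step {len(steps) + 1}", [math.log(conf)] * 4

print(balanced_decode(toy_step))   # stops after 4 steps instead of 8
```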

Balanced Thinking · Confidence Steering · Efficient Inference
9 min read
Machine Learning Algorithms & Natural Language Processing
Apr 16, 2026 · Artificial Intelligence

Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward‑shaping techniques can shorten the chain‑of‑thought outputs of Qwen 30B‑series models by 20‑40% while preserving or slightly improving performance on AIME‑25 and out‑of‑distribution benchmarks, and details the experimental design, strategic considerations, and practical insights behind this efficient‑reasoning approach.
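
As a rough illustration of the recipe (not the article's exact shaping function), one common design rewards correctness first and applies length pressure only to answers that are already correct, so the policy is never pushed to be short at the expense of accuracy. The budget and weight below are assumed:

```python
# Hedged sketch of a length-aware shaped reward for RL fine-tuning.

def shaped_reward(correct: bool, n_tokens: int,
                  budget: int = 4096, alpha: float = 0.2) -> float:
    if not correct:
        return 0.0                           # no credit, and no length pressure
    frac = min(1.0, n_tokens / budget)       # fraction of the token budget spent
    return 1.0 - alpha * frac                # correct and concise scores highest

print(shaped_reward(True, 1024))    # 0.95
print(shaped_reward(True, 4096))    # 0.8
print(shaped_reward(False, 512))    # 0.0
```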

Efficient Inference · Qwen · Reward Shaping
16 min read
Machine Heart
Apr 12, 2026 · Artificial Intelligence

LRT: Implicit Reasoning Chains Boost Speed and Accuracy by Removing Redundant Steps

Researchers introduce Latent Reasoning Tuning (LRT), a lightweight inference network that encodes explicit reasoning chains into fixed‑length latent vectors, eliminating thousands of decoding steps; experiments reveal substantial redundancy in traditional chains and demonstrate that LRT achieves faster, more accurate inference and outperforms existing efficient reasoning methods.
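
A minimal sketch of the latent-chain idea: compress a variable-length explicit chain into a fixed number of latent vectors that the decoder can attend to, instead of decoding thousands of chain tokens. The learned-query cross-attention compressor, module names, and sizes below are illustrative, not LRT's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical compressor: K learned queries cross-attend over the T step
# embeddings of an explicit chain and return K fixed-length latents.

class LatentChainEncoder(nn.Module):
    def __init__(self, d_model=256, k_latents=8, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(k_latents, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, chain_embs):                        # (B, T, d) chain
        B = chain_embs.size(0)
        q = self.latents.unsqueeze(0).expand(B, -1, -1)   # (B, K, d) queries
        z, _ = self.attn(q, chain_embs, chain_embs)       # cross-attention
        return z                                          # (B, K, d), fixed length

enc = LatentChainEncoder()
chain = torch.randn(2, 500, 256)     # a 500-step explicit chain
print(enc(chain).shape)              # torch.Size([2, 8, 256])
```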

DeepSeek · Efficient Inference · Hybrid Reasoning
10 min read
PaperAgent
Apr 8, 2026 · Artificial Intelligence

How Dynamic Computation Cuts Redundancy in Decoder-Only Multimodal LLMs

This article examines the visual token redundancy in decoder-only multimodal large language models and introduces a training-free dynamic computation reduction framework—featuring Probe-Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm—that significantly lowers inference cost while preserving performance.
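
To make the probe-activated idea concrete, here is a hedged PyTorch sketch: a cheap linear probe scores each visual token, and only the top fraction pass through the expensive FFN while the rest are carried through unchanged. The keep ratio, probe design, and shapes are assumptions, not the paper's specifics:

```python
import torch
import torch.nn as nn

# Hypothetical probe-activated FFN: compute the full FFN only for tokens
# the probe deems worth it; skipped tokens keep their residual value.

class ProbeActivatedFFN(nn.Module):
    def __init__(self, d=512, hidden=2048, keep_ratio=0.3):
        super().__init__()
        self.probe = nn.Linear(d, 1)                        # cheap scorer
        self.ffn = nn.Sequential(nn.Linear(d, hidden), nn.GELU(),
                                 nn.Linear(hidden, d))
        self.keep_ratio = keep_ratio

    def forward(self, x):                                   # (B, N, d) visual tokens
        scores = self.probe(x).squeeze(-1)                  # (B, N)
        k = max(1, int(self.keep_ratio * x.size(1)))
        idx = scores.topk(k, dim=1).indices                 # tokens worth computing
        batch = torch.arange(x.size(0)).unsqueeze(1)        # (B, 1) row index
        out = x.clone()
        out[batch, idx] = x[batch, idx] + self.ffn(x[batch, idx])
        return out                                          # the rest skip the FFN

m = ProbeActivatedFFN()
print(m(torch.randn(2, 100, 512)).shape)   # only ~30 of 100 tokens hit the FFN
```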

Efficient Inference · decoder-only architecture · dynamic computation
12 min read
Machine Learning Algorithms & Natural Language Processing
Feb 24, 2026 · Artificial Intelligence

How COMI Achieves 25‑Point Performance Gains at 32× Compression Using Marginal Information Gain (ICLR 2026)

The COMI framework introduces a marginal information gain metric and a coarse‑to‑fine adaptive compression strategy that preserves relevance and diversity, enabling 32× text compression while boosting downstream QA performance by up to 25 points and doubling inference speed.
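
COMI's exact marginal-information-gain metric isn't given in this teaser; the sketch below shows the generic greedy recipe such methods build on, repeatedly selecting the sentence whose relevance-minus-redundancy gain is highest until the token budget is spent. The similarity function and weights are toy stand-ins:

```python
# Hedged sketch of compression by marginal gain: each candidate's gain is
# its query relevance minus its redundancy with what is already kept.

def marginal_gain(cand, query, kept, sim, lam=0.7):
    redundancy = max((sim(cand, s) for s in kept), default=0.0)
    return lam * sim(cand, query) - (1 - lam) * redundancy

def compress(sentences, query, sim, budget_tokens):
    kept, used, pool = [], 0, list(sentences)
    while pool:
        best = max(pool, key=lambda s: marginal_gain(s, query, kept, sim))
        cost = len(best.split())                 # crude token count
        if used + cost > budget_tokens:
            break
        kept.append(best); used += cost
        pool.remove(best)
    return kept

# Toy similarity: Jaccard word overlap.
def sim(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

docs = ["the cat sat on the mat", "a cat sat on a mat", "dogs chase cats"]
print(compress(docs, "where did the cat sit", sim, budget_tokens=8))
# ['the cat sat on the mat'] — the near-duplicate and the irrelevant line are dropped
```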

Context Compression · Efficient Inference · Long-Context Retrieval
7 min read
AI Frontier Lectures
Feb 10, 2026 · Artificial Intelligence

Can an 8B Model Outperform GPT‑4 in Faithfulness Detection? Inside FaithLens

FaithLens is an 8‑billion‑parameter model that surpasses GPT‑4.1 and other large models on twelve hallucination‑detection benchmarks while providing high‑quality natural‑language explanations, thanks to a novel data‑synthesis pipeline, three‑dimensional filtering, and rule‑based reinforcement learning.
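
The training recipe is described only at a high level; as a hedged sketch, "rule-based reinforcement learning" in this setting typically means rewards computed by deterministic rules (label match, a well-formed explanation) rather than a learned reward model. The output format, fields, and weights below are assumptions, not FaithLens's actual rules:

```python
import re

# Hypothetical rule-based reward for a faithfulness detector whose outputs
# are assumed to look like "Label: faithful\nExplanation: ...".

def rule_reward(output: str, gold_label: str) -> float:
    m = re.search(r"Label:\s*(faithful|hallucinated)", output, re.I)
    if not m:
        return -1.0                                 # malformed output: hard penalty
    reward = 1.0 if m.group(1).lower() == gold_label else 0.0
    if re.search(r"Explanation:\s*.{20,}", output, re.S):
        reward += 0.2                               # bonus for a substantive explanation
    return reward

print(rule_reward(
    "Label: faithful\nExplanation: the claim matches the cited source text.",
    "faithful"))   # 1.2
```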

Efficient Inference · LLM hallucination · explainable AI
12 min read
PaperAgent
Jan 13, 2026 · Artificial Intelligence

How Engram’s Conditional Memory Redefines Sparsity in Large Language Models

DeepSeek’s newly released Engram module introduces a conditional memory mechanism that leverages O(1) N‑gram lookup to create a new sparsity axis for large language models, reducing early‑layer compute, improving inference efficiency, and delivering notable performance gains across reasoning and knowledge tasks, as demonstrated by extensive experiments on 27‑billion‑parameter models.
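
A hedged sketch of what O(1) N-gram lookup can look like in code: hash the trailing n-gram of token ids into a fixed table of learned vectors, so each memory read costs one hash plus one embedding lookup regardless of model size. The table size, hash function, and additive merge are illustrative choices, not Engram's actual design:

```python
import torch
import torch.nn as nn

# Hypothetical conditional-memory module keyed by the trailing n-gram.

class NGramMemory(nn.Module):
    def __init__(self, d_model=256, table_size=1 << 20, n=2):
        super().__init__()
        self.table = nn.Embedding(table_size, d_model)
        self.n, self.size = n, table_size

    def slots(self, ids):                     # ids: (B, T) token ids
        # Polynomial hash over the last n tokens at every position
        # (wrap-around at the sequence start is ignored for brevity).
        h = torch.zeros_like(ids)
        for k in range(self.n):
            h = (h * 1000003 + torch.roll(ids, shifts=k, dims=1)) % self.size
        return h

    def forward(self, ids, hidden):           # hidden: (B, T, d_model)
        return hidden + self.table(self.slots(ids))   # O(1) read per token

mem = NGramMemory()
ids = torch.randint(0, 50000, (2, 16))
print(mem(ids, torch.randn(2, 16, 256)).shape)   # torch.Size([2, 16, 256])
```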

Conditional Memory · Efficient Inference · Engram
8 min read
21CTO
Nov 4, 2025 · Artificial Intelligence

LongCat-Flash-Omni: How an Open-Source 560B Model Achieves Real-Time Multimodal Mastery

LongCat-Flash-Omni, an open‑source 560‑billion‑parameter multimodal model, combines an efficient Shortcut‑Connected MoE architecture with advanced perception and speech modules to deliver low‑latency, real‑time audio‑video interaction and state‑of‑the‑art performance across text, image, video, and audio tasks.

Efficient Inference · Large Language Model · Real-Time Interaction
10 min read
AntTech
Oct 29, 2025 · Artificial Intelligence

Inside Ant’s Baoling: Balancing Efficiency and Reasoning in a 1‑Trillion‑Parameter Model

At the Ant Star Innovation Journey event, the Baoling team unveiled their roadmap for trillion‑parameter models, detailing the development of Ling‑1T, Ring‑1T, and the multimodal Ming series, the scaling‑law‑guided architecture, training innovations, evaluation methods, and open‑source releases that aim to advance efficient, high‑performance AI.

Efficient Inference · Large Language Model · Scaling Law
24 min read
Meituan Technology Team
Oct 9, 2025 · Artificial Intelligence

How VSRM Cuts Redundant Reasoning Steps in Large Language Models

The paper introduces VSRM, a verifiable step‑reward mechanism that penalizes ineffective reasoning steps and rewards useful ones in large language model inference, dramatically shortening output length while preserving or even improving performance across multiple benchmarks and reinforcement‑learning algorithms.
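
The paper's exact step-reward is not reproduced in this teaser; a minimal sketch of the verifiable-step-reward idea scores each step by how much it moves a verifiable signal (here, a verifier's estimate of the probability of the correct answer), so steps that add nothing are penalized. The probe and the penalty value are assumptions:

```python
# Hedged sketch of per-step rewards from a verifiable progress signal.

def step_rewards(p_correct_after_step, penalty=0.05):
    """p_correct_after_step: verifier-estimated P(correct) after each step,
    with index 0 meaning before any reasoning."""
    rewards = []
    for prev, cur in zip(p_correct_after_step, p_correct_after_step[1:]):
        delta = cur - prev
        rewards.append(delta if delta > 0 else -penalty)   # flat steps cost
    return rewards

# A chain where steps 2 and 4 are redundant:
print(step_rewards([0.1, 0.4, 0.4, 0.8, 0.8, 0.95]))
# roughly [0.3, -0.05, 0.4, -0.05, 0.15]
```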

AI · Efficient Inference · large-language-models
10 min read
AntTech
Sep 11, 2025 · Artificial Intelligence

Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters

Ling-mini-2.0, an open-source 16B MoE language model that activates only 1.4B parameters, achieves dense-level performance with 7× efficiency, generates over 300 tokens/s, and introduces the first FP8 mixed-precision training suite, offering multiple pre-training checkpoints for the AI community.
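
The routing math behind "only 1.4B active" is standard top-k MoE: each token is sent to k of E experts, so roughly k/E of the expert parameters run per token. A minimal PyTorch sketch with illustrative sizes, not Ling-mini-2.0's real configuration:

```python
import torch
import torch.nn as nn

# Hypothetical top-k mixture-of-experts layer: the router keeps only the
# k highest-scoring experts per token, leaving the rest inactive.

class TopKMoE(nn.Module):
    def __init__(self, d=256, n_experts=32, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                           # x: (T, d) token states
        gates = self.router(x).softmax(dim=-1)      # (T, E) routing weights
        w, idx = gates.topk(self.k, dim=-1)         # keep only top-k experts
        w = w / w.sum(-1, keepdim=True)             # renormalize kept weights
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                  # naive loop for clarity
            for j in range(self.k):
                out[t] += w[t, j] * self.experts[int(idx[t, j])](x[t])
        return out

moe = TopKMoE()
print(moe(torch.randn(4, 256)).shape)   # torch.Size([4, 256]); 2 of 32 experts per token
```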

Efficient Inference · FP8 training · MoE
6 min read
Alibaba Cloud Big Data AI Platform
Feb 25, 2025 · Artificial Intelligence

How DistilQwen2.5 Boosts LLM Efficiency with Dual‑Stage Knowledge Distillation

This article introduces DistilQwen2.5, a lightweight LLM series built on Qwen2.5 that uses a novel two‑stage distillation framework, instruction‑data optimization, and parameter‑fusion techniques to achieve higher performance while drastically reducing computational cost and deployment overhead.
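
The two-stage framework itself is not reproduced here; the sketch below shows the standard distillation objective such pipelines build on, mixing a temperature-scaled KL term against the teacher's soft targets with ordinary cross-entropy on gold tokens. The temperature and mixing weight are assumed values:

```python
import torch
import torch.nn.functional as F

# Classic knowledge-distillation loss: soften both logit distributions with
# a temperature T, match them via KL, and blend with the hard-label term.

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)   # standard T^2 scaling
    ce = F.cross_entropy(student_logits, targets)    # hard-label term
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(8, 1000)               # logits over a toy 1000-token vocab
teacher = torch.randn(8, 1000)
gold = torch.randint(0, 1000, (8,))
print(distill_loss(student, teacher, gold).item())
```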

Efficient Inference · LLM · knowledge distillation
26 min read
Meituan Technology Team
Jun 23, 2022 · Artificial Intelligence

YOLOv6: An Efficient Industrial Object Detection Framework

YOLOv6, an open‑source industrial object detection framework from Meituan Visual Intelligence, combines a hardware‑friendly EfficientRep backbone, Rep‑PAN neck, and Efficient Decoupled Head with anchor‑free training, SimOTA assignment, and SIoU loss, delivering COCO AP up to 43.1% at over 500 FPS and supporting TensorRT, OpenVINO, MNN, TNN, and NCNN deployment.
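
As a hedged illustration of one named component, the sketch below shows what a decoupled detection head looks like in PyTorch: separate convolution branches for classification and anchor-free box regression instead of one shared branch. Channel counts and the single shared stem are simplifications, not YOLOv6's exact Efficient Decoupled Head:

```python
import torch
import torch.nn as nn

# Hypothetical decoupled head for one FPN level: a 1x1 stem, then separate
# branches for per-cell class scores and anchor-free box offsets (l, t, r, b).

class DecoupledHead(nn.Module):
    def __init__(self, in_ch=256, n_classes=80):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, in_ch, 1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, n_classes, 1))          # class scores per cell
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, 4, 1))                  # anchor-free box offsets

    def forward(self, feat):                         # feat: (B, C, H, W)
        x = self.stem(feat)
        return self.cls_branch(x), self.reg_branch(x)

head = DecoupledHead()
cls, box = head(torch.randn(1, 256, 40, 40))
print(cls.shape, box.shape)   # (1, 80, 40, 40) (1, 4, 40, 40)
```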

Efficient Inference · YOLOv6 · anchor-free
13 min read