Machine Heart
Apr 26, 2026 · Artificial Intelligence

Balanced Thinking: Boost LLM Accuracy by 10% While Cutting Inference Length 35%

The paper introduces ReBalance, a training‑free two‑stage inference control framework that uses model confidence signals to dynamically balance reasoning depth, achieving up to a 10‑point accuracy gain and a 35.4% reduction in token length across multiple LLM sizes and benchmarks.
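
The teaser doesn't spell out the framework, but its core mechanic, gating reasoning depth on a confidence signal, can be sketched. A minimal illustration in Python, where the confidence measure (mean token probability per step), the stopping threshold, and the step budget are all assumptions rather than the paper's actual values:

```python
import math

# Hypothetical sketch of confidence-gated reasoning depth control. In
# practice step texts and token log-probs would come from an LLM's
# incremental decoding API; here the model is a toy stub.

CONF_STOP = 0.85   # confidence above which reasoning stops early (assumed)
MAX_STEPS = 8      # hard budget on reasoning steps (assumed)

def step_confidence(token_logprobs):
    """Mean token probability of one reasoning step, in [0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def balanced_decode(generate_step):
    """Stage 1: keep reasoning while confidence is low.
    Stage 2: once confident (or out of budget), hand off to the answer."""
    steps = []
    for _ in range(MAX_STEPS):
        text, logprobs = generate_step(steps)    # one reasoning step
        steps.append(text)
        if step_confidence(logprobs) >= CONF_STOP:
            break                                # confident: cut the chain short
    return steps

# Toy stand-in for an LLM: confidence rises as steps accumulate.
def toy_step(steps):
    conf = min(0.95, 0.5 + 0.15 * len(steps))
    return f"step {len(steps) + 1}", [math.log(conf)] * 4

print(balanced_decode(toy_step))   # stops after 4 steps instead of 8
```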

Balanced Thinking · Confidence Steering · Efficient Inference
9 min read
Machine Learning Algorithms & Natural Language Processing
Apr 16, 2026 · Artificial Intelligence

Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward‑shaping techniques can shorten the chain‑of‑thought outputs of Qwen 30B‑series models by 20‑40% while preserving or slightly improving performance on AIME‑25 and out‑of‑distribution benchmarks, and details the experimental design, strategic considerations, and practical insights behind this efficient‑reasoning approach.
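
As a rough illustration of the recipe (not the article's exact shaping function), one common design rewards correctness first and applies length pressure only to answers that are already correct, so the policy is never pushed to be short at the expense of accuracy. The budget and weight below are assumed:

```python
# Hedged sketch of a length-aware shaped reward for RL fine-tuning.

def shaped_reward(correct: bool, n_tokens: int,
                  budget: int = 4096, alpha: float = 0.2) -> float:
    if not correct:
        return 0.0                           # no credit, and no length pressure
    frac = min(1.0, n_tokens / budget)       # fraction of the token budget spent
    return 1.0 - alpha * frac                # correct and concise scores highest

print(shaped_reward(True, 1024))    # 0.95
print(shaped_reward(True, 4096))    # 0.8
print(shaped_reward(False, 512))    # 0.0
```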

Efficient Inference · Qwen · Reward Shaping
16 min read
Machine Heart
Apr 12, 2026 · Artificial Intelligence

LRT: Implicit Reasoning Chains Boost Speed and Accuracy by Removing Redundant Steps

Researchers introduce Latent Reasoning Tuning (LRT), a lightweight inference network that encodes explicit reasoning chains into fixed‑length latent vectors, eliminating thousands of decoding steps; experiments reveal substantial redundancy in traditional chains and demonstrate that LRT achieves faster, more accurate inference and outperforms existing efficient reasoning methods.
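
A minimal sketch of the latent-chain idea: compress a variable-length explicit chain into a fixed number of latent vectors that the decoder can attend to, instead of decoding thousands of chain tokens. The learned-query cross-attention compressor, module names, and sizes below are illustrative, not LRT's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical compressor: K learned queries cross-attend over the T step
# embeddings of an explicit chain and return K fixed-length latents.

class LatentChainEncoder(nn.Module):
    def __init__(self, d_model=256, k_latents=8, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(k_latents, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, chain_embs):                        # (B, T, d) chain
        B = chain_embs.size(0)
        q = self.latents.unsqueeze(0).expand(B, -1, -1)   # (B, K, d) queries
        z, _ = self.attn(q, chain_embs, chain_embs)       # cross-attention
        return z                                          # (B, K, d), fixed length

enc = LatentChainEncoder()
chain = torch.randn(2, 500, 256)     # a 500-step explicit chain
print(enc(chain).shape)              # torch.Size([2, 8, 256])
```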

DeepSeek · Efficient Inference · Hybrid Reasoning
10 min read
PaperAgent
Apr 8, 2026 · Artificial Intelligence

How Dynamic Computation Cuts Redundancy in Decoder-Only Multimodal LLMs

This article examines the visual token redundancy in decoder-only multimodal large language models and introduces a training-free dynamic computation reduction framework—featuring Probe-Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm—that significantly lowers inference cost while preserving performance.
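
To make the probe-activated idea concrete, here is a hedged PyTorch sketch: a cheap linear probe scores each visual token, and only the top fraction pass through the expensive FFN while the rest are carried through unchanged. The keep ratio, probe design, and shapes are assumptions, not the paper's specifics:

```python
import torch
import torch.nn as nn

# Hypothetical probe-activated FFN: compute the full FFN only for tokens
# the probe deems worth it; skipped tokens keep their residual value.

class ProbeActivatedFFN(nn.Module):
    def __init__(self, d=512, hidden=2048, keep_ratio=0.3):
        super().__init__()
        self.probe = nn.Linear(d, 1)                        # cheap scorer
        self.ffn = nn.Sequential(nn.Linear(d, hidden), nn.GELU(),
                                 nn.Linear(hidden, d))
        self.keep_ratio = keep_ratio

    def forward(self, x):                                   # (B, N, d) visual tokens
        scores = self.probe(x).squeeze(-1)                  # (B, N)
        k = max(1, int(self.keep_ratio * x.size(1)))
        idx = scores.topk(k, dim=1).indices                 # tokens worth computing
        batch = torch.arange(x.size(0)).unsqueeze(1)        # (B, 1) row index
        out = x.clone()
        out[batch, idx] = x[batch, idx] + self.ffn(x[batch, idx])
        return out                                          # the rest skip the FFN

m = ProbeActivatedFFN()
print(m(torch.randn(2, 100, 512)).shape)   # only ~30 of 100 tokens hit the FFN
```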

Efficient Inference · decoder-only architecture · dynamic computation
12 min read
Machine Learning Algorithms & Natural Language Processing
Feb 24, 2026 · Artificial Intelligence

How COMI Achieves 25‑Point Performance Gains at 32× Compression Using Marginal Information Gain (ICLR 2026)

The COMI framework introduces a marginal information gain metric and a coarse‑to‑fine adaptive compression strategy that preserves relevance and diversity, enabling 32× text compression while boosting downstream QA performance by up to 25 points and doubling inference speed.
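
COMI's exact marginal-information-gain metric isn't given in this teaser; the sketch below shows the generic greedy recipe such methods build on, repeatedly selecting the sentence whose relevance-minus-redundancy gain is highest until the token budget is spent. The similarity function and weights are toy stand-ins:

```python
# Hedged sketch of compression by marginal gain: each candidate's gain is
# its query relevance minus its redundancy with what is already kept.

def marginal_gain(cand, query, kept, sim, lam=0.7):
    redundancy = max((sim(cand, s) for s in kept), default=0.0)
    return lam * sim(cand, query) - (1 - lam) * redundancy

def compress(sentences, query, sim, budget_tokens):
    kept, used, pool = [], 0, list(sentences)
    while pool:
        best = max(pool, key=lambda s: marginal_gain(s, query, kept, sim))
        cost = len(best.split())                 # crude token count
        if used + cost > budget_tokens:
            break
        kept.append(best); used += cost
        pool.remove(best)
    return kept

# Toy similarity: Jaccard word overlap.
def sim(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

docs = ["the cat sat on the mat", "a cat sat on a mat", "dogs chase cats"]
print(compress(docs, "where did the cat sit", sim, budget_tokens=8))
# ['the cat sat on the mat'] — the near-duplicate and the irrelevant line are dropped
```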

Context Compression · Efficient Inference · Long-Context Retrieval
7 min read
AI Frontier Lectures
Feb 10, 2026 · Artificial Intelligence

Can an 8B Model Outperform GPT‑4 in Faithfulness Detection? Inside FaithLens

FaithLens is an 8‑billion‑parameter model that surpasses GPT‑4.1 and other large models on twelve hallucination‑detection benchmarks while providing high‑quality natural‑language explanations, thanks to a novel data‑synthesis pipeline, three‑dimensional filtering, and rule‑based reinforcement learning.
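
The training recipe is described only at a high level; as a hedged sketch, "rule-based reinforcement learning" in this setting typically means rewards computed by deterministic rules (label match, a well-formed explanation) rather than a learned reward model. The output format, fields, and weights below are assumptions, not FaithLens's actual rules:

```python
import re

# Hypothetical rule-based reward for a faithfulness detector whose outputs
# are assumed to look like "Label: faithful\nExplanation: ...".

def rule_reward(output: str, gold_label: str) -> float:
    m = re.search(r"Label:\s*(faithful|hallucinated)", output, re.I)
    if not m:
        return -1.0                                 # malformed output: hard penalty
    reward = 1.0 if m.group(1).lower() == gold_label else 0.0
    if re.search(r"Explanation:\s*.{20,}", output, re.S):
        reward += 0.2                               # bonus for a substantive explanation
    return reward

print(rule_reward(
    "Label: faithful\nExplanation: the claim matches the cited source text.",
    "faithful"))   # 1.2
```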

Efficient Inference · LLM hallucination · explainable AI
12 min read
PaperAgent
Jan 13, 2026 · Artificial Intelligence

How Engram’s Conditional Memory Redefines Sparsity in Large Language Models

DeepSeek’s newly released Engram module introduces a conditional memory mechanism that leverages O(1) N‑gram lookup to create a new sparsity axis for large language models, reducing early‑layer compute, improving inference efficiency, and delivering notable performance gains across reasoning and knowledge tasks, as demonstrated by extensive experiments on 27‑billion‑parameter models.
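
A hedged sketch of what O(1) N-gram lookup can look like in code: hash the trailing n-gram of token ids into a fixed table of learned vectors, so each memory read costs one hash plus one embedding lookup regardless of model size. The table size, hash function, and additive merge are illustrative choices, not Engram's actual design:

```python
import torch
import torch.nn as nn

# Hypothetical conditional-memory module keyed by the trailing n-gram.

class NGramMemory(nn.Module):
    def __init__(self, d_model=256, table_size=1 << 20, n=2):
        super().__init__()
        self.table = nn.Embedding(table_size, d_model)
        self.n, self.size = n, table_size

    def slots(self, ids):                     # ids: (B, T) token ids
        # Polynomial hash over the last n tokens at every position
        # (wrap-around at the sequence start is ignored for brevity).
        h = torch.zeros_like(ids)
        for k in range(self.n):
            h = (h * 1000003 + torch.roll(ids, shifts=k, dims=1)) % self.size
        return h

    def forward(self, ids, hidden):           # hidden: (B, T, d_model)
        return hidden + self.table(self.slots(ids))   # O(1) read per token

mem = NGramMemory()
ids = torch.randint(0, 50000, (2, 16))
print(mem(ids, torch.randn(2, 16, 256)).shape)   # torch.Size([2, 16, 256])
```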

Conditional Memory · Efficient Inference · Engram
8 min read
21CTO
Nov 4, 2025 · Artificial Intelligence

LongCat-Flash-Omni: How an Open-Source 560B Model Achieves Real-Time Multimodal Mastery

LongCat-Flash-Omni, an open‑source 560‑billion‑parameter multimodal model, combines an efficient Shortcut‑Connected MoE architecture with advanced perception and speech modules to deliver low‑latency, real‑time audio‑video interaction and state‑of‑the‑art performance across text, image, video, and audio tasks.

Efficient Inference · Large Language Model · Real-Time Interaction
10 min read
AntTech
Oct 29, 2025 · Artificial Intelligence

Inside Ant’s Baoling: Balancing Efficiency and Reasoning in a 1‑Trillion‑Parameter Model

At the Ant Star Innovation Journey event, the Baoling team unveiled their roadmap for trillion‑parameter models, detailing the development of Ling‑1T, Ring‑1T, and the multimodal Ming series, the scaling‑law‑guided architecture, training innovations, evaluation methods, and open‑source releases that aim to advance efficient, high‑performance AI.

Efficient Inference · Large Language Model · Scaling Law
24 min read
Meituan Technology Team
Oct 9, 2025 · Artificial Intelligence

How VSRM Cuts Redundant Reasoning Steps in Large Language Models

The paper introduces VSRM, a verifiable step‑reward mechanism that penalizes ineffective reasoning steps and rewards useful ones in large language model inference, dramatically shortening output length while preserving or even improving performance across multiple benchmarks and reinforcement‑learning algorithms.
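
The paper's exact step-reward is not reproduced in this teaser; a minimal sketch of the verifiable-step-reward idea scores each step by how much it moves a verifiable signal (here, a verifier's estimate of the probability of the correct answer), so steps that add nothing are penalized. The probe and the penalty value are assumptions:

```python
# Hedged sketch of per-step rewards from a verifiable progress signal.

def step_rewards(p_correct_after_step, penalty=0.05):
    """p_correct_after_step: verifier-estimated P(correct) after each step,
    with index 0 meaning before any reasoning."""
    rewards = []
    for prev, cur in zip(p_correct_after_step, p_correct_after_step[1:]):
        delta = cur - prev
        rewards.append(delta if delta > 0 else -penalty)   # flat steps cost
    return rewards

# A chain where steps 2 and 4 are redundant:
print(step_rewards([0.1, 0.4, 0.4, 0.8, 0.8, 0.95]))
# roughly [0.3, -0.05, 0.4, -0.05, 0.15]
```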

AI · Efficient Inference · large-language-models
10 min read
AntTech
Sep 11, 2025 · Artificial Intelligence

Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters

Ling-mini-2.0, an open-source 16B MoE language model that activates only 1.4B parameters, achieves dense-level performance with 7× efficiency, generates over 300 tokens/s, and introduces the first FP8 mixed-precision training suite, offering multiple pre-training checkpoints for the AI community.
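
The routing math behind "only 1.4B active" is standard top-k MoE: each token is sent to k of E experts, so roughly k/E of the expert parameters run per token. A minimal PyTorch sketch with illustrative sizes, not Ling-mini-2.0's real configuration:

```python
import torch
import torch.nn as nn

# Hypothetical top-k mixture-of-experts layer: the router keeps only the
# k highest-scoring experts per token, leaving the rest inactive.

class TopKMoE(nn.Module):
    def __init__(self, d=256, n_experts=32, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                           # x: (T, d) token states
        gates = self.router(x).softmax(dim=-1)      # (T, E) routing weights
        w, idx = gates.topk(self.k, dim=-1)         # keep only top-k experts
        w = w / w.sum(-1, keepdim=True)             # renormalize kept weights
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                  # naive loop for clarity
            for j in range(self.k):
                out[t] += w[t, j] * self.experts[int(idx[t, j])](x[t])
        return out

moe = TopKMoE()
print(moe(torch.randn(4, 256)).shape)   # torch.Size([4, 256]); 2 of 32 experts per token
```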

Efficient Inference · FP8 training · MoE
6 min read
Alibaba Cloud Big Data AI Platform
Feb 25, 2025 · Artificial Intelligence

How DistilQwen2.5 Boosts LLM Efficiency with Dual‑Stage Knowledge Distillation

This article introduces DistilQwen2.5, a lightweight LLM series built on Qwen2.5 that uses a novel two‑stage distillation framework, instruction‑data optimization, and parameter‑fusion techniques to achieve higher performance while drastically reducing computational cost and deployment overhead.
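
The two-stage framework itself is not reproduced here; the sketch below shows the standard distillation objective such pipelines build on, mixing a temperature-scaled KL term against the teacher's soft targets with ordinary cross-entropy on gold tokens. The temperature and mixing weight are assumed values:

```python
import torch
import torch.nn.functional as F

# Classic knowledge-distillation loss: soften both logit distributions with
# a temperature T, match them via KL, and blend with the hard-label term.

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)   # standard T^2 scaling
    ce = F.cross_entropy(student_logits, targets)    # hard-label term
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(8, 1000)               # logits over a toy 1000-token vocab
teacher = torch.randn(8, 1000)
gold = torch.randint(0, 1000, (8,))
print(distill_loss(student, teacher, gold).item())
```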

Efficient Inference · LLM · knowledge distillation
26 min read
Meituan Technology Team
Jun 23, 2022 · Artificial Intelligence

YOLOv6: An Efficient Industrial Object Detection Framework

YOLOv6, an open‑source industrial object detection framework from Meituan Visual Intelligence, combines a hardware‑friendly EfficientRep backbone, Rep‑PAN neck, and Efficient Decoupled Head with anchor‑free training, SimOTA assignment, and SIoU loss, delivering COCO AP up to 43.1% at over 500 FPS and supporting TensorRT, OpenVINO, MNN, TNN, and NCNN deployment.
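
As a hedged illustration of one named component, the sketch below shows what a decoupled detection head looks like in PyTorch: separate convolution branches for classification and anchor-free box regression instead of one shared branch. Channel counts and the single shared stem are simplifications, not YOLOv6's exact Efficient Decoupled Head:

```python
import torch
import torch.nn as nn

# Hypothetical decoupled head for one FPN level: a 1x1 stem, then separate
# branches for per-cell class scores and anchor-free box offsets (l, t, r, b).

class DecoupledHead(nn.Module):
    def __init__(self, in_ch=256, n_classes=80):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, in_ch, 1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, n_classes, 1))          # class scores per cell
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, 4, 1))                  # anchor-free box offsets

    def forward(self, feat):                         # feat: (B, C, H, W)
        x = self.stem(feat)
        return self.cls_branch(x), self.reg_branch(x)

head = DecoupledHead()
cls, box = head(torch.randn(1, 256, 40, 40))
print(cls.shape, box.shape)   # (1, 80, 40, 40) (1, 4, 40, 40)
```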

Efficient Inference · YOLOv6 · anchor-free
13 min read