Collection: 100 articles · Page 4 of 5
Machine Learning Algorithms & Natural Language Processing
Mar 11, 2026 · Artificial Intelligence

Why LLMs Overthink: ICLR 2026 Study Reveals the Key Bottleneck in Inference Efficiency

The ICLR 2026 paper identifies reasoning miscalibration (overthinking easy steps and underthinking critical ones) as the root cause of runaway LLM inference costs, and proposes the Budget Allocation Model (BAM) and a training‑free Plan‑and‑Budget framework that adaptively distributes compute, achieving up to 70% higher accuracy while cutting token usage by 39% and boosting the new E³ efficiency metric by 193.8%.

Budget Allocation Model · E3 Metric · Epistemic Uncertainty
0 likes · 12 min read
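
To see how an efficiency metric of this kind can jump by 193.8% while token usage falls 39%, here is a toy sketch; the accuracy‑squared‑over‑tokens form is an assumption for illustration, not the paper's exact E³ definition.

```python
def e3_score(accuracy: float, tokens: int) -> float:
    """Hypothetical E3-style efficiency score: reward accuracy superlinearly
    while penalizing token spend. The paper's exact formula may differ."""
    return accuracy ** 2 / tokens

# Higher accuracy on fewer tokens compounds in such a ratio, which is how a
# large metric gain can coexist with modest-looking component improvements.
baseline = e3_score(0.50, 1000)
improved = e3_score(0.70, 610)      # 39% fewer tokens, higher accuracy
print(improved / baseline)          # ~3.2x in this toy example
```
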
AI Frontier Lectures
Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance Sampling · Mixture of Experts · large language models
0 likes · 12 min read
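
The stabilization idea the article examines can be illustrated with a generic token‑level clipped surrogate in PyTorch; this is a PPO‑style sketch under assumed shapes, not the paper's exact objective, and Routing Replay for MoE models is not shown.

```python
import torch

def token_level_surrogate(logp_new, logp_old, seq_advantage, clip_eps=0.2):
    """Token-level clipped surrogate sketch. logp_new / logp_old: per-token
    log-probs under the current and behavior policies, shape [T];
    seq_advantage: scalar sequence-level advantage broadcast to all tokens."""
    ratio = torch.exp(logp_new - logp_old)                   # per-token IS ratio
    unclipped = ratio * seq_advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * seq_advantage
    return -torch.min(unclipped, clipped).sum()              # loss to minimize
```

Clipping each token's ratio separately keeps one outlier token from destabilizing the whole sequence's update, which is the stability property at issue.
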
Baobao Algorithm Notes
May 6, 2024 · Artificial Intelligence

DeepSeek-V2: 236B MoE LLM Delivers Higher Performance While Cutting Training Cost by 42%

DeepSeek‑V2 is a 236‑billion‑parameter mixture‑of‑experts language model that reduces training cost by 42.5%, cuts KV‑cache usage by 93.3%, and boosts generation throughput 5.76×, while achieving state‑of‑the‑art scores on benchmarks such as MMLU, C‑Eval, BBH, HumanEval, and GSM8K for both base and chat variants.

AI · DeepSeek-V2 · Mixture of Experts
0 likes · 11 min read
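
For readers new to mixture‑of‑experts, a generic top‑k routing sketch is below (this is not DeepSeek‑V2's actual DeepSeekMoE layer or its MLA KV‑cache compression): only the selected experts run per token, which is why total parameters can be huge while per‑token compute stays small.

```python
import torch
import torch.nn.functional as F

def topk_moe_forward(x, gate, experts, k=2):
    """Generic top-k MoE routing sketch. x: [tokens, d]; gate: nn.Linear(d, E);
    experts: list of E feed-forward modules mapping d -> d. Each token runs
    through only its k highest-scoring experts, mixed by softmaxed gate weights."""
    weights, idx = gate(x).topk(k, dim=-1)            # [tokens, k]
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(k):
            mask = idx[:, slot] == e                  # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```
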
Alibaba Cloud Big Data AI Platform
Nov 8, 2024 · Artificial Intelligence

How TAPIR Boosts Small LLMs with Task‑Aware Curriculum Planning

The paper introduces TAPIR, a task‑aware curriculum planning framework that distills instruction‑following abilities from black‑box LLM teachers into smaller student models by filtering difficult prompts, resampling tasks, enhancing response styles, and iteratively optimizing across multiple training rounds, achieving superior performance on benchmark evaluations.

Instruction Tuning · Knowledge Distillation · LLM distillation
0 likes · 10 min read
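
The difficulty‑filtering step in such a curriculum can be sketched generically; TAPIR's actual difficulty measure, task resampling, and multi‑round schedule are richer than this, and `student_loss` and `threshold` here are illustrative placeholders.

```python
def filter_hard_prompts(pairs, student_loss, threshold=2.0):
    """Keep (prompt, teacher_response) pairs the student currently finds
    hard, i.e. where its loss on the teacher's response exceeds a threshold,
    so each training round concentrates on the student's weak spots."""
    return [(p, r) for (p, r) in pairs if student_loss(p, r) > threshold]
```
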
Amap Tech
Dec 30, 2020 · Artificial Intelligence

LRC-BERT: Contrastive-Learning-Based Knowledge Distillation with COS‑NCE Loss for Efficient NLP Models

The Amap team introduced LRC‑BERT, a contrastive‑learning‑based knowledge‑distillation framework that employs a novel COS‑NCE loss, gradient perturbation, and a two‑stage training schedule, enabling a 4‑layer student model to retain about 97% of BERT‑Base accuracy while being 7.5× smaller and 9.6× faster; it has already improved real‑world traffic‑event extraction performance.

BERT · COS-NCE loss · NLP
0 likes · 16 min read
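
A hedged sketch of what a COS‑NCE‑style contrastive distillation loss can look like in PyTorch (LRC‑BERT's exact angular formulation may differ): the student's intermediate representation is pulled toward the teacher's representation of the same input and pushed away from teacher representations of other inputs in the batch.

```python
import torch
import torch.nn.functional as F

def cos_nce_style_loss(student_h, teacher_h):
    """student_h, teacher_h: [B, d] hidden states for the same batch;
    teacher_h[i] is the positive for student_h[i], the other rows serve
    as in-batch negatives."""
    sim = F.cosine_similarity(student_h.unsqueeze(1),
                              teacher_h.unsqueeze(0), dim=-1)       # [B, B]
    pos = sim.diagonal()                                            # cos to matching teacher
    B = sim.size(0)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    neg = sim[off_diag].view(B, B - 1).mean(dim=1)                  # mean cos to negatives
    return ((1 - pos) + neg).mean()      # pull positives up, push negatives down
```
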
Xiaohongshu Tech REDtech
Jun 20, 2024 · Artificial Intelligence

Xiaohongshu 2024 Large Model Frontier Paper Sharing Live Event

On June 27, 2024, Xiaohongshu’s technical team will livestream a two‑hour session across WeChat Channels, Bilibili, Douyin and Xiaohongshu, showcasing six top‑conference papers on large‑model advances—including early‑stopping and fine‑grained self‑consistency, novel evaluation methods, negative‑sample‑assisted distillation, and LLM‑based note recommendation—followed by a Q&A and recruitment briefing.

AI research · Knowledge Distillation · Self-Consistency
0 likes · 12 min read
AI Frontier Lectures
Apr 24, 2025 · Artificial Intelligence

How d1 Boosts Reasoning in Diffusion LLMs with Reinforcement Learning

Researchers from UCLA and Meta AI introduce d1, a two‑stage post‑training framework that combines supervised fine‑tuning and a novel diffu‑GRPO reinforcement‑learning algorithm to enable efficient reasoning in masked diffusion large language models, achieving state‑of‑the‑art performance on multiple math and logic benchmarks.

AI · d1 · diffu-GRPO
0 likes · 9 min read
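
The group‑relative advantage at the core of GRPO‑style training is easy to sketch (diffu‑GRPO's adaptation of this to masked diffusion decoding is not reproduced here):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: [G] scalar rewards for G completions sampled from the same
    prompt. Normalizing each reward by the group's mean and std yields a
    per-completion advantage without a learned critic."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```
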
Wu Shixiong's Large Model Academy
Sep 19, 2025 · Artificial Intelligence

Master Parameter-Efficient Fine‑Tuning: LoRA & QLoRA Explained for Interviews

This article explains why full fine‑tuning of large models is impractical, introduces parameter‑efficient fine‑tuning (PEFT) with LoRA and QLoRA, provides mathematical foundations, implementation code, resource‑usage analysis, interview question templates, and practical deployment tips for real‑world AI projects.

LoRA · QLoRA · low-rank adaptation
0 likes · 24 min read
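
A minimal LoRA layer in PyTorch, sketching the standard formulation the article builds on (not the article's exact code): the frozen weight is augmented with a trainable low‑rank update scaled by alpha/r.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha/r) * B A x, with W frozen. Only r * (d_in + d_out)
    parameters are trained instead of d_in * d_out."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

QLoRA keeps the same adapters but stores the frozen base weights in 4‑bit NF4, which is where most of its additional memory savings come from.
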
Data Party THU
Oct 25, 2025 · Artificial Intelligence

How InfLLM‑V2 Delivers Fast, Low‑Cost Sparse Attention for Long‑Context LLMs

InfLLM‑V2 introduces a sparse‑attention framework that adds no extra parameters and trains efficiently, dramatically speeding up long‑sequence processing while requiring only 5B training tokens; the open‑source MiniCPM4.1 model demonstrates performance comparable to dense attention on both long‑text understanding and deep‑thinking benchmarks.

InfLLM-V2 · MiniCPM4.1 · efficiency
0 likes · 10 min read
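
The block‑selection idea behind this style of sparse attention can be sketched as follows; InfLLM‑V2's actual block scoring, kernels, and training recipe differ, and the shapes here are assumptions.

```python
import torch

def select_kv_blocks(q, keys, block_size=64, top_n=16):
    """Summarize each KV block by its mean key, score blocks against the
    query q ([d]), and return indices of the top-n blocks to attend to, so
    attention cost scales with top_n * block_size, not sequence length."""
    usable = keys.size(0) // block_size * block_size
    k_means = keys[:usable].view(-1, block_size, keys.size(-1)).mean(dim=1)
    scores = q @ k_means.T                             # [n_blocks]
    n = min(top_n, scores.size(-1))
    return scores.topk(n).indices
```
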
Alibaba Cloud Big Data AI Platform
Feb 24, 2025 · Artificial Intelligence

How to Distill and Fine‑Tune DeepSeek R1 with Qwen on Alibaba Cloud PAI

This guide walks you through the complete workflow of preparing instruction data, deploying the DeepSeek‑R1 teacher model, using Alibaba Cloud PAI to generate teacher responses, distilling a smaller Qwen2.5‑7B‑Instruct student model, fine‑tuning it, and deploying the final service, with performance comparisons on several math‑reasoning benchmarks.

Alibaba Cloud PAI · DeepSeek
0 likes · 17 min read
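
The teacher‑response step might look like the sketch below, assuming the deployed DeepSeek‑R1 service exposes an OpenAI‑compatible API (the endpoint URL and model name are placeholders, not PAI specifics):

```python
from openai import OpenAI

# Placeholder endpoint and credentials -- substitute your PAI service details.
client = OpenAI(base_url="http://<your-pai-endpoint>/v1", api_key="<token>")

def teacher_answer(prompt: str) -> str:
    """Query the teacher once; the resulting (prompt, answer) pairs form the
    distillation dataset for fine-tuning the Qwen2.5-7B-Instruct student."""
    resp = client.chat.completions.create(
        model="deepseek-r1",                  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content
```
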
Old Zhang's AI Learning
Apr 17, 2026 · Artificial Intelligence

How DFlash Achieves 8× Lossless Acceleration for Large‑Model Inference (Qwen3.5‑27B Example)

The article explains how DFlash’s block‑diffusion draft model and KV Injection boost speculative decoding speed by 5‑8× without sacrificing output quality, and how DDTree further raises the gain to over 8×, backed by benchmark results and integration guides for major inference frameworks.

DDTree · DFlash · Large Language Model Inference
0 likes · 7 min read
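
The "lossless" claim rests on standard speculative‑decoding verification, sketched below for the greedy case; DFlash's block‑diffusion draft model, KV Injection, and DDTree are not reproduced here.

```python
def verify_draft(draft_tokens, target_tokens):
    """target_tokens: the target model's own greedy choices at each draft
    position, obtained in a single batched forward pass. Keeping the longest
    agreeing prefix plus the target's correction reproduces target-only
    decoding exactly, which is why the speedup costs no quality."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)        # target's correction ends this block
            return accepted
        accepted.append(d)
    return accepted                   # whole draft accepted (a real system
                                      # also appends the target's bonus token)
```
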
PaperAgent
Mar 31, 2026 · Artificial Intelligence

Can Dynamic Computation Reduction Slash Redundancy in Decoder‑Only Multimodal LLMs?

This article analyzes the visual token redundancy in decoder‑only multimodal large language models and presents a training‑free dynamic computation reduction framework—including Probe‑Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm—that dramatically speeds up inference while preserving or even improving model performance.

decoder-only MLLM · dynamic computation · multimodal AI
0 likes · 13 min read
Data Party THU
Oct 10, 2025 · Artificial Intelligence

How DPad Cuts Inference Time 61× While Boosting Accuracy in Diffusion LLMs

The article analyzes a recent Duke University paper that reveals a "scratchpad" mechanism in diffusion large language models, proposes the DPad method to prune redundant suffix tokens before decoding, and demonstrates up to 61.4× faster inference with unchanged or even improved accuracy across multiple benchmarks.

DPad · Inference acceleration · diffusion LLM
0 likes · 10 min read
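
The suffix‑pruning idea attributed to DPad can be sketched as a keep‑mask over positions; the paper's actual distance‑based selection rule may differ, and `window` is an illustrative parameter.

```python
def suffix_keep_mask(seq_len: int, decoded_upto: int, window: int = 32):
    """Keep the decoded prefix plus only a nearby window of suffix
    ('scratchpad') positions; distant suffix tokens, which the paper finds
    largely redundant, are pruned before decoding."""
    # Positions below decoded_upto are the prefix and always fall inside
    # decoded_upto + window, so one comparison covers both cases.
    return [i < decoded_upto + window for i in range(seq_len)]
```
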
SuanNi
Feb 27, 2026 · Artificial Intelligence

Can Deep Thought Ratio Reveal the True Reasoning Power of LLMs?

This article introduces the Deep Thought Ratio (DTR) metric, explains how tracking token modifications across neural network layers quantifies genuine inference effort, and shows through extensive experiments that DTR predicts accuracy far better than token length while enabling a sampling strategy that halves computational cost.

AI metrics · Inference Efficiency · LLM evaluation
0 likes · 9 min read
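
Taking the summary's description at face value, a DTR‑style measurement could be approximated with a logit‑lens pass over the layer stack; this is a heavily hedged illustration, and the paper's exact definition of DTR may differ.

```python
import torch

def deep_thought_ratio(layer_hidden_states, unembed):
    """layer_hidden_states: list of [T, d] tensors, one per layer; unembed
    maps hidden states to vocabulary logits. Measures how often the top-1
    prediction still changes between adjacent layers, reading late-layer
    changes as genuine per-token 'work' that raw token length misses."""
    changes, steps = 0.0, 0
    prev = None
    for h in layer_hidden_states:
        pred = unembed(h).argmax(dim=-1)          # logit-lens top-1 per position
        if prev is not None:
            changes += (pred != prev).float().mean().item()
            steps += 1
        prev = pred
    return changes / max(steps, 1)
```
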
AntTech
Sep 14, 2025 · Artificial Intelligence

Ring-mini-2.0: How a 16B MoE Model Delivers 128K Context and 500+ Tokens/s

Ring-mini-2.0 is a high‑performance MoE reasoning model that activates only 1.4 B of its 16 B parameters, matching the quality of dense models under 10 B while supporting a 128 K context length and generation speeds above 300 tokens/s.

AI · MoE · inference optimization
0 likes · 4 min read
Machine Heart
Apr 21, 2026 · Artificial Intelligence

Is Your Skill Document Slowing Down the Model? Strategy‑Based Genes Are the Better Solution

The article analyzes why large, document‑style Skill packages often degrade large‑model performance under limited inference budgets, introduces the compact, control‑dense Gene representation and the Gene Evolution Protocol (GEP), and shows through thousands of controlled experiments and CritPt benchmarks that Genes consistently outperform Skills, especially when the token budget is tight.

Agent · Experience · Gene
0 likes · 15 min read
Architect's Alchemy Furnace
Jul 17, 2025 · Artificial Intelligence

Explore the Ultimate Open-Source LLM Catalog: Models, Tools, and Resources

This article compiles a comprehensive, up‑to‑date inventory of open‑source large language models from Chinese and international organizations, detailing each model’s architecture, parameter count, multilingual capabilities, deployment requirements, and associated tools, offering a valuable reference for AI researchers and developers.

AI · LLM · Tools
0 likes · 50 min read
Data Party THU
Sep 8, 2025 · Artificial Intelligence

Why Small Language Models Will Dominate Agentic AI by 2025

By 2025, Agentic AI is shifting from massive LLMs to cost‑effective Small Language Models (SLMs), driven by their comparable performance, lower latency, and dramatically reduced inference and fine‑tuning costs, as detailed through market data, model benchmarks, migration steps, and real‑world case studies.

AI · Agentic AI · LLM
0 likes · 6 min read
Old Zhang's AI Learning
Feb 5, 2026 · Artificial Intelligence

Distilling GLM‑4.7‑Flash with Claude‑Opus‑4.5 for Easy Consumer‑GPU Deployment

The article explains how TeichAI used Claude‑Opus‑4.5 to generate a high‑quality 250‑sample reasoning dataset and distill the GLM‑4.7‑Flash model into a compact GGUF version that runs on a single consumer‑grade GPU via llama.cpp, detailing the workflow, quantization options, and practical considerations.

AI datasets · GGUF · Unsloth
0 likes · 6 min read