Baobao Algorithm Notes
Jan 7, 2025 · Artificial Intelligence

How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%

This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.

AI performance · DeepSeek · MFU
8 min read
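
As a quick orientation before the full derivation: MFU is the FLOPs a training run actually achieved divided by the FLOPs the hardware could theoretically deliver in the same wall-clock time. The sketch below uses the common 6·N·D approximation with DeepSeek V3's publicly reported figures (37B activated parameters, 14.8T tokens, 2.788M H800 GPU-hours, ~989 TFLOPS BF16 peak per H800); because 6·N·D ignores attention FLOPs, it lands a few points below the article's fuller ~37% accounting.

```python
# Minimal MFU sketch: achieved training FLOPs / peak FLOPs available.
# Uses the standard 6*N*D approximation (forward + backward) -- a lower
# bound, since it ignores the attention FLOPs the article also counts.

def mfu(n_active_params, n_tokens, gpu_hours, peak_flops_per_gpu):
    achieved = 6.0 * n_active_params * n_tokens
    available = gpu_hours * 3600.0 * peak_flops_per_gpu
    return achieved / available

# DeepSeek V3's publicly reported numbers:
# 37B activated params, 14.8T tokens, 2.788M H800 GPU-hours, 989 TFLOPS peak.
print(f"{mfu(37e9, 14.8e12, 2.788e6, 989e12):.1%}")  # ~33% here; ~37% with full accounting
```
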
Xiaohongshu Tech REDtech
Jun 19, 2025 · Artificial Intelligence

Can Adaptive Chain‑of‑Thought Learning Halve LLM Thinking Time?

The article introduces the Think When You Need (TWYN) method, a reinforcement‑learning approach that dynamically adapts chain‑of‑thought length, dramatically cuts redundant token generation in large language models, and maintains or improves accuracy across diverse reasoning benchmarks.

adaptive inference · chain of thought · efficiency
9 min read
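
The capsule above doesn't reproduce TWYN's objective, but the general idea behind adaptive-length RL is easy to sketch: reward correctness, and charge for thinking tokens beyond what the problem warrants. A toy illustration (the function, budget, and weights are hypothetical, not TWYN's actual formulation):

```python
# Toy length-aware reward for RL on reasoning traces.
# Hypothetical sketch of the general idea, NOT TWYN's actual objective.

def reward(is_correct: bool, n_tokens: int, budget: int, penalty: float = 0.001) -> float:
    """Pay for correctness; charge for tokens beyond a per-problem budget,
    so the policy learns to think long only when the problem needs it."""
    r = 1.0 if is_correct else 0.0
    overshoot = max(0, n_tokens - budget)
    return r - penalty * overshoot

# On an easy problem (small budget), a long chain is penalized:
print(reward(True, n_tokens=900, budget=200))  # 0.3
print(reward(True, n_tokens=150, budget=200))  # 1.0
```
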
Architecture Digest
Feb 25, 2025 · Artificial Intelligence

DeepSeek Distillation Technology: Overview, Innovations, Architecture, Training, Performance, and Challenges

This overview of DeepSeek’s distillation technology, which combines data distillation and model distillation to transfer knowledge from large teacher models to compact student models, covers its definitions, principles, key innovations, architecture, training methods, performance gains, and open challenges, especially in multimodal contexts.

AI research · DeepSeek · Knowledge Distillation
16 min read
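
For readers new to the topic, the foundation under everything the article covers is classic logit distillation: a temperature-softened KL term pulls the student toward the teacher while ordinary cross-entropy anchors it to the labels. A generic PyTorch sketch (Hinton-style, not DeepSeek's specific recipe):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic knowledge-distillation loss: temperature-softened KL toward
    the teacher, mixed with hard-label cross-entropy. Generic sketch,
    not DeepSeek's exact recipe."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage: student_logits, teacher_logits are (batch, vocab); labels are (batch,).
```
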
Machine Learning Algorithms & Natural Language Processing
Apr 22, 2026 · Artificial Intelligence

Turning Transformers into Mamba: A Cross‑Architecture Distillation That Linearizes Inference Cost

The article presents a two‑step cross‑architecture distillation method that replaces the quadratic softmax attention of Transformers with a learned linear attention and then maps it onto a Mamba backbone, achieving near‑teacher performance while reducing inference cost to linear time.

Cross‑Architecture · Distillation · Linear Attention
8 min read
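
The enabling trick in any softmax-to-linear conversion is reassociating the attention product: with a feature map φ, (φ(Q)φ(K)ᵀ)V can be computed as φ(Q)(φ(K)ᵀV), so no n×n matrix is ever materialized. A generic non-causal sketch with the common ELU+1 feature map (illustrative of linear attention in general, not the paper's distilled student):

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """O(n) attention via the kernel trick: phi(Q) @ (phi(K)^T @ V).
    Generic linear attention, not the paper's exact student architecture."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0  # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("nd,ne->de", k, v)   # (d, d): cost is linear in n
    z = q @ k.sum(dim=0)                   # (n,): per-row normalizer
    return (q @ kv) / (z.unsqueeze(-1) + eps)

n, d = 1024, 64
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
out = linear_attention(q, k, v)  # (n, d), no n x n matrix materialized
```
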
ITPUB
Apr 22, 2026 · Artificial Intelligence

Unveiling the ‘Elephant’: Ant’s Ling‑2.6‑flash LLM Delivers 1M Tokens for $0.10

Ant’s newly released Ling‑2.6‑flash model, previously tested under the anonymous alias “Elephant Alpha,” pairs a 104B‑parameter MoE design with only 7.4B active weights per inference. It delivers roughly ten‑fold token savings and top‑tier benchmark scores, and its $0.10 per‑million‑token price dramatically cuts inference costs for developers and enterprises.

AI inference · Token Efficiency · benchmark
6 min read
Xiaohongshu Tech REDtech
Oct 11, 2024 · Artificial Intelligence

Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference

HASS aligns the training and decoding contexts and objectives of speculative sampling through harmonized objective distillation and multi‑step context alignment, achieving 2.81–4.05× speedups over standard autoregressive decoding and an 8%–20% gain over EAGLE‑2 while preserving generation quality in real-world deployments at Xiaohongshu.

AI · HASS · Inference acceleration
11 min read
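
For context, the acceptance rule all speculative-sampling methods, HASS included, rely on: keep each drafted token with probability min(1, p_target/p_draft), which provably preserves the target model's output distribution. A minimal sketch of that generic rule (HASS's contribution, the harmonized training, is not shown here):

```python
import random

def accept_draft_token(p_target: float, p_draft: float) -> bool:
    """Standard speculative-sampling acceptance test: keep a drafted token
    with probability min(1, p_target / p_draft). On rejection the target
    resamples from the residual distribution (not shown). This is the
    generic rule HASS builds on, not HASS itself."""
    return random.random() < min(1.0, p_target / p_draft)

# Example: the draft was overconfident (q=0.8) where the target says p=0.4,
# so the token survives only about half the time.
keeps = sum(accept_draft_token(0.4, 0.8) for _ in range(10_000))
print(keeps / 10_000)  # ~0.5
```
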
AsiaInfo Technology: New Tech Exploration
Jun 23, 2025 · Artificial Intelligence

How Generative Data‑Driven Model Distillation Boosts Large‑Model Performance and Cuts Compute

This article examines generative data‑driven model distillation as a technique that not only compresses large language models but also improves their accuracy, addresses data‑privacy constraints, and reduces computational costs, offering a practical roadmap and real‑world results from a corporate AI platform.

AI Optimization · Knowledge Transfer · MaaS platform
22 min read
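
The basic loop the article describes can be sketched in a few lines: the teacher generates synthetic training pairs, and the student is fine-tuned on them, sidestepping raw (possibly private) user data. The function names below are hypothetical placeholders for whatever serving and training stack a platform actually uses:

```python
# Sequence-level ("generative data") distillation loop.
# Hypothetical sketch; `teacher_generate` and `finetune_student` stand in
# for a real serving / training stack.

def distill(teacher_generate, finetune_student, seed_prompts, n_per_prompt=4):
    """Build a synthetic corpus from teacher outputs, then fine-tune the
    student on it -- no raw user data required."""
    corpus = []
    for prompt in seed_prompts:
        for _ in range(n_per_prompt):
            corpus.append({"prompt": prompt, "response": teacher_generate(prompt)})
    return finetune_student(corpus)
```
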
Old Zhang's AI Learning
Feb 16, 2026 · Artificial Intelligence

A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AngelSlim introduces a full‑stack large‑model compression suite that uses quantization‑aware training to shrink a 1.8B LLM to 2‑bit precision with less than 4% accuracy loss. The suite supports a wide range of models and speculative decoding, and provides end‑to‑end deployment instructions for MacBook M4 and server environments.

AngelSlim · GGUF · QAT
13 min read
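
For orientation, the generic mechanism behind quantization-aware training: the forward pass sees fake-quantized weights while gradients flow through unchanged via the straight-through estimator. A minimal sketch (illustrative of QAT in general, not AngelSlim's actual 2-bit pipeline):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Symmetric fake quantization with a straight-through estimator:
    the forward value is quantized, the backward gradient is identity.
    Generic QAT building block, not AngelSlim's actual scheme."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 1 for 2-bit
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()                    # STE trick

w = torch.randn(4, 4, requires_grad=True)
loss = fake_quantize(w, bits=2).pow(2).sum()
loss.backward()   # gradients reach w despite the non-differentiable round()
```
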
AIWalker
Jan 18, 2025 · Artificial Intelligence

How InternLM 3.0 Achieves High Performance with Just 4T Tokens of Training Data

Shanghai AI Laboratory’s InternLM 3.0 upgrade demonstrates that a refined 4‑trillion‑token dataset can push a large‑language model’s performance past open‑source peers trained on 18 trillion tokens, cutting training cost by over 75% while merging regular dialogue with deep‑reasoning capabilities.

AI evaluation · Data Efficiency · InternLM
9 min read
Tencent Technical Engineering
Feb 19, 2025 · Artificial Intelligence

Reproduction and Analysis of DeepSeek R1/R1‑zero Reinforcement Learning Experiments

This note surveys four open‑source reproductions of the DeepSeek R1/R1‑zero reinforcement‑learning pipeline and re‑implements their training on math and logic datasets using Qwen‑based models. It shows that format‑plus‑accuracy rewards improve long‑chain reasoning, though stability and scaling remain challenges, and it outlines future directions for large‑scale RL and business deployment.

DeepSeek-R1 · large language model · long chain of thought
39 min read
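
The "format-plus-accuracy" reward these reproductions share is simple enough to sketch. A toy version, using the <think>/<answer> tag convention common in open R1-zero reproductions (the weights here are illustrative):

```python
import re

def r1_zero_style_reward(completion: str, gold_answer: str) -> float:
    """Toy format+accuracy reward in the style of open R1-zero
    reproductions: a small bonus for well-formed <think>/<answer> blocks,
    a larger one for a correct final answer. Weights are illustrative."""
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                  completion, re.DOTALL)
    if m is None:
        return 0.0   # malformed output earns nothing
    format_reward = 0.1
    answer_reward = 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0
    return format_reward + answer_reward

print(r1_zero_style_reward("<think>2+2...</think><answer>4</answer>", "4"))  # 1.1
```
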
AI Frontier Lectures
Apr 23, 2025 · Artificial Intelligence

Why Skipping the Thinking Step Makes Large Language Models More Accurate

UC Berkeley researchers found that forcing large language models to skip explicit reasoning—using a “NoThinking” mode—can achieve comparable or better accuracy with significantly fewer tokens, especially under token budget constraints, across math, coding, and theorem‑proving benchmarks.

NoThinking · Token Efficiency · reasoning
7 min read
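
As the article describes it, the mechanism is a decoding-time trick: prefill the model's thinking block with a closed stub so generation jumps straight to the answer. A hedged sketch of the prompt construction (the stub wording below is illustrative; the paper's exact string may differ):

```python
# Sketch of a NoThinking-style prompt: prefill a closed, trivial thinking
# block so the model proceeds directly to its final answer. The stub
# wording is illustrative, not necessarily the paper's exact string.

def nothinking_prompt(question: str) -> str:
    return (
        f"{question}\n"
        "<think>\n"
        "Okay, I have finished thinking.\n"
        "</think>\n"
    )

# The model continues from the closed </think> tag, spending its token
# budget on the answer rather than a long reasoning trace.
print(nothinking_prompt("Prove that sqrt(2) is irrational."))
```
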
Machine Heart
Apr 3, 2026 · Artificial Intelligence

Beyond Token Entropy: ReLaX Uses Latent Dynamics to Rethink Exploration‑Exploitation in LLM RL

The paper introduces ReLaX, a framework that shifts focus from token‑level entropy to the latent‑space dynamics of large models, employing Koopman operators and a Dynamic Spectral Divergence metric to quantitatively guide exploration‑exploitation balance, and demonstrates state‑of‑the‑art performance on both pure‑text and multimodal RL benchmarks.

Koopman operator · ReLaX · dynamic spectral divergence
12 min read
Baobao Algorithm Notes
Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning Compiler · LLM optimization · Linear Attention
22 min read
AI Engineering
Apr 13, 2026 · Artificial Intelligence

Why Your Tokens Burn Money Fast and How a Four‑Tier Model Stack Can Cut Costs

The article examines the rapid token consumption problem caused by popular LLM agents, proposes a four‑tier model hierarchy and concrete routing rules, and offers short‑term, long‑term, and budget‑friendly deployment recommendations to reduce expenses while maintaining performance.

LLM · model tiering · multi‑model deployment
7 min read
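
The article's exact routing rules aren't reproduced in this capsule, but the shape of a tiered router is easy to sketch. All tier names, descriptions, and thresholds below are hypothetical illustrations, not the article's configuration:

```python
# Hypothetical four-tier router: send each request to the cheapest model
# that can plausibly handle it. Tiers and rules are illustrative only.

TIERS = {
    "nano":      "cheap small model (classification, extraction)",
    "standard":  "mid-size model (summaries, routine chat)",
    "pro":       "frontier model (multi-step reasoning, code)",
    "reasoning": "long-thinking model (hard math, agent planning)",
}

def route(task_type: str, needs_reasoning: bool, context_tokens: int) -> str:
    if task_type in ("classify", "extract"):
        return "nano"
    if needs_reasoning:
        return "reasoning" if context_tokens > 20_000 else "pro"
    return "standard"

print(route("chat", needs_reasoning=False, context_tokens=500))  # standard
```
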
Data Party THU
Aug 11, 2025 · Artificial Intelligence

What Sets the Latest LLMs Apart? A Deep Dive into V3, OLMo, Gemma, Mistral, Llama 4 and More

This article systematically compares the architectures of recent large language models—including DeepSeek V3/R1, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen 3, SmolLM 3 and Kimi 2—highlighting innovations such as MLA, MoE, post‑norm, sliding‑window attention, NoPE and optimizer choices, with diagrams and code examples to illustrate their impact on efficiency and performance.

Comparison · LLM · MLA
12 min read
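
Of the architectural techniques listed, sliding-window attention is the simplest to show in code: each token attends only to the previous W positions instead of the full prefix, so KV-cache memory scales with the window rather than the sequence. A generic illustration (window size arbitrary; each model's actual configuration differs):

```python
import torch

def sliding_window_mask(n: int, window: int) -> torch.Tensor:
    """Causal sliding-window attention mask: position i may attend to
    positions j with i - window < j <= i. Generic illustration of the
    technique used by Gemma/Mistral-style models."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, window=3).int())
# Each row has at most 3 ones: attention cost and KV-cache memory scale
# with the window, not with the full sequence length.
```
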
Old Zhang's AI Learning
Mar 3, 2026 · Artificial Intelligence

How to Deploy and Fine‑Tune Qwen3.5 Small Models (0.8B‑9B) Locally

This guide walks you through deploying Qwen3.5's 0.8B, 2B, 4B and 9B models on CPUs or modest GPUs using Unsloth's GGUF quantization, explains hardware requirements, shows how to run them with llama.cpp, llama‑server, vLLM or SGLang, and provides a free Colab fine‑tuning workflow with export options.

AI Models · Fine-tuning · GGUF
19 min read
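
Once llama-server is running, any OpenAI-style client can talk to it over its /v1/chat/completions endpoint. A minimal Python sketch, assuming a server already launched locally on port 8080 with a Qwen3.5 GGUF loaded (the filename and port below are illustrative):

```python
import json
import urllib.request

# Minimal client for llama-server's OpenAI-compatible endpoint.
# Assumes a server already launched with something like:
#   llama-server -m qwen3.5-4b-q4_k_m.gguf --port 8080   (filename illustrative)
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Say hi in five words."}],
        "max_tokens": 64,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```
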
Baobao Algorithm Notes
Mar 28, 2024 · Artificial Intelligence

How Qwen1.5‑MoE‑A2.7B Matches 7B LLM Performance with Only 2.7B Activated Parameters

Qwen1.5‑MoE‑A2.7B is a Mixture‑of‑Experts model that activates only 2.7 billion parameters per token yet delivers performance comparable to leading 7‑billion‑parameter LLMs, while cutting training cost by 75% and boosting inference speed by 1.74×; the article details its architecture, benchmarks, efficiency analysis, and deployment steps.

Inference Speed · MoE · Model Benchmark
13 min read
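
The "activated parameters" arithmetic falls out of top-k expert routing: each token runs only the k experts its router selects, so the parameters touched per token are a small fraction of the total. A generic gating sketch (Qwen1.5-MoE's real design adds fine-grained and shared experts; only the routing idea is shown):

```python
import torch

def topk_moe(x, experts, router, k=2):
    """Generic top-k MoE layer for a single token: route to the k experts
    with the highest gate scores and mix their outputs. Only those k
    experts' parameters are touched -- hence 'activated parameters'.
    Illustrative only; Qwen1.5-MoE adds fine-grained and shared experts."""
    gates = torch.softmax(router(x), dim=-1)          # (n_experts,)
    topv, topi = gates.topk(k)
    topv = topv / topv.sum()                          # renormalize kept gates
    return sum(w * experts[i](x) for w, i in zip(topv, topi.tolist()))

d, n_experts = 16, 8
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
router = torch.nn.Linear(d, n_experts)
y = topk_moe(torch.randn(d), experts, router, k=2)    # (d,)
```
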
Meituan Technology Team
Sep 1, 2025 · Artificial Intelligence

LongCat-Flash-Chat: 560B MoE Model with 27B Active Params Sets New Benchmarks

LongCat-Flash-Chat, an open‑source 560‑billion‑parameter Mixture‑of‑Experts model that activates only 18.6‑31.3B parameters per token (about 27B on average), delivers state‑of‑the‑art performance on general, agentic, coding, and instruction‑following benchmarks while offering fast inference and efficient deployment options.

AI model · Agentic AI · LongCat-Flash-Chat
7 min read