Baobao Algorithm Notes
Jan 7, 2025 · Artificial Intelligence

How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%

This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.

AI performance · DeepSeek · MFU
8 min read
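
As a quick orientation before the full derivation: MFU is the FLOPs a training run actually achieved divided by the FLOPs the hardware could theoretically deliver in the same wall-clock time. The sketch below uses the common 6·N·D approximation with DeepSeek V3's publicly reported figures (37B activated parameters, 14.8T tokens, 2.788M H800 GPU-hours, ~989 TFLOPS BF16 peak per H800); because 6·N·D ignores attention FLOPs, it lands a few points below the article's fuller ~37% accounting.

```python
# Minimal MFU sketch: achieved training FLOPs / peak FLOPs available.
# Uses the standard 6*N*D approximation (forward + backward) -- a lower
# bound, since it ignores the attention FLOPs the article also counts.

def mfu(n_active_params, n_tokens, gpu_hours, peak_flops_per_gpu):
    achieved = 6.0 * n_active_params * n_tokens
    available = gpu_hours * 3600.0 * peak_flops_per_gpu
    return achieved / available

# DeepSeek V3's publicly reported numbers:
# 37B activated params, 14.8T tokens, 2.788M H800 GPU-hours, 989 TFLOPS peak.
print(f"{mfu(37e9, 14.8e12, 2.788e6, 989e12):.1%}")  # ~33% here; ~37% with full accounting
```
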
Xiaohongshu Tech REDtech
Jun 19, 2025 · Artificial Intelligence

Can Adaptive Chain‑of‑Thought Learning Halve LLM Thinking Time?

The article introduces the Think When You Need (TWYN) method, a reinforcement‑learning approach that dynamically adapts chain‑of‑thought length, dramatically cuts redundant token generation in large language models, and maintains or improves accuracy across diverse reasoning benchmarks.

adaptive inference · chain of thought · efficiency
9 min read
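
The capsule above doesn't reproduce TWYN's objective, but the general idea behind adaptive-length RL is easy to sketch: reward correctness, and charge for thinking tokens beyond what the problem warrants. A toy illustration (the function, budget, and weights are hypothetical, not TWYN's actual formulation):

```python
# Toy length-aware reward for RL on reasoning traces.
# Hypothetical sketch of the general idea, NOT TWYN's actual objective.

def reward(is_correct: bool, n_tokens: int, budget: int, penalty: float = 0.001) -> float:
    """Pay for correctness; charge for tokens beyond a per-problem budget,
    so the policy learns to think long only when the problem needs it."""
    r = 1.0 if is_correct else 0.0
    overshoot = max(0, n_tokens - budget)
    return r - penalty * overshoot

# On an easy problem (small budget), a long chain is penalized:
print(reward(True, n_tokens=900, budget=200))  # 0.3
print(reward(True, n_tokens=150, budget=200))  # 1.0
```
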
Architecture Digest
Feb 25, 2025 · Artificial Intelligence

DeepSeek Distillation Technology: Overview, Innovations, Architecture, Training, Performance, and Challenges

This overview of DeepSeek’s distillation technology, which combines data distillation and model distillation to transfer knowledge from large teacher models to compact student models, covers its definitions, principles, key innovations, architecture, training methods, performance gains, and open challenges, especially in multimodal contexts.

AI research · DeepSeek · Knowledge Distillation
16 min read
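
For readers new to the topic, the foundation under everything the article covers is classic logit distillation: a temperature-softened KL term pulls the student toward the teacher while ordinary cross-entropy anchors it to the labels. A generic PyTorch sketch (Hinton-style, not DeepSeek's specific recipe):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic knowledge-distillation loss: temperature-softened KL toward
    the teacher, mixed with hard-label cross-entropy. Generic sketch,
    not DeepSeek's exact recipe."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage: student_logits, teacher_logits are (batch, vocab); labels are (batch,).
```
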
Machine Learning Algorithms & Natural Language Processing
Apr 22, 2026 · Artificial Intelligence

Turning Transformers into Mamba: A Cross‑Architecture Distillation That Linearizes Inference Cost

The article presents a two‑step cross‑architecture distillation method that replaces the quadratic softmax attention of Transformers with a learned linear attention and then maps it onto a Mamba backbone, achieving near‑teacher performance while reducing inference cost to linear time.

Cross‑Architecture · Distillation · Linear Attention
8 min read
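
The enabling trick in any softmax-to-linear conversion is reassociating the attention product: with a feature map φ, (φ(Q)φ(K)ᵀ)V can be computed as φ(Q)(φ(K)ᵀV), so no n×n matrix is ever materialized. A generic non-causal sketch with the common ELU+1 feature map (illustrative of linear attention in general, not the paper's distilled student):

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """O(n) attention via the kernel trick: phi(Q) @ (phi(K)^T @ V).
    Generic linear attention, not the paper's exact student architecture."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0  # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("nd,ne->de", k, v)   # (d, d): cost is linear in n
    z = q @ k.sum(dim=0)                   # (n,): per-row normalizer
    return (q @ kv) / (z.unsqueeze(-1) + eps)

n, d = 1024, 64
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
out = linear_attention(q, k, v)  # (n, d), no n x n matrix materialized
```
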
ITPUB
Apr 22, 2026 · Artificial Intelligence

Unveiling the ‘Elephant’: Ant’s Ling‑2.6‑flash LLM Delivers 1M Tokens for $0.10

Ant’s newly released Ling‑2.6‑flash model, previously tested under the anonymous alias “Elephant Alpha,” pairs a 104B‑parameter MoE design with only 7.4B active weights per inference. It delivers roughly ten‑fold token savings and top‑tier benchmark scores, and its $0.10 per‑million‑token price dramatically cuts inference costs for developers and enterprises.

AI inference · Token Efficiency · benchmark
6 min read
Xiaohongshu Tech REDtech
Oct 11, 2024 · Artificial Intelligence

Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference

HASS aligns the training and decoding contexts and objectives of speculative sampling through harmonized objective distillation and multi‑step context alignment, achieving 2.81–4.05× speedups over standard autoregressive decoding and an 8%–20% gain over EAGLE‑2 while preserving generation quality in real-world deployments at Xiaohongshu.

AI · HASS · Inference acceleration
11 min read
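
For context, the acceptance rule all speculative-sampling methods, HASS included, rely on: keep each drafted token with probability min(1, p_target/p_draft), which provably preserves the target model's output distribution. A minimal sketch of that generic rule (HASS's contribution, the harmonized training, is not shown here):

```python
import random

def accept_draft_token(p_target: float, p_draft: float) -> bool:
    """Standard speculative-sampling acceptance test: keep a drafted token
    with probability min(1, p_target / p_draft). On rejection the target
    resamples from the residual distribution (not shown). This is the
    generic rule HASS builds on, not HASS itself."""
    return random.random() < min(1.0, p_target / p_draft)

# Example: the draft was overconfident (q=0.8) where the target says p=0.4,
# so the token survives only about half the time.
keeps = sum(accept_draft_token(0.4, 0.8) for _ in range(10_000))
print(keeps / 10_000)  # ~0.5
```
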
AsiaInfo Technology: New Tech Exploration
Jun 23, 2025 · Artificial Intelligence

How Generative Data‑Driven Model Distillation Boosts Large‑Model Performance and Cuts Compute

This article examines generative data‑driven model distillation as a technique that not only compresses large language models but also improves their accuracy, addresses data‑privacy constraints, and reduces computational costs, offering a practical roadmap and real‑world results from a corporate AI platform.

AI Optimization · Knowledge Transfer · MaaS platform
22 min read
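
The basic loop the article describes can be sketched in a few lines: the teacher generates synthetic training pairs, and the student is fine-tuned on them, sidestepping raw (possibly private) user data. The function names below are hypothetical placeholders for whatever serving and training stack a platform actually uses:

```python
# Sequence-level ("generative data") distillation loop.
# Hypothetical sketch; `teacher_generate` and `finetune_student` stand in
# for a real serving / training stack.

def distill(teacher_generate, finetune_student, seed_prompts, n_per_prompt=4):
    """Build a synthetic corpus from teacher outputs, then fine-tune the
    student on it -- no raw user data required."""
    corpus = []
    for prompt in seed_prompts:
        for _ in range(n_per_prompt):
            corpus.append({"prompt": prompt, "response": teacher_generate(prompt)})
    return finetune_student(corpus)
```
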
Old Zhang's AI Learning
Feb 16, 2026 · Artificial Intelligence

A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AngelSlim introduces a full‑stack large‑model compression suite that uses quantization‑aware training to shrink a 1.8B LLM to 2‑bit precision with less than 4% accuracy loss. The suite supports a wide range of models and speculative decoding, and provides end‑to‑end deployment instructions for MacBook M4 and server environments.

AngelSlim · GGUF · QAT
13 min read
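
For orientation, the generic mechanism behind quantization-aware training: the forward pass sees fake-quantized weights while gradients flow through unchanged via the straight-through estimator. A minimal sketch (illustrative of QAT in general, not AngelSlim's actual 2-bit pipeline):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Symmetric fake quantization with a straight-through estimator:
    the forward value is quantized, the backward gradient is identity.
    Generic QAT building block, not AngelSlim's actual scheme."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 1 for 2-bit
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()                    # STE trick

w = torch.randn(4, 4, requires_grad=True)
loss = fake_quantize(w, bits=2).pow(2).sum()
loss.backward()   # gradients reach w despite the non-differentiable round()
```
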
AIWalker
Jan 18, 2025 · Artificial Intelligence

How InternLM 3.0 Achieves High Performance with Just 4T Tokens of Training Data

Shanghai AI Laboratory’s InternLM 3.0 upgrade demonstrates that a refined 4‑trillion‑token dataset can push a large‑language model’s performance past open‑source peers trained on 18 trillion tokens, cutting training cost by over 75% while merging regular dialogue with deep‑reasoning capabilities.

AI evaluation · Data Efficiency · InternLM
9 min read
Tencent Technical Engineering
Feb 19, 2025 · Artificial Intelligence

Reproduction and Analysis of DeepSeek R1/R1‑zero Reinforcement Learning Experiments

This note surveys four open‑source reproductions of the DeepSeek R1/R1‑zero reinforcement‑learning pipeline and re‑implements their training on math and logic datasets using Qwen‑based models. It shows that format‑plus‑accuracy rewards improve long‑chain reasoning, though stability and scaling remain challenges, and it outlines future directions for large‑scale RL and business deployment.

DeepSeek-R1 · large language model · long chain of thought
39 min read
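
The "format-plus-accuracy" reward these reproductions share is simple enough to sketch. A toy version, using the <think>/<answer> tag convention common in open R1-zero reproductions (the weights here are illustrative):

```python
import re

def r1_zero_style_reward(completion: str, gold_answer: str) -> float:
    """Toy format+accuracy reward in the style of open R1-zero
    reproductions: a small bonus for well-formed <think>/<answer> blocks,
    a larger one for a correct final answer. Weights are illustrative."""
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                  completion, re.DOTALL)
    if m is None:
        return 0.0   # malformed output earns nothing
    format_reward = 0.1
    answer_reward = 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0
    return format_reward + answer_reward

print(r1_zero_style_reward("<think>2+2...</think><answer>4</answer>", "4"))  # 1.1
```
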
AI Frontier Lectures
Apr 23, 2025 · Artificial Intelligence

Why Skipping the Thinking Step Makes Large Language Models More Accurate

UC Berkeley researchers found that forcing large language models to skip explicit reasoning—using a “NoThinking” mode—can achieve comparable or better accuracy with significantly fewer tokens, especially under token budget constraints, across math, coding, and theorem‑proving benchmarks.

NoThinking · Token Efficiency · reasoning
7 min read
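
As the article describes it, the mechanism is a decoding-time trick: prefill the model's thinking block with a closed stub so generation jumps straight to the answer. A hedged sketch of the prompt construction (the stub wording below is illustrative; the paper's exact string may differ):

```python
# Sketch of a NoThinking-style prompt: prefill a closed, trivial thinking
# block so the model proceeds directly to its final answer. The stub
# wording is illustrative, not necessarily the paper's exact string.

def nothinking_prompt(question: str) -> str:
    return (
        f"{question}\n"
        "<think>\n"
        "Okay, I have finished thinking.\n"
        "</think>\n"
    )

# The model continues from the closed </think> tag, spending its token
# budget on the answer rather than a long reasoning trace.
print(nothinking_prompt("Prove that sqrt(2) is irrational."))
```
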
Machine Heart
Apr 3, 2026 · Artificial Intelligence

Beyond Token Entropy: ReLaX Uses Latent Dynamics to Rethink Exploration‑Exploitation in LLM RL

The paper introduces ReLaX, a framework that shifts focus from token‑level entropy to the latent‑space dynamics of large models, employing Koopman operators and a Dynamic Spectral Divergence metric to quantitatively guide exploration‑exploitation balance, and demonstrates state‑of‑the‑art performance on both pure‑text and multimodal RL benchmarks.

Koopman operator · ReLaX · dynamic spectral divergence
12 min read
Baobao Algorithm Notes
Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning Compiler · LLM optimization · Linear Attention
22 min read
AI Engineering
Apr 13, 2026 · Artificial Intelligence

Why Your Tokens Burn Money Fast and How a Four‑Tier Model Stack Can Cut Costs

The article examines the rapid token consumption problem caused by popular LLM agents, proposes a four‑tier model hierarchy and concrete routing rules, and offers short‑term, long‑term, and budget‑friendly deployment recommendations to reduce expenses while maintaining performance.

LLM · model tiering · multi‑model deployment
7 min read
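
The article's exact routing rules aren't reproduced in this capsule, but the shape of a tiered router is easy to sketch. All tier names, descriptions, and thresholds below are hypothetical illustrations, not the article's configuration:

```python
# Hypothetical four-tier router: send each request to the cheapest model
# that can plausibly handle it. Tiers and rules are illustrative only.

TIERS = {
    "nano":      "cheap small model (classification, extraction)",
    "standard":  "mid-size model (summaries, routine chat)",
    "pro":       "frontier model (multi-step reasoning, code)",
    "reasoning": "long-thinking model (hard math, agent planning)",
}

def route(task_type: str, needs_reasoning: bool, context_tokens: int) -> str:
    if task_type in ("classify", "extract"):
        return "nano"
    if needs_reasoning:
        return "reasoning" if context_tokens > 20_000 else "pro"
    return "standard"

print(route("chat", needs_reasoning=False, context_tokens=500))  # standard
```
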
Data Party THU
Aug 11, 2025 · Artificial Intelligence

What Sets the Latest LLMs Apart? A Deep Dive into V3, OLMo, Gemma, Mistral, Llama 4 and More

This article systematically compares the architectures of recent large language models—including DeepSeek V3/R1, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen 3, SmolLM 3 and Kimi 2—highlighting innovations such as MLA, MoE, post‑norm, sliding‑window attention, NoPE and optimizer choices, with diagrams and code examples to illustrate their impact on efficiency and performance.

Comparison · LLM · MLA
12 min read
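
Of the architectural techniques listed, sliding-window attention is the simplest to show in code: each token attends only to the previous W positions instead of the full prefix, so KV-cache memory scales with the window rather than the sequence. A generic illustration (window size arbitrary; each model's actual configuration differs):

```python
import torch

def sliding_window_mask(n: int, window: int) -> torch.Tensor:
    """Causal sliding-window attention mask: position i may attend to
    positions j with i - window < j <= i. Generic illustration of the
    technique used by Gemma/Mistral-style models."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, window=3).int())
# Each row has at most 3 ones: attention cost and KV-cache memory scale
# with the window, not with the full sequence length.
```
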
Old Zhang's AI Learning
Mar 3, 2026 · Artificial Intelligence

How to Deploy and Fine‑Tune Qwen3.5 Small Models (0.8B‑9B) Locally

This guide walks you through deploying Qwen3.5's 0.8B, 2B, 4B and 9B models on CPUs or modest GPUs using Unsloth's GGUF quantization, explains hardware requirements, shows how to run them with llama.cpp, llama‑server, vLLM or SGLang, and provides a free Colab fine‑tuning workflow with export options.

AI Models · Fine-tuning · GGUF
19 min read
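
Once llama-server is running, any OpenAI-style client can talk to it over its /v1/chat/completions endpoint. A minimal Python sketch, assuming a server already launched locally on port 8080 with a Qwen3.5 GGUF loaded (the filename and port below are illustrative):

```python
import json
import urllib.request

# Minimal client for llama-server's OpenAI-compatible endpoint.
# Assumes a server already launched with something like:
#   llama-server -m qwen3.5-4b-q4_k_m.gguf --port 8080   (filename illustrative)
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Say hi in five words."}],
        "max_tokens": 64,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```
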
Baobao Algorithm Notes
Mar 28, 2024 · Artificial Intelligence

How Qwen1.5‑MoE‑A2.7B Matches 7B LLM Performance with Only 2.7B Activated Parameters

Qwen1.5‑MoE‑A2.7B is a Mixture‑of‑Experts model that activates only 2.7 billion parameters per token yet delivers performance comparable to leading 7‑billion‑parameter LLMs, while cutting training cost by 75% and boosting inference speed by 1.74×; the article details its architecture, benchmarks, efficiency analysis, and deployment steps.

Inference Speed · MoE · Model Benchmark
13 min read
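
The "activated parameters" arithmetic falls out of top-k expert routing: each token runs only the k experts its router selects, so the parameters touched per token are a small fraction of the total. A generic gating sketch (Qwen1.5-MoE's real design adds fine-grained and shared experts; only the routing idea is shown):

```python
import torch

def topk_moe(x, experts, router, k=2):
    """Generic top-k MoE layer for a single token: route to the k experts
    with the highest gate scores and mix their outputs. Only those k
    experts' parameters are touched -- hence 'activated parameters'.
    Illustrative only; Qwen1.5-MoE adds fine-grained and shared experts."""
    gates = torch.softmax(router(x), dim=-1)          # (n_experts,)
    topv, topi = gates.topk(k)
    topv = topv / topv.sum()                          # renormalize kept gates
    return sum(w * experts[i](x) for w, i in zip(topv, topi.tolist()))

d, n_experts = 16, 8
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
router = torch.nn.Linear(d, n_experts)
y = topk_moe(torch.randn(d), experts, router, k=2)    # (d,)
```
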
Meituan Technology Team
Sep 1, 2025 · Artificial Intelligence

LongCat-Flash-Chat: 560B MoE Model with 27B Active Params Sets New Benchmarks

LongCat-Flash-Chat, an open‑source 560‑billion‑parameter Mixture‑of‑Experts model that activates only 18.6‑31.3B parameters per token (about 27B on average), delivers state‑of‑the‑art performance on general, agentic, coding, and instruction‑following benchmarks while offering fast inference and efficient deployment options.

AI model · Agentic AI · LongCat-Flash-Chat
7 min read