Collection size: 100 articles · Page 1 of 5
360 Smart Cloud
Dec 3, 2025 · Artificial Intelligence

How Model Distillation Enhances LLM Performance on the TLM Platform

This article explains the TLM large‑model development platform and details how knowledge distillation—using soft labels, temperature scaling, and combined loss functions—compresses teacher models into efficient student models, with practical steps and evaluation on the platform.

AI · LLM · TLM platform
0 likes · 5 min read
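The summary above names the standard ingredients of soft-label distillation: soft labels from the teacher, temperature scaling, and a combined loss. Below is a minimal PyTorch sketch of such a loss, not code from the TLM article itself; the temperature T and weight alpha are illustrative values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-label term: KL divergence between temperature-scaled distributions.
    # Scaling by T*T keeps its gradient magnitude comparable to the hard-label term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Combined loss: weighted sum of the two terms.
    return alpha * soft + (1 - alpha) * hard
```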
Old Zhang's AI Learning
Mar 12, 2026 · Artificial Intelligence

Distilling Claude Opus 4.6 into Qwen3.5‑27B: High‑Quality Reasoning on a Single RTX 3090

The article details how Claude Opus 4.6's chain‑of‑thought data were used to distill the 27‑billion‑parameter Qwen3.5‑27B model with Unsloth and LoRA, achieving full‑context inference on a single RTX 3090/4090, while outlining performance numbers, hyper‑parameter tips, benchmark gains and the trade‑offs of losing multimodal abilities.

Claude Opus 4.6 · GPU inference · LoRA
0 likes · 7 min read
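The recipe summarized above is LoRA-based SFT on teacher-generated chain-of-thought traces. A minimal Hugging Face peft sketch of the adapter setup follows; the base checkpoint is a stand-in, and the article's actual run uses Qwen3.5‑27B via Unsloth, which wraps the same idea.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in base checkpoint; substitute the model you are actually distilling onto.
base_id = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")

# LoRA freezes the base weights and trains small low-rank adapters on the
# attention projections, which is what keeps single-GPU fine-tuning feasible.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training then proceeds as ordinary SFT on the teacher's chain-of-thought data,
# e.g. with trl's SFTTrainer.
```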
Alibaba Cloud Big Data AI Platform
Feb 25, 2025 · Artificial Intelligence

How DistilQwen2.5 Boosts LLM Efficiency with Dual‑Stage Knowledge Distillation

This article introduces DistilQwen2.5, a lightweight LLM series built on Qwen2.5 that uses a novel two‑layer distillation framework, instruction‑data optimization, and parameter‑fusion techniques to achieve higher performance while drastically reducing computational cost and deployment overhead.

Efficient Inference · LLM · knowledge distillation
0 likes · 26 min read
Alibaba Cloud Big Data AI Platform
Apr 22, 2025 · Artificial Intelligence

How DistilQwen2.5-DS3-0324 Achieves Fast, Accurate Reasoning via Quick‑Think Distillation

This article introduces DistilQwen2.5-DS3-0324, a distilled language model series that balances rapid inference with strong reasoning by applying a fast‑thinking chain‑of‑thought strategy, details its two‑stage distillation framework and its evaluation on diverse benchmarks, and provides code for downloading and using the models.

chain of thought · deep learning · fast inference
0 likes · 17 min read
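For readers who just want to try a distilled checkpoint, the standard transformers loading pattern looks like the sketch below; the repo id is a placeholder, so check the article for the exact released model names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- see the article for the exact published checkpoints.
model_id = "alibaba-pai/DistilQwen2.5-DS3-0324-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```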
Alibaba Cloud Big Data AI Platform
Jun 30, 2025 · Artificial Intelligence

Unlocking Small LLM Power: Variable‑Length Chain Distillation with DistillQwen‑ThoughtY

This article introduces a variable‑length chain‑of‑thought distillation technique built on Alibaba Cloud PAI’s EasyDistill toolkit, presents the high‑quality OmniThought‑0528 dataset, details the training of the DistillQwen‑ThoughtY 4B/8B/32B models, and provides code and usage examples for researchers and practitioners.

Distillation · LLM · chain of thought
0 likes · 15 min read
Baobao Algorithm Notes
Jun 3, 2025 · Artificial Intelligence

Can 1K Fine‑Tuning Replace 100K RL Steps? Insights from Re‑distillation Research

An extensive analysis shows that a 1K‑sample fine‑tuning stage can replicate the generalization gains of thousands of reinforcement‑learning steps, explains the compressibility of RL, introduces a sample‑effect theory, and demonstrates that re‑distillation and small‑scale SFT dramatically improve LLM performance.

Re-distillation · Sample Effect · large language models
0 likes · 23 min read
Alibaba Cloud Big Data AI Platform
May 28, 2025 · Artificial Intelligence

How EasyDistill Simplifies LLM Knowledge Distillation for Faster, Smaller Models

EasyDistill, an open‑source toolkit from Alibaba Cloud AI Platform, streamlines knowledge distillation of large language models by offering modular data synthesis, black‑box and white‑box training, reinforcement‑learning and preference‑optimization techniques, enabling the creation of compact, high‑performance DistilQwen models and accompanying datasets.

DistilQwen · EasyDistill · knowledge distillation
0 likes · 17 min read
Meituan Technology Team
Jan 8, 2026 · Artificial Intelligence

Must‑Read AAAI 2026 Papers: Efficient Reasoning, Annealing, Multimodal Diffusion & More

This article curates eight AAAI 2026 papers authored by the Meituan research team, covering verifiable stepwise rewards for LLM reasoning, annealing strategies in large‑scale training, process reward models, competence‑difficulty sampling, high‑fidelity visual text rendering, counterfactual fusion, compress‑then‑rank reranking, and cross‑modal quantization for generative recommendation, with direct PDF links for each work.

AAAI2026 · Counterfactual · LLM
0 likes · 14 min read
Baobao Algorithm Notes
Apr 27, 2025 · Artificial Intelligence

How DeepSeek R1T‑Chimera Cuts Tokens by 40% Without Fine‑Tuning

The DeepSeek‑R1T‑Chimera model merges DeepSeek‑R1 reasoning with the V3‑0324 architecture, reusing most V3 weights and swapping in only R1's routed expert weights, achieving R1‑level intelligence while reducing output tokens by about 40% and running faster, all without any fine‑tuning or distillation.

Artificial Intelligence · DeepSeek · LLM
0 likes · 5 min read
Xiaohongshu Tech REDtech
Nov 4, 2025 · Artificial Intelligence

Unveiling the Law of Capacity Gap: Boosting Language Model Distillation Efficiency

At ACL 2025, a collaborative paper introduced the Law of Capacity Gap, revealing a linear 2.5× optimal teacher‑student size relationship in language model distillation, dramatically cutting compute costs and achieving Pareto‑optimal efficiency, with the MiniMA model as a successful demonstration.

Distillation · MiniMA · artificial-intelligence
0 likes · 7 min read
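Stated as a formula, the relationship described above says the compute-optimal teacher size grows linearly with the student size, with a slope of roughly 2.5. This is a paraphrase of the summary, not the paper's exact notation:

```latex
N_{\text{teacher}}^{\ast} \;\approx\; 2.5 \cdot N_{\text{student}}
```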
AntTech
Jun 21, 2025 · Artificial Intelligence

Ring-lite: Open‑Source Lightweight MoE Model Sets SOTA on AIME and LiveCodeBench

Ring-lite, an open‑source lightweight Mixture‑of‑Experts reasoning model built on Ling‑lite‑1.5, introduces the C3PO reinforcement‑learning training method and achieves state‑of‑the‑art results on benchmarks such as AIME24/25, LiveCodeBench, Codeforces, and GPQA‑Diamond, while offering full transparency of weights, code, and data.

AI inference · C3PO · benchmark
0 likes · 11 min read
Didi Tech
Mar 12, 2026 · Artificial Intelligence

How STAPO Improves Large‑Model Fine‑Tuning by Silencing Spurious Tokens

The STAPO (Spurious‑Token‑Aware Policy Optimization) algorithm, introduced by Tsinghua University's iDLab and Didi's Deep Sea Lab, tackles policy‑entropy instability and performance oscillation in reinforcement‑learning fine‑tuning of large models by mathematically analyzing token collision probability, defining spurious tokens, and applying a Silencing Spurious Tokens mechanism that yields state‑of‑the‑art results on multiple math‑reasoning benchmarks.

AI safety · Fine-tuning · STAPO
0 likes · 7 min read
Old Zhang's AI Learning
Apr 26, 2026 · Artificial Intelligence

Distilling Claude Opus into Qwen3.6-27B – GGUF Lets You Run Locally on Consumer GPUs

The preview model Qwopus3.6-27B‑v1, distilled from Claude Opus onto Qwen3.6‑27B via SFT on the Unsloth stack with a curated set of 12K high‑quality reasoning samples, is evaluated on agentic reasoning, front‑end design, and Canvas/WebGL tasks on an RTX 5090, and can be deployed locally via llama.cpp GGUF quantizations following detailed memory guidelines.

Apache 2.0 · Claude Opus · GGUF
0 likes · 7 min read
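Local GGUF deployment typically goes through llama.cpp or its Python binding. Here is a minimal llama-cpp-python sketch; the file path and prompt are placeholders, and the quantization choice and context length should follow the memory guidelines in the article.

```python
from llama_cpp import Llama

# Placeholder path to a downloaded GGUF quantization of the distilled model.
llm = Llama(
    model_path="./Qwopus3.6-27B-v1.Q4_K_M.gguf",
    n_ctx=8192,        # context window; raise it if VRAM allows
    n_gpu_layers=-1,   # offload all layers to the GPU
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Sketch a minimal WebGL triangle demo."}],
    max_tokens=512,
)
print(result["choices"][0]["message"]["content"])
```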
Alibaba Cloud Big Data AI Platform
Mar 29, 2025 · Artificial Intelligence

How DistilQwen2.5‑R1 Boosts Small‑Model Reasoning with Innovative Knowledge Distillation

The article introduces the DistilQwen2.5‑R1 series, which leverages a novel knowledge‑distillation pipeline—including CoT data evaluation, improvement, and validation—to transfer deep reasoning abilities from large models like DeepSeek‑R1 to compact models, achieving superior performance across math, code, and scientific benchmarks and providing open‑source checkpoints and deployment guides for practical use.

AI inference · benchmark evaluation · knowledge distillation
0 likes · 17 min read
AntTech
Apr 23, 2026 · Artificial Intelligence

Ling-2.6-flash: Faster Response, Stronger Execution, and Higher Token Efficiency for Agent Workloads

Ling-2.6-flash is a 104B‑parameter Instruct model that uses a mixed‑linear architecture and token‑efficiency optimizations to achieve up to 340 tokens/s inference speed, 4× higher throughput than comparable models, and ten‑fold lower token consumption on Agent benchmarks, while maintaining SOTA performance.

Agent Optimization · Inference Efficiency · LLM
0 likes · 15 min read
DataFunTalk
Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while reducing compute cost by up to tenfold, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

Model Efficiency · On-Policy Distillation · knowledge distillation
0 likes · 15 min read
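The core of on-policy distillation is that the student generates its own rollouts and the teacher only scores them, typically with a per-token reverse KL. Below is a minimal sketch of that objective, assuming you already have teacher and student logits over the same student-sampled sequence; it illustrates the idea and is not Thinking Machines Lab's code.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL(student || teacher) on a student-generated sequence.

    Both tensors have shape [seq_len, vocab_size] and are scored on the same
    tokens that the student itself sampled (that is what makes it on-policy).
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL penalizes the student for putting mass where the teacher would not,
    # giving dense per-token feedback instead of a single RL reward per episode.
    reverse_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return reverse_kl.mean()
```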
Top Architect
Feb 14, 2025 · Artificial Intelligence

DeepSeek Model Distillation: Principles, Innovations, Architecture, and Performance

This article provides an in‑depth overview of DeepSeek’s model distillation technology, covering its definition, core principles, innovative data‑model distillation integration, architecture design, training strategies, performance gains, and the challenges of scaling to multimodal data.

AI Optimization · DeepSeek · Knowledge Transfer
0 likes · 16 min read
Baobao Algorithm Notes
Jun 28, 2024 · Artificial Intelligence

What Makes Gemma 2 a Competitive Open‑Source LLM? Architecture, Training, and Evaluation Insights

The article provides a detailed technical overview of Gemma 2, covering its decoder‑only transformer design, novel attention mechanisms, logit soft‑capping, RMSNorm, knowledge‑distillation training on trillions of tokens, extensive pre‑training infrastructure, and benchmark evaluations that demonstrate its competitiveness against larger proprietary models.

AI · Gemma 2 · benchmark evaluation
0 likes · 14 min read
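Among the mechanisms listed above, logit soft-capping is easy to show concretely: instead of hard clipping, logits are squashed smoothly with a tanh. A small sketch follows; the cap values are the ones reported for Gemma 2 in its technical report, not numbers from this summary.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly bounds logits to the open interval (-cap, cap) while staying
    # differentiable, unlike a hard clamp.
    return cap * torch.tanh(logits / cap)

# Gemma 2 reportedly applies this with cap=50.0 on attention logits and
# cap=30.0 on the final output logits.
attn_scores = torch.randn(4, 4) * 100
print(soft_cap(attn_scores, cap=50.0))
```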