Collection size: 100 articles · Page 3 of 5
Baidu Intelligent Cloud Tech Hub
Jan 27, 2026 · Artificial Intelligence

Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide

This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, benchmarking performance, and running full‑parameter DPO training using LLaMA‑Factory, providing scripts, configuration files, and troubleshooting tips for AI engineers.
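
At its core, the DPO stage optimizes a pairwise preference loss between the trained policy and a frozen reference model. A minimal sketch of that loss in generic PyTorch (independent of the Kunlun/LLaMA-Factory setup described above; tensor names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss over per-sequence log-probabilities.

    Each argument has shape (batch,) and holds log p(response | prompt)
    summed over response tokens, from the trained policy or the frozen
    reference model.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred response's margin above the dispreferred one's.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```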

DPO · Inference · Kunlun P800
32 min read
Baobao Algorithm Notes
Mar 28, 2025 · Artificial Intelligence

Can Small 7B Models Beat the State‑of‑the‑Art? A Critical Analysis of R1‑Zero Training and Unbiased GRPO

This article critically examines R1‑Zero‑style training by analyzing foundation models and reinforcement learning, uncovering pre‑training and optimization biases, proposing an unbiased Dr. GRPO method, and demonstrating a minimalist 7B‑model recipe that achieves new state‑of‑the‑art performance on AIME 2024.
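
The "unbiased" part concerns how group-relative advantages are normalized. A rough illustrative sketch of the difference (not the paper's full algorithm):

```python
import torch

def group_relative_advantages(rewards, unbiased=True, eps=1e-6):
    """Advantages for one prompt's group of sampled responses.

    rewards: shape (group_size,), one scalar reward per sampled response.
    unbiased=True subtracts only the group mean (Dr. GRPO-style);
    unbiased=False also divides by the group's reward std, the normalization
    the article flags as a source of optimization bias.
    """
    centered = rewards - rewards.mean()
    return centered if unbiased else centered / (rewards.std() + eps)
```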

GRPO · LLM evaluation · R1-Zero
20 min read
DataFunTalk
Mar 16, 2022 · Artificial Intelligence

Parameter-Efficient Sparsity Training for the PLUG Large-Scale Language Model

This article presents the PLUG 270‑billion‑parameter Chinese language model and introduces a parameter‑efficient sparsity training (PST) framework that combines unstructured and structured pruning with low‑rank decomposition to dramatically reduce model size while preserving downstream performance.

PLUG · Parameter-Efficient Training · deep learning
13 min read
Alimama Tech
Oct 29, 2025 · Artificial Intelligence

LLM Breakthroughs at EMNLP 2025: Embedding Compression, Complex Instructions, Knowledge Scaling

At EMNLP 2025 in Suzhou, Taobao's booth presents four AI papers that introduce a novel embedding-compression framework, an automatic iterative-refinement method for generating complex instructions, a knowledge-infusion scaling law for large language models, and a video-caption optimization approach for text-to-video generation.

embedding compression · instruction generation · knowledge infusion
7 min read
Ops Community
Jan 18, 2026 · Artificial Intelligence

How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching

This guide details how to replace native Transformers inference with the high‑performance vLLM engine, leveraging PagedAttention, continuous batching, tensor parallelism, and OpenAI‑compatible APIs to achieve 3‑4× higher throughput, lower latency, and scalable multi‑GPU deployments for production‑grade large language models.
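
For orientation, this is roughly what the vLLM Python API looks like for offline batched inference; the model name and parallelism settings below are placeholders, not the article's exact configuration:

```python
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching run inside the engine;
# tensor_parallel_size shards the model across multiple GPUs.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain continuous batching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

For serving, vLLM also ships an OpenAI-compatible HTTP server (for example via `vllm serve <model>`), so existing OpenAI client code can simply point at the local endpoint.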

Continuous batching · GPU Optimization · OpenAI API Compatibility
61 min read
SuanNi
Apr 21, 2026 · Artificial Intelligence

How Qwen3.6-35B-A3B Matches Dense Models with Only 3B Active Parameters

The article analyzes Qwen3.6-35B-A3B's MoE architecture, showing how a model that activates only about 3B parameters per token outperforms larger dense models across programming, agent, and multimodal benchmarks, and examines the flagship Qwen3.6-Max-Preview's substantial gains in world knowledge, instruction following, and third-party rankings.
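
"Active parameters" refers to the experts the MoE router actually selects for each token, so only a small slice of the full parameter count is exercised per forward pass. A toy top-k routing layer illustrating the idea (dimensions and expert counts are arbitrary, not Qwen's configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy mixture-of-experts FFN: only the top_k routed experts run per token."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)  # routing probabilities
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```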

AI evaluation · Mixture of Experts · Qwen
5 min read
PaperAgent
Feb 15, 2026 · Artificial Intelligence

How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.

LLM · Linear Attention · long context
17 min read
DataFunSummit
Dec 21, 2021 · Artificial Intelligence

Large‑Scale Pretrained Model Compression and Distillation: AdaBERT, L2A, and Meta‑KD

This talk presents Alibaba DAMO Academy’s recent work on compressing large pretrained language models, covering task‑adaptive AdaBERT, data‑augmented L2A, and meta‑knowledge distillation Meta‑KD, describing their motivations, architectures, NAS‑based search, loss designs, and experimental results across multiple NLP tasks.

NLP · Neural architecture search · knowledge distillation
13 min read
Alibaba Cloud Big Data AI Platform
Nov 6, 2024 · Artificial Intelligence

Unlocking Long-Text Video Understanding and LLM Distillation with Alibaba PAI

Alibaba Cloud's AI platform PAI had two papers accepted at EMNLP 2024: VideoCLIP-XL, which strengthens video-text representation for long descriptions using a large video/long-description dataset and new pre-training tasks, and TAPIR, a curriculum-planning framework that distills the instruction-following abilities of large language models; the associated models, datasets, and integration tools have also been released for users.

Distillation · EMNLP2024 · large-language-models
8 min read
Tencent Tech
Oct 27, 2025 · Artificial Intelligence

How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

SpecExit combines early‑exit and speculative decoding to let large reasoning models detect when they have almost finished thinking, trimming redundant chain‑of‑thought steps, reducing over‑thinking by 72% and achieving up to 2.5× faster end‑to‑end inference without noticeable accuracy loss.
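
As a mental model only, the early-exit side can be pictured as a decoding loop that watches a lightweight "done thinking" signal and closes the reasoning segment once it fires; the probe, threshold, and single-step API below are hypothetical stand-ins, not SpecExit's actual design:

```python
def generate_with_early_exit(model, probe, prompt_ids, end_think_id,
                             max_steps=2048, threshold=0.9):
    """Toy loop: stop extending the chain of thought once a probe on the
    hidden state judges the reasoning to be effectively finished."""
    ids = list(prompt_ids)
    thinking = True
    for _ in range(max_steps):
        next_id, hidden = model.step(ids)           # hypothetical one-token decode API
        if thinking and probe(hidden) > threshold:  # probe ~ P("done thinking")
            ids.append(end_think_id)                # force the end-of-thinking token
            thinking = False
            continue
        ids.append(next_id)
        if model.is_eos(next_id):                   # hypothetical EOS check
            break
    return ids
```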

AI · Inference Acceleration · Speculative Decoding
6 min read
Baobao Algorithm Notes
Aug 14, 2025 · Artificial Intelligence

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes the poor generalization of supervised fine‑tuning (SFT) for large language models, reveals its gradient as a high‑variance inverse‑probability policy gradient, proposes a one‑line Dynamic Fine‑Tuning correction, and shows substantial gains on challenging math and offline RL benchmarks.
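
Concretely, if the SFT gradient behaves like a policy gradient weighted by 1/p(token), the one-line fix amounts to re-weighting each token's loss by its own detached probability so that factor cancels. A minimal sketch of that correction, assuming per-token logits and target ids:

```python
import torch.nn.functional as F

def dynamic_ft_loss(logits, targets):
    """Token cross-entropy re-weighted by the stop-gradient token probability.

    logits: (batch, seq, vocab); targets: (batch, seq) token ids.
    The detached weight p(target) cancels the 1/p factor that makes plain
    SFT behave like a high-variance policy gradient.
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(target)
    weight = token_logp.detach().exp()                               # p(target), no grad
    return -(weight * token_logp).mean()
```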

Dynamic Fine-Tuning · Generalization · LLM alignment
7 min read
Tencent Cloud Developer
Mar 3, 2022 · Artificial Intelligence

Model Distillation for Query-Document Matching: Techniques and Optimizations

We applied knowledge distillation to a video query‑document BERT matcher, compressing the 12‑layer teacher into production‑ready 1‑layer ALBERT and tiny TextCNN students using combined soft, hard, and relevance losses plus AutoML‑tuned hyper‑parameters, achieving sub‑5 ms latency and up to 2.4% AUC improvement over the original model.
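
The soft/hard combination mentioned above follows the standard distillation recipe; a compact sketch (the relevance term and the article's exact loss weights are not reproduced here):

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft loss: KL between temperature-softened teacher and student outputs.
    Hard loss: ordinary cross-entropy against the ground-truth relevance labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```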

ALBERT · AutoML · BERT
12 min read
DaTaobao Tech
Sep 27, 2023 · Artificial Intelligence

FlashAttention-2: Efficient Attention Algorithm for Transformer Acceleration and AIGC Applications

FlashAttention-2 is an IO-aware exact attention algorithm that cuts GPU HBM traffic through tiling and recomputation, reduces non-matmul FLOPs, and improves sequence-parallelism and warp-level work distribution, delivering up to 2× speedup over FlashAttention and near-GEMM efficiency, and enabling longer-context Transformer training and inference for AIGC with negligible accuracy loss.
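
The key observation behind the tiling is that softmax can be computed tile by tile with running max and denominator statistics, so the full N×N score matrix never has to be materialized in HBM. A NumPy sketch of that online-softmax accumulation for a single head (it illustrates the algorithm, not the fused CUDA kernel):

```python
import numpy as np

def tiled_attention(q, k, v, block=128):
    """Exact attention over key/value tiles with running softmax statistics.

    q, k: (n, d); v: (n, d_v). Returns softmax(q k^T / sqrt(d)) v without
    ever forming the full (n, n) score matrix.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))
    m = np.full(n, -np.inf)                  # running max score per query
    l = np.zeros(n)                          # running softmax denominator per query
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale               # scores for this tile: (n, block)
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])       # tile probabilities under the new max
        rescale = np.exp(m - m_new)          # correct previously accumulated stats
        l = l * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ vb
        m = m_new
    return out / l[:, None]
```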

AIGC · FlashAttention-2 · GPU
20 min read
PaperAgent
Apr 26, 2026 · Artificial Intelligence

ICLR 2026 Outstanding Papers Reveal the Real Test for LLMs

The ICLR 2026 Outstanding Paper awards spotlight two studies—one proving Transformers are mathematically succinct and another showing that all major LLMs lose about 39% performance in multi‑turn conversations, exposing a reliability gap missed by single‑turn benchmarks.

AI benchmarks · ICLR 2026 · LLM evaluation
7 min read
Machine Heart
Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.
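
The first step replaces the softmax kernel with a feature-map kernel, which makes attention associative and therefore linear in sequence length. A minimal causal linear-attention sketch, using a simple elu+1 feature map as a stand-in for the learned Hedgehog map:

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # Stand-in for Hedgehog's learned feature map; any positive map works here.
    return F.elu(x) + 1.0

def causal_linear_attention(q, k, v):
    """O(n) causal attention: accumulate phi(k_t) v_t^T left to right.

    q, k: (n, d_k); v: (n, d_v).
    """
    q, k = feature_map(q), feature_map(k)
    kv = torch.zeros(k.shape[-1], v.shape[-1])   # running sum of phi(k_t) v_t^T
    z = torch.zeros(k.shape[-1])                 # running sum of phi(k_t)
    out = torch.empty_like(v)
    for t in range(q.shape[0]):
        kv = kv + torch.outer(k[t], v[t])
        z = z + k[t]
        out[t] = (q[t] @ kv) / (q[t] @ z + 1e-6)
    return out
```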

Artificial Intelligence · Cross-Architecture Distillation · Linear Attention
7 min read
DataFunTalk
Feb 28, 2025 · Artificial Intelligence

DeepSeek LLM Series (V1‑V3) and R1: Architecture, Training Strategies, Evaluation, and Distillation

An in-depth overview of the DeepSeek LLM series (V1-V3) and the R1 models, covering their architectures, scaling-law experiments, data pipelines, and training strategies (including MoE, MLA, FP8 training, multi-step learning-rate scheduling, and reinforcement learning), along with extensive evaluation results and knowledge-distillation techniques.

Mixture of Experts · scaling laws
36 min read
Meituan Technology Team
Aug 8, 2024 · Artificial Intelligence

Highlights of Meituan's ACL 2024 Papers: Speculative Decoding, Graph‑Structured Decoding, DolphCoder, and Instruction Fine‑tuning

Meituan showcases four ACL 2024 papers: Early-Exiting Speculative Decoding with a Thompson-sampling controller; Graph-Structured Speculative Decoding, which merges draft hypotheses in a DAG; DolphCoder, a code-generation LLM improved by diverse multi-objective instruction tuning; and a study of instruction fine-tuning finding that it mainly aligns existing knowledge. The team also invites attendees to booth 11 and to a live paper discussion on August 12.

ACL · Instruction Tuning · LLM
8 min read