Tagged articles
13 articles
Old Zhang's AI Learning
May 14, 2026 · Artificial Intelligence

Boost Qwen3.6 with MTP: 1.5× Faster Local Deployment for Claude Code

The article explains how to enable Multi‑Token Prediction (MTP) in Qwen3.6 using a specific llama.cpp PR, achieving up to 1.5× faster local inference. It covers compilation steps, optimal parameters, memory requirements, and how to integrate the accelerated model with Claude Code while avoiding common pitfalls.

Claude Code · LLM acceleration · MTP
0 likes · 11 min read
Lao Guo's Learning Space
May 7, 2026 · Artificial Intelligence

Gemma 4 MTP Deep Dive: Speculative Decoding & KV‑Cache Sharing for 3× Faster Inference

The article explains why large language model inference is bottlenecked by memory bandwidth, then details Google’s Gemma 4 MTP technique, which uses a small draft model with speculative decoding and a shared KV‑Cache to parallelize token prediction, achieving up to threefold speed gains without any loss in output quality. It also provides step‑by‑step local deployment instructions.

Gemma 4 · Inference Optimization · KV cache
0 likes · 11 min read
Old Zhang's AI Learning
Apr 18, 2026 · Artificial Intelligence

NVIDIA Nemotron 3 Super: 7× Faster Than Qwen3.5 – Inside Hybrid Mamba‑Attention, LatentMoE, and MTP

NVIDIA’s Nemotron 3 Super, a 120.6 B‑parameter flagship model supporting 1 M‑token context, combines Hybrid Mamba‑Attention, LatentMoE, and Multi‑Token Prediction to achieve up to 7.5× higher inference throughput than Qwen3.5 while matching or surpassing its accuracy across a range of benchmarks.

Hybrid Mamba-Attention · LatentMoE · MTP
0 likes · 11 min read
AI Insight Log
Dec 18, 2025 · Artificial Intelligence

Xiaomi’s New MiMo‑V2‑Flash LLM Rivals DeepSeek‑V3.2 and Near‑GPT‑5 High

Xiaomi’s MiMo‑V2‑Flash, a 309B‑parameter MoE LLM with only 15B active parameters, uses Hybrid SWA, Multi‑Token Prediction, and Multi‑Teacher On‑Policy Distillation to cut KV‑cache usage sixfold and boost inference speed 2.6×. Its performance is comparable to DeepSeek‑V3.2 and Kimi‑K2 and close to GPT‑5 High, including a 73.4% SWE‑Bench code‑agent score.

Hybrid SWA · MOPD · MTP
0 likes · 7 min read
Data Party THU
Sep 21, 2025 · Artificial Intelligence

Building a Mini‑DeepSeek‑V3: Transformer Block and MTP Implementation on Limited Compute

This article walks through the design and implementation of a Mini‑DeepSeek‑V3 language model, detailing how to assemble the core Transformer block, integrate Multi‑Token Prediction (MTP) modules, construct the overall architecture, and compute the combined loss—all using modest GPU resources and a single‑card or DDP training setup.

AI · DeepSeek · MTP
0 likes · 12 min read
Tencent Technical Engineering
Jul 11, 2025 · Artificial Intelligence

How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations

This article details the Angel‑HCF team's end‑to‑end DeepSeek inference optimizations—including PD separation, multi‑layer MTP, EP and DP parallelism, hardware‑aware kernels, and load‑balancing strategies—that boost throughput to over 15,800 tokens per second while keeping per‑token latency under 50 ms.

AI Performance · DeepSeek · GPU utilization
0 likes · 13 min read
Baidu Tech Salon
Mar 13, 2025 · Artificial Intelligence

How PaddlePaddle 3.0 Boosts Large‑Model Inference with 4‑Bit Quantization and MLA Optimizations

PaddlePaddle 3.0 introduces a full‑stack inference engine that supports FP8, INT8, and 4‑bit quantization for popular LLMs such as DeepSeek V3/R1, delivers up to 2× token throughput on a single H800 GPU, and provides detailed deployment scripts for single‑node and multi‑node setups, including MTP speculative decoding and SageAttention for long‑sequence acceleration.

Docker · Inference Optimization · MLA
0 likes · 13 min read
AI Algorithm Path
Feb 9, 2025 · Artificial Intelligence

Understanding Multi-Token Prediction in DeepSeek‑R1 Architecture

This article dissects the Multi‑Token Prediction (MTP) technique used in DeepSeek‑R1, contrasting it with traditional next‑token prediction, detailing Meta’s MTP design, DeepSeek’s adapted architecture, loss weighting, and why MTP is applied only during training to boost efficiency and model capability.

DeepSeek · MTP · Model architecture
0 likes · 9 min read
Baobao Algorithm Notes
Jan 15, 2025 · Artificial Intelligence

How Multi-Token Prediction Boosts LLM Training and Inference Efficiency

This article reviews the evolution of Multi‑Token Prediction (MTP) techniques—from early blockwise parallel decoding to Meta's and DeepSeek's implementations—explaining their architectures, training and inference workflows, and the speed‑up gains they offer for large language models.

DeepSeek · Inference Acceleration · LLM
0 likes · 20 min read
OPPO Kernel Craftsman
Feb 14, 2020 · Fundamentals

Comprehensive Guide to USB Protocol and Linux USB Driver Architecture

This guide thoroughly explains USB technology and its Linux implementation, covering fundamentals, transmission modes, descriptor structures, enumeration flow, gadget driver architecture with MTP details, and host driver mechanisms such as URBs, mouse and storage drivers, plus references for further study.

Linux driver development · MTP · NRZI encoding
0 likes · 12 min read
Ctrip Technology
Feb 7, 2018 · Mobile Development

Ctrip's Mobile Tech Platform (MTP) and Mobile Continuous Delivery (MCD): Design, Implementation, and Outcomes

In 2017, Ctrip reorganized its wireless engineering around a lifecycle‑driven, platform‑based approach, introducing the Mobile Tech Platform (MTP) and Mobile Continuous Delivery (MCD) platforms. Together they unified component services, development frameworks, and automated build‑release pipelines for more than 20 apps, dramatically improving efficiency and quality.

Continuous Delivery · Ctrip · MCD
0 likes · 9 min read