Tagged articles

multi-token prediction

15 articles · Page 1 of 1

Jun 17, 2026 · Artificial Intelligence

Local LLMs Viable: Sparse Attention, MoE, KV Compression, Multi‑Token Prediction

In early 2026, open‑source local large language models became practical alternatives to closed‑source models thanks to sparse attention, MoE routing, latent KV compression, multi‑token prediction, and 4‑bit quantization, while hardware memory shortages and remaining benchmark gaps continue to shape deployment choices.

4-bit quantization · KV compression · Mixture of Experts

0 likes · 13 min read

Data Party THU

May 29, 2026 · Artificial Intelligence

Token Superposition Training: 2.5× Faster LLM Pre‑training Without Model Changes

The article presents Token Superposition Training (TST), which temporarily averages embeddings of non‑overlapping token bags and predicts groups of tokens in a first phase before reverting to standard token‑wise prediction, achieving up to 2.5× pre‑training speedup on 10B‑1B MoE models without altering model architecture or inference.
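
The bag-averaging step in this summary can be pictured with a minimal sketch. It assumes plain Python lists for embeddings and a fixed bag size; the function names `average_token_bags` and `bag_targets` are hypothetical, and the sketch does not show how the article aligns each bag with its prediction target or how the loss is computed.

```python
from typing import List


def average_token_bags(embeddings: List[List[float]], bag_size: int) -> List[List[float]]:
    """Average each run of `bag_size` consecutive token embeddings into one vector.

    Illustrative only: a hypothetical helper, not the article's implementation.
    """
    if bag_size <= 0:
        raise ValueError("bag_size must be positive")
    if len(embeddings) % bag_size != 0:
        raise ValueError("sequence length must be a multiple of bag_size")
    bags = []
    for start in range(0, len(embeddings), bag_size):
        bag = embeddings[start:start + bag_size]
        # Element-wise mean over the tokens in this non-overlapping bag.
        bags.append([sum(dim) / bag_size for dim in zip(*bag)])
    return bags


def bag_targets(token_ids: List[int], bag_size: int) -> List[List[int]]:
    """Group token ids into the same non-overlapping bags (assumed target layout)."""
    return [token_ids[i:i + bag_size] for i in range(0, len(token_ids), bag_size)]


# Four 2-dimensional token embeddings collapsed into two bags of two tokens each.
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
bagged = average_token_bags(embeddings, bag_size=2)
groups = bag_targets([10, 11, 12, 13], bag_size=2)
```

With a bag size of 2, the model would see half as many input positions per sequence during the superposition phase, which is one way to read the reported speedup; the standard phase then returns to one embedding per token.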

LLM pretraining · MCE loss · Mixture of Experts

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

May 16, 2026 · Artificial Intelligence

Token Superposition Training Accelerates LLM Pre‑training 2.5× Without Changing Architecture

Token Superposition Training (TST) speeds up large‑language‑model pre‑training by up to 2.5× without altering model architecture or compute budget, using a superposition phase that averages token embeddings into bags and predicts groups of tokens, followed by a standard recovery phase, as demonstrated on 10B‑parameter MoE and smaller models.

LLM pretraining · MCE loss · MoE

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 14, 2026 · Artificial Intelligence

Boosting LLM Pre‑training 2.5× Without Architecture Changes or Extra Compute

Nous Research introduces Token Superposition Training, which groups tokens into bags, averages their embeddings, and predicts token groups without altering model architecture or adding compute, achieving up to 2.5× faster pre‑training while maintaining standard inference.

LLM pretraining · MCE loss · MoE

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

Feb 10, 2026 · Artificial Intelligence

Inside GLM-5: 745B Parameters, DeepSeek‑style Sparse Attention, and a 60% Stock Surge

The GLM-5 architecture, uncovered from a GitHub PR, doubles the previous model's size to 745B parameters, adopts DeepSeek‑V3 sparse attention and multi‑token prediction, features a 78‑layer MoE with 256 experts, and supports a 202K‑token context window. Its rumored test model, "Pony Alpha", sparked a 60% rise in Zhipu AI's stock amid a crowded AI release season.

AI Stock Impact · DeepSeek · GLM-5

0 likes · 6 min read

Old Zhang's AI Learning

Feb 3, 2026 · Artificial Intelligence

Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)

Step‑3.5‑Flash is a 196‑billion‑parameter open‑source LLM whose Mixture‑of‑Experts design activates only 11B parameters per token. It delivers more than 3× faster inference, matches top‑tier closed‑source models on SWE‑bench and other benchmarks, supports a 256K context, runs on consumer‑grade hardware, and is already integrated into vLLM, SGLang, and Claude Code, though it has known token‑efficiency and domain‑stability limitations.

LLM Benchmark · MoE · Step-3.5-Flash

0 likes · 11 min read

AI Algorithm Path

Sep 14, 2025 · Artificial Intelligence

Qwen3-Next: Achieving Unmatched Training and Inference Cost‑Effectiveness

Alibaba's Qwen team unveils Qwen3-Next, a Mixture‑of‑Experts LLM with 80B parameters but only 3B active, delivering training costs under one‑tenth of comparable dense models and more than ten‑fold inference throughput for long contexts, while matching or surpassing larger models on benchmark tasks.

AI · LLM · Qwen3-Next

0 likes · 9 min read

Baobao Algorithm Notes

Sep 10, 2025 · Artificial Intelligence

Qwen3-Next Unveiled: Sparse MoE, Hybrid Attention & Multi‑Token Prediction

A recent Hugging Face pull request reveals Alibaba’s upcoming Qwen3‑Next series, highlighting its extreme‑context, parameter‑efficient design that combines a 1:50 high‑sparsity MoE, a hybrid attention architecture mixing gated attention with Gated DeltaNet, and a Multi‑Token Prediction technique, promising ten‑fold throughput gains for 32K‑plus token contexts.

AI Architecture · Hybrid Attention · Qwen3-Next

0 likes · 8 min read

Tech Freedom Circle

Jul 17, 2025 · Artificial Intelligence

DeepSeek V3 Architecture Deep Dive: MoE, MLA, DualPipe, FP8 Mixed Precision & Multi‑Token Prediction

This article provides a detailed technical analysis of DeepSeek‑V3, covering its MoE architecture, the novel Multi‑head Latent Attention (MLA) mechanism, the DualPipe pipeline‑parallel algorithm, mixed‑precision FP8 training, and the Multi‑Token Prediction (MTP) inference improvements that together boost performance and efficiency.

DeepSeek · DualPipe · FP8

0 likes · 44 min read

AI Algorithm Path

Mar 26, 2025 · Artificial Intelligence

DeepSeek V3-0324 Upgrade Delivers Smarter Coding and Higher Code Quality

The DeepSeek V3-0324 model, released on March 24, 2025 with 685 billion parameters and a Mixture‑of‑Experts architecture, is fully open‑source on Hugging Face and brings notable upgrades in coding ability, structured responses, stability, generation length, and speed, while offering performance comparable to leading closed‑source models such as Claude 3.7.

AI code generation · DeepSeek · Mixture of Experts

0 likes · 10 min read

AntTech

Feb 27, 2025 · Artificial Intelligence

Entity Contrastive Learning via Multi-Token Parallel Prediction for Knowledge Graph Completion

Researchers from Ant Group and Zhejiang University propose K-ON, a multi-token parallel prediction method that enables large language models to perceive knowledge graph entities through entity-level contrastive learning, achieving superior performance, lower cost, and higher efficiency on KG completion benchmarks.

K-ON · Knowledge Graph · entity contrastive learning

0 likes · 8 min read

IT Architects Alliance

Feb 15, 2025 · Artificial Intelligence

DeepSeek: Architecture, Core Technologies, Training Strategies, and Comparative Analysis

The article provides an in‑depth overview of DeepSeek's transformer‑based foundation, Mixture‑of‑Experts architecture, novel attention mechanisms, multi‑token prediction, FP8 mixed‑precision training, knowledge distillation, reinforcement‑learning approaches, and compares its performance and cost advantages against leading models such as GPT and Gemini.

AI model architecture · DeepSeek · FP8 training

0 likes · 29 min read

AI Algorithm Path

Feb 9, 2025 · Artificial Intelligence

Understanding Multi-Token Prediction in DeepSeek‑R1 Architecture

This article dissects the Multi‑Token Prediction (MTP) technique used in DeepSeek‑R1, contrasting it with traditional next‑token prediction, detailing Meta’s MTP design, DeepSeek’s adapted architecture, loss weighting, and why MTP is applied only during training to boost efficiency and model capability.

DeepSeek · MTP · Transformer

0 likes · 9 min read

AI2ML AI to Machine Learning

Feb 5, 2025 · Artificial Intelligence

What Optimizations Power DeepSeek’s High‑Efficiency LLMs?

The article enumerates DeepSeek’s extensive technical optimizations—including Grouped Query Attention, Multi‑head Latent Attention, Mixture‑of‑Experts, 4D parallelism, quantization, and multi‑token prediction—that together enable cheap, high‑performance large language models.

4D parallelism · DeepSeek · Grouped Query Attention

0 likes · 8 min read

Baobao Algorithm Notes

Jan 15, 2025 · Artificial Intelligence

How Multi-Token Prediction Boosts LLM Training and Inference Efficiency

This article reviews the evolution of Multi‑Token Prediction (MTP) techniques—from early blockwise parallel decoding to Meta's and DeepSeek's implementations—explaining their architectures, training and inference workflows, and the speed‑up gains they offer for large language models.

DeepSeek · LLM · MTP

0 likes · 20 min read
