How DeepSeek‑V3.2‑Exp Achieves Fast Distributed LLM Inference with FP8 and MoE

This article walks through the DeepSeek‑V3.2‑Exp inference codebase, detailing its MoE architecture, Multi‑Head Latent Attention, FP8 quantization, custom CUDA kernels, and 8‑GPU NCCL‑based distributed execution from initialization through prefill and decode stages.


Core Features

Model weight conversion (inference/convert.py): converts HuggingFace checkpoints into the sharded format required for distributed inference, slicing expert weights across GPUs according to the model-parallelism (mp) setting.

Text generation (inference/generate.py): provides interactive and batch generation modes and supports multi-GPU distributed inference, using torch.distributed for inter-process communication.

Optimized kernels (inference/kernel.py): implements FP8 quantization and accelerated matrix-multiply kernels to boost inference speed.

System Awakens

The user types a Chinese query, e.g. "Is the fate of Xue Baoqin in Dream of the Red Chamber hinted at early on?" The system launches eight GPU processes with the torchrun command; each process receives a unique rank and joins an NCCL process group for communication with its peers.

The default tensor dtype is set to bfloat16, balancing speed and accuracy. The model configuration is loaded from a JSON file, and the 671-billion-parameter weights, stored as sharded safetensors files, are partitioned across the eight GPUs. The terminal greets the user with "I'm DeepSeek 👋".
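A launch command such as torchrun --nproc_per_node=8 generate.py ... spawns the eight processes. The sketch below (variable names are illustrative, not the exact generate.py code) shows the per-process bootstrapping described above: read the rank that torchrun puts in the environment, join an NCCL process group, pin the process to one GPU, and make bfloat16 the default dtype.

```python
import os

import torch
import torch.distributed as dist

world_size = int(os.getenv("WORLD_SIZE", "1"))
rank = int(os.getenv("RANK", "0"))
local_rank = int(os.getenv("LOCAL_RANK", "0"))

if world_size > 1:
    dist.init_process_group(backend="nccl")  # NCCL backend for GPU collectives
torch.cuda.set_device(local_rank)            # one GPU per process
torch.set_default_dtype(torch.bfloat16)      # bf16 balances speed and accuracy
```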

Act One: Input Journey

Rank 0 receives the user question and broadcasts it to the other seven processes using dist.broadcast_object_list(), ensuring all GPUs work on the same prompt.
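A minimal sketch of that hand-off, assuming the process group from the setup above (the prompt-reading line is illustrative): broadcast_object_list pickles the object on the source rank and unpickles it everywhere else, so all eight ranks see the identical string.

```python
import torch.distributed as dist

# Rank 0 reads the prompt; every other rank passes a placeholder that the
# broadcast will overwrite.
objects = [input(">>> ")] if dist.get_rank() == 0 else [None]
dist.broadcast_object_list(objects, src=0)   # pickle on rank 0, unpickle on ranks 1-7
prompt = objects[0]                          # identical string on all eight GPUs
```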

The prompt is appended to the messages list, then the tokenizer converts the Chinese text into a sequence of token IDs. The apply_chat_template() method formats the dialogue and adds special generation tokens.
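A hedged sketch of that formatting step with the Hugging Face tokenizer API (the checkpoint path is a placeholder): apply_chat_template wraps the dialogue in the model's special tokens, and add_generation_prompt=True appends the marker after which the model is expected to answer.

```python
from transformers import AutoTokenizer

# Placeholder path; load whichever tokenizer ships with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("/path/to/DeepSeek-V3.2-Exp")
messages = [{"role": "user",
             "content": "Is the fate of Xue Baoqin in Dream of the Red Chamber hinted at early on?"}]
# tokenize=True returns the token-ID list that feeds the prefill pass.
prompt_tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)
```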

Act Two: Prefill Stage – Understanding the Prompt

The generate() function takes control, allocating a large token tensor that acts as a blank canvas.

The first forward pass ("prefill") feeds the entire prompt into the model. Tokens pass through a ParallelEmbedding layer, becoming high‑dimensional vectors, after which RoPE positional encodings are applied.
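A minimal sketch of that prefill setup, under an assumed model signature (model(token_slice, start_pos=...)) and illustrative names: the pre-allocated tensor is the "blank canvas", and one forward pass over the whole prompt produces logits while filling the per-layer KV cache.

```python
import torch

@torch.inference_mode()
def prefill_sketch(model, prompt_ids: list[int], max_new_tokens: int, pad_id: int):
    total_len = len(prompt_ids) + max_new_tokens
    # Blank canvas: one row per sequence, padded out to the maximum length.
    tokens = torch.full((1, total_len), pad_id, dtype=torch.long, device="cuda")
    tokens[0, : len(prompt_ids)] = torch.tensor(prompt_ids, device="cuda")
    # One pass over the entire prompt; this also populates the KV cache.
    logits = model(tokens[:, : len(prompt_ids)], start_pos=0)
    return tokens, logits
```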

MLA Attention Magic

Each of the 61 Transformer layers runs the Multi‑Head Latent Attention (MLA) mechanism. After linear projection, queries and keys are sent to act_quant(), which quantizes bfloat16 tensors to FP8, computing a scaling factor for every group of 128 elements.
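The real act_quant is a Triton kernel; the sketch below only reproduces the arithmetic it performs, with illustrative names. Each group of 128 elements gets its own scale so that the group's largest magnitude maps to the FP8 e4m3 maximum of 448.

```python
import torch

def act_quant_sketch(x: torch.Tensor, block_size: int = 128):
    assert x.shape[-1] % block_size == 0
    groups = x.contiguous().view(*x.shape[:-1], -1, block_size).float()
    # One scale per 128-element group; clamp avoids division by zero for all-zero groups.
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12) / 448.0
    q = (groups / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    return q.view_as(x), scale.squeeze(-1)   # FP8 payload plus per-group scales
```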

The quantized queries and keys are processed by the fp8_index() kernel, which uses a two‑level blocking strategy (512 → 128) to compute attention scores in shared memory, applying a ReLU and the scaling factor.

Subsequently, the fp8_gemm() kernel performs FP8 matrix multiplication with a 32×128×128 block configuration, a four‑stage pipeline, and swizzling to improve L2 cache utilization. The resulting keys and values are stored in a KV cache for the upcoming decode stage.
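As a mental model of what that kernel computes (the tiling, pipelining, and swizzling are performance details), here is a naive dequantize-then-matmul reference with illustrative shapes: activations carry one scale per 1×128 group, weights one scale per 128×128 block.

```python
import torch

def fp8_gemm_reference(a_q, a_s, b_q, b_s, block: int = 128):
    # a_q: (M, K) float8 activations, a_s: (M, K//block) scales
    # b_q: (N, K) float8 weights,     b_s: (N//block, K//block) scales
    a = a_q.float() * a_s.repeat_interleave(block, dim=1)
    b = b_q.float() * b_s.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return a @ b.t()   # (M, N), accumulated in full precision
```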

The Wisdom of the MoE Experts

After attention, the Mixture‑of‑Experts (MoE) feed‑forward network activates. The Gate layer selects the top‑k experts for each token. Chosen experts receive FP8‑quantized inputs and compute via the optimized GEMM kernel. Across the distributed setup, dist.all_reduce() aggregates expert outputs from all GPUs.
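A hedged sketch of that routing and reduction (an illustrative simplification, not the real Gate and Expert classes, which add bias terms and grouped routing): each rank owns a slice of the experts, runs only those, and dist.all_reduce sums the partial outputs so every GPU ends up with the full MoE result.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def moe_forward_sketch(x, gate_weight, local_experts: dict, topk: int = 8):
    scores = F.softmax(x @ gate_weight.t(), dim=-1)      # (tokens, n_experts) gate scores
    weights, idx = scores.topk(topk, dim=-1)             # top-k experts per token
    y = torch.zeros_like(x)
    for e, expert in local_experts.items():              # only experts hosted on this rank
        rows, slots = (idx == e).nonzero(as_tuple=True)
        if rows.numel():
            y[rows] += weights[rows, slots, None] * expert(x[rows])
    if dist.is_initialized():
        dist.all_reduce(y)                               # sum partial outputs across GPUs
    return y
```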

Finally, RMSNorm normalization and the lm_head linear layer produce logits over the vocabulary.
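For completeness, a minimal RMSNorm sketch (illustrative): divide by the root-mean-square of the hidden vector and apply a learned scale; the lm_head is then an ordinary linear projection from hidden size to vocabulary size.

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    rms = x.float().pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return (x.float() * rms).type_as(x) * weight
```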

Act Three: Decode Stage – Token‑by‑Token Answer Generation

Having understood the prompt, the model now generates the answer autoregressively. Starting from the end of the prompt, each iteration processes only the new position, reusing the KV cache so that only the current query is computed.
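A hedged sketch of that loop, reusing the assumed model(token_slice, start_pos=...) signature from the prefill sketch: the first iteration covers the remaining prompt, and every later iteration feeds only the single newest position, because the KV cache already holds everything before it.

```python
import torch

@torch.inference_mode()
def decode_sketch(model, sample_fn, tokens, prompt_len: int, eos_id: int):
    prev_pos, total_len = 0, tokens.size(1)
    for cur_pos in range(prompt_len, total_len):
        logits = model(tokens[:, prev_pos:cur_pos], start_pos=prev_pos)
        next_tok = sample_fn(logits)              # choose the next token
        tokens[:, cur_pos] = next_tok
        prev_pos = cur_pos                        # from now on, one new position per step
        if (next_tok == eos_id).all():            # stop once every sequence emitted EOS
            break
    return tokens
```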

The logits are passed to sample(), which first scales them by a temperature parameter, computes a softmax distribution, and then applies the Gumbel‑Max trick to sample the next token, balancing diversity and coherence.
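A minimal sketch of that sampler (illustrative name): dividing the softmax probabilities by i.i.d. Exponential(1) noise and taking the argmax is the Gumbel-Max trick in disguise, drawing each token with probability proportional to its softmax weight.

```python
import torch

def sample_sketch(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    logits = logits / max(temperature, 1e-5)          # temperature sharpens or flattens the distribution
    probs = torch.softmax(logits, dim=-1)
    noise = torch.empty_like(probs).exponential_(1)   # Exponential(1) noise per candidate token
    return (probs / noise).argmax(dim=-1)             # equivalent to Gumbel-Max sampling
```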

The loop checks for an EOS token or a maximum length limit to terminate gracefully.

Epilogue: The Answer Is Born

The generated token sequence is fed to the tokenizer's decode() method, which converts the token IDs back into fluent Chinese text. The answer appears on the terminal, offering a close reading of Xue Baoqin's fate, and is appended to the dialogue history for the next round.

Summary

Distributed Parallelism: 8-GPU collaboration via NCCL.

FP8 Quantization: act_quant(), fp8_gemm(), and fp8_index() kernels accelerate computation.

KV Cache: prefill and decode stages reuse cached keys/values for speed.

MLA + MoE Architecture: a 61-layer Transformer with per-layer MLA and Mixture-of-Experts feed-forward networks.

Temperature Sampling: the Gumbel-Max technique balances quality and diversity.

This narrative maps every architectural detail to concrete code implementations, illustrating how a 671‑billion‑parameter model runs efficiently on a GPU cluster.

Tags: LLM, Distributed inference, Mixture of Experts, CUDA, PyTorch, MLA, FP8 quantization
Written by BirdNest Tech Talk, author of the rpcx microservice framework, book author, and chair of Baidu's Go CMC committee.