Accelerate DeepSeek‑V2‑Lite Deployment with FlashMLA: A Step‑by‑Step Guide

This tutorial walks users through installing FlashMLA, integrating it with the vLLM framework, downloading the DeepSeek‑V2‑Lite‑Chat model, benchmarking various MLA implementations, and running a local inference demo that shows FlashMLA’s speed advantage on long‑sequence generation.

Alibaba Cloud Big Data AI Platform

Feb 25, 2025

On February 24, 2025, DeepSeek‑AI open‑sourced FlashMLA, an efficient Multi‑head Latent Attention (MLA) decoding kernel for Hopper GPUs. It speeds up large language model inference, particularly when processing long sequences.

Preparation

Access the "Experience FlashMLA Accelerated DeepSeek‑V2‑Lite Deployment" notebook on the PAI‑Notebook Gallery, open it in PAI‑DSW, and select a Hopper‑compatible environment (e.g., ecs.gn8v.4xlarge) with the recommended Docker image modelscope:1.23.1-pytorch2.5.1-gpu-py310-cu124-ubuntu22.04.
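
Before installing anything, it can help to confirm that the instance really exposes a Hopper GPU. The following is a minimal sketch, assuming that "Hopper‑compatible" corresponds to CUDA compute capability 9.x and that the PyTorch build from the recommended image is available. The helper names is_hopper and describe_gpu are illustrative and not part of FlashMLA.

```python
def is_hopper(capability):
    """Return True when a (major, minor) compute capability is SM 9.x (assumed to mean Hopper)."""
    major, _minor = capability
    return major == 9


def describe_gpu():
    """Report the first CUDA device, or explain why it cannot be inspected."""
    try:
        import torch  # provided by the recommended image
    except ImportError:
        return "PyTorch is not installed; cannot inspect the GPU."
    if not torch.cuda.is_available():
        return "No CUDA device is visible."
    capability = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    verdict = "looks Hopper-class" if is_hopper(capability) else "is not SM 9.x"
    return f"{name} (compute capability {capability[0]}.{capability[1]}) {verdict}."


print(describe_gpu())
```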

1. Install FlashMLA

Clone the repository and install the package:

!git clone https://github.com/deepseek-ai/FlashMLA.git
!cd FlashMLA && python setup.py develop

If cloning fails, install from the cached archive:

# If the previous cell succeeded, this cell can be skipped
!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/FlashMLA.tgz && tar -xvzf FlashMLA.tgz
!cd FlashMLA && python setup.py develop

Install additional dependencies required by the benchmark and vLLM integration:

# Install the vLLM wheel that contains FlashMLA fixes
!pip install https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
# Install flashinfer for MLA performance comparison
!pip install https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/flashinfer_python-0.2.2-py3-none-any.whl
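
After the installs finish, a quick check that the packages are visible to the interpreter can save time later. This is a small sketch that assumes the top‑level module names are flash_mla, vllm, and flashinfer; the helper missing_modules is hypothetical and only looks the modules up without importing them.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from the packages installed above.
required = ["flash_mla", "vllm", "flashinfer"]
missing = missing_modules(required)
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All required modules were found.")
```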

2. Apply FlashMLA in vLLM

At the time of writing, the latest vLLM release (v0.7.3) does not expose FlashMLA as a selectable attention backend, so a patched version is provided. Download and extract the patch, then copy it over the installed vLLM files:

!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/vllm_patch.tar && tar -xvf vllm_patch.tar
!cp -r vllm-patch/vllm/* /usr/local/lib/python3.10/site-packages/vllm/
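
The cp command above hard‑codes the site‑packages path of the recommended image. For other environments, a hedged alternative is to look up where vLLM is installed and overlay the patch there. The sketch below assumes the archive extracts to vllm-patch/vllm as in the command above; package_dir and overlay are illustrative helper names.

```python
import importlib.util
import shutil
from pathlib import Path


def package_dir(name):
    """Return the directory of an installed package without importing it, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def overlay(src, dst):
    """Copy every file under src on top of dst, keeping files that src does not contain."""
    shutil.copytree(src, dst, dirs_exist_ok=True)


target = package_dir("vllm")
patch_root = Path("vllm-patch") / "vllm"  # layout taken from the cp command above
if target is not None and patch_root.is_dir():
    overlay(patch_root, target)
    print(f"Patched {target}")
else:
    print("vLLM or the extracted patch was not found; nothing copied.")
```

Backing up the original vllm directory before overlaying makes it easier to roll back if the patch does not match the installed version.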

In the vLLM source, a new backend module implements FlashMLAImpl and FlashMLAMetadataBuilder, which call the underlying flash_mla_with_kvcache function. The cuda.py platform file selects this backend when the environment variable VLLM_ATTENTION_BACKEND=FLASHMLA is set and the hardware meets the requirements.
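
The selection rule described above can be pictured with a small stand‑alone function. This is an illustrative sketch only, not vLLM's actual cuda.py: the function name choose_backend, the fallback label "TRITON_MLA", and the reading of "meets the requirements" as compute capability 9.x are all assumptions.

```python
def choose_backend(env, capability, fallback="TRITON_MLA"):
    """Pick an attention backend name from environment variables and GPU capability.

    env        : mapping such as os.environ
    capability : (major, minor) CUDA compute capability
    fallback   : hypothetical name used when FlashMLA cannot be selected
    """
    requested = env.get("VLLM_ATTENTION_BACKEND", "")
    hardware_ok = capability[0] == 9  # assumption: the requirement is a Hopper GPU
    if requested == "FLASHMLA" and hardware_ok:
        return "FLASHMLA"
    return fallback


print(choose_backend({"VLLM_ATTENTION_BACKEND": "FLASHMLA"}, (9, 0)))
print(choose_backend({"VLLM_ATTENTION_BACKEND": "FLASHMLA"}, (8, 0)))
print(choose_backend({}, (9, 0)))
```

The point of the sketch is that the environment variable alone is not sufficient: on unsupported hardware the request falls through to another backend, which is why checking the startup log is worthwhile.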

3. Download the Model

FlashMLA is effective for models that use MLA, such as DeepSeek‑V2‑Lite‑Chat. Download the model weights (or obtain them from ModelScope) and extract them:

import os
dsw_region = os.environ.get("dsw_region")
url_link = {"cn-shanghai": "https://atp-modelzoo-sh.oss-cn-shanghai-internal.aliyuncs.com/release/tutorials/flashmla/DeepSeek_v2_lite_chat.tar"}
path = url_link.get(dsw_region, "https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/DeepSeek_v2_lite_chat.tar")
os.environ['LINK_CHAT'] = path
!wget $LINK_CHAT
!tar -xvf DeepSeek_v2_lite_chat.tar
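
Since FlashMLA only helps models that use MLA, a quick sanity check on the extracted checkpoint can be useful. The sketch below assumes the archive unpacks to a DeepSeek-V2-Lite-Chat directory containing a Hugging Face style config.json, and treats a non‑null kv_lora_rank field as the sign of MLA. That heuristic and the helper name uses_mla are assumptions, not something this tutorial specifies.

```python
import json
from pathlib import Path


def uses_mla(config):
    """Heuristic: treat a non-null kv_lora_rank as evidence of latent (MLA) attention."""
    return config.get("kv_lora_rank") is not None


config_path = Path("DeepSeek-V2-Lite-Chat") / "config.json"  # assumed extraction layout
if config_path.is_file():
    config = json.loads(config_path.read_text(encoding="utf-8"))
    print("model_type:", config.get("model_type"))
    print("uses MLA:", uses_mla(config))
else:
    print(f"{config_path} not found; extract the archive first.")
```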

4. Benchmark Different MLA Implementations

Use FlashMLA’s built‑in benchmark to compare the forward‑pass performance of several MLA backends (torch, flash_mla, flash_infer, flash_mla_triton). The script reports achieved bandwidth (GB/s) across sequence lengths; on the tested hardware, FlashMLA reaches roughly 16% higher bandwidth than flash_infer.

import sys, os
sys.path.append(os.path.join(os.getcwd(), 'FlashMLA'))
from benchmark.bench_flash_mla import *
import matplotlib.pyplot as plt, pandas as pd
# ... (benchmark code omitted for brevity) ...
plt.title('bandwidth')
plt.xlabel('seqlen')
plt.ylabel('bw (GB/s)')
plt.show()
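
To make the bandwidth metric concrete, the sketch below estimates the bytes moved by one decode step and divides by the measured time. The byte‑count model (read the KV cache once, read the queries, write the outputs) and every shape and timing value in the example call are assumptions chosen for illustration; they are not taken from the omitted benchmark code.

```python
def decode_bytes(batch, seqlen, h_q, h_kv, d, dv, s_q=1, bytes_per_elem=2):
    """Estimate bytes moved by one decode step under a simple read-once model."""
    kv_cache = batch * seqlen * h_kv * d   # latent KV cache read once
    queries = batch * s_q * h_q * d        # query tensor read
    outputs = batch * s_q * h_q * dv       # attention output written
    return (kv_cache + queries + outputs) * bytes_per_elem


def bandwidth_gb_per_s(num_bytes, milliseconds):
    """Convert bytes moved in a given time to GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (milliseconds * 1e-3) / 1e9


def relative_gain_percent(new, baseline):
    """Percentage by which new exceeds baseline."""
    return (new / baseline - 1.0) * 100.0


# Hypothetical shapes and timing, for illustration only.
moved = decode_bytes(batch=128, seqlen=4096, h_q=16, h_kv=1, d=576, dv=512)
print(f"{bandwidth_gb_per_s(moved, milliseconds=0.5):.1f} GB/s")
```

Under this model the KV cache term dominates at long sequence lengths, which is why decode kernels are usually compared by achieved bandwidth instead of FLOPs.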

5. Local Deployment Demo

Run a short script that generates a quicksort implementation using the FlashMLA‑enabled model. The log confirms the backend selection (e.g., [cuda.py:173] Using FlashMLA backend.). On an ecs.gn8v.4xlarge instance the model generates 515 tokens in 16.64 seconds, compared with 527 tokens in 17.97 seconds using the Triton MLA backend.

# Select the FlashMLA backend before the engine is built
import os
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHMLA'
# Build the LLM from the extracted model directory
from vllm import LLM, SamplingParams
model_name = "DeepSeek-V2-Lite-Chat"
llm = LLM(model=model_name, tensor_parallel_size=1, max_model_len=8192, trust_remote_code=True, enforce_eager=True, block_size=64)
# Tokenize the chat prompt and run inference
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.3, max_tokens=2048, stop_token_ids=[tokenizer.eos_token_id])
prompt = tokenizer.apply_chat_template([{"role": "user", "content": "Write a piece of quicksort code in C++."}], add_generation_prompt=True)
outputs = llm.generate(prompt_token_ids=[prompt], sampling_params=sampling_params)
# Print the generated text of the first (and only) request
print(outputs[0].outputs[0].text)

The generated C++ code correctly implements the QuickSort algorithm, demonstrating that FlashMLA can accelerate real‑world generation tasks.
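
The timings reported for the ecs.gn8v.4xlarge run can be turned into throughput figures for a like‑for‑like comparison. This is a small arithmetic sketch that uses only the token counts and durations quoted in this section; the helper names are illustrative.

```python
def tokens_per_second(tokens, seconds):
    """Generation throughput in tokens per second."""
    return tokens / seconds


def speedup_percent(candidate, baseline):
    """Percentage by which candidate throughput exceeds baseline throughput."""
    return (candidate / baseline - 1.0) * 100.0


flashmla = tokens_per_second(515, 16.64)   # FlashMLA run
triton = tokens_per_second(527, 17.97)     # Triton MLA run
print(f"FlashMLA:   {flashmla:.2f} tokens/s")
print(f"Triton MLA: {triton:.2f} tokens/s")
print(f"Throughput gain: {speedup_percent(flashmla, triton):.1f}%")
```

Because the two runs produced different numbers of tokens, comparing tokens per second is fairer than comparing wall‑clock time alone.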
