Accelerate DeepSeek‑V2‑Lite Deployment with FlashMLA: A Step‑by‑Step Guide
This tutorial walks users through installing FlashMLA, integrating it with the vLLM framework, downloading the DeepSeek‑V2‑Lite‑Chat model, benchmarking various MLA implementations, and running a local inference demo that shows FlashMLA’s speed advantage on long‑sequence generation.
On February 24, 2025, DeepSeek‑AI open‑sourced FlashMLA, an efficient multi‑head latent attention (MLA) decoding kernel for Hopper GPUs. It is optimized for inference and speeds up long‑sequence processing in large language models.
Preparation
Access the "Experience FlashMLA Accelerated DeepSeek‑V2‑Lite Deployment" notebook on the PAI‑Notebook Gallery, open it in PAI‑DSW, and select a Hopper‑compatible environment (e.g., ecs.gn8v.4xlarge) with the recommended Docker image modelscope:1.23.1-pytorch2.5.1-gpu-py310-cu124-ubuntu22.04.
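Because FlashMLA targets Hopper GPUs, it can help to confirm the hardware before installing anything. The following is a minimal sketch, not part of the original notebook: the helper names `is_hopper` and `detect_capability` are hypothetical, and the check assumes that Hopper devices report compute capability 9.x through PyTorch.

```python
def is_hopper(capability):
    """Return True when a (major, minor) compute capability is Hopper (SM 9.x)."""
    major, _minor = capability
    return major == 9


def detect_capability():
    """Return the first GPU's (major, minor) capability, or None if unavailable."""
    try:
        import torch  # imported lazily so the helper also loads on CPU-only machines
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device visible; FlashMLA cannot run here.")
    elif is_hopper(cap):
        print(f"Compute capability {cap}: Hopper GPU detected.")
    else:
        print(f"Compute capability {cap}: not a Hopper GPU.")
```

If the check reports a non‑Hopper device, switching the DSW instance type before continuing avoids a failed kernel build later.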
1. Install FlashMLA
Clone the repository and install the package:
!git clone https://github.com/deepseek-ai/FlashMLA.git
!cd FlashMLA && python setup.py develop
If cloning fails, install from the cached archive:
# If the previous cell succeeded, this cell can be skipped
!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/FlashMLA.tgz && tar -xvzf FlashMLA.tgz
!cd FlashMLA && python setup.py develop
Install additional dependencies required by the benchmark and vLLM integration:
# Install the vLLM wheel that contains FlashMLA fixes
!pip install https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
# Install flashinfer for MLA performance comparison
!pip install https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/flashinfer_python-0.2.2-py3-none-any.whl
2. Apply FlashMLA in vLLM
At the time of writing, the latest vLLM release (v0.7.3) does not expose FlashMLA as a selectable backend, so a patched version is provided. Download and extract the patch, then copy it over the installed vLLM files:
!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/vllm_patch.tar && tar -xvf vllm_patch.tar
!cp -r vllm-patch/vllm/* /usr/local/lib/python3.10/site-packages/vllm/
In the vLLM source, a new backend module implements FlashMLAImpl and FlashMLAMetadataBuilder, which call the underlying flash_mla_with_kvcache function. The cuda.py platform file selects this backend when the environment variable VLLM_ATTENTION_BACKEND=FLASHMLA is set and the hardware meets the requirements.
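Before moving on, a quick preflight can confirm that the installed packages are visible to Python and that the backend variable has been set. This is a rough sketch rather than anything the patch itself provides: `missing_modules` and `backend_requested` are hypothetical helper names, and the import names `flash_mla`, `vllm`, and `flashinfer` are assumed to match the packages installed above.

```python
import importlib.util
import os


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def backend_requested(environ=os.environ):
    """Return True when VLLM_ATTENTION_BACKEND is set to FLASHMLA."""
    return environ.get("VLLM_ATTENTION_BACKEND") == "FLASHMLA"


if __name__ == "__main__":
    # Assumed import names for the packages installed in step 1.
    missing = missing_modules(["flash_mla", "vllm", "flashinfer"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All expected packages are importable.")
    if not backend_requested():
        print("VLLM_ATTENTION_BACKEND is not FLASHMLA; set it before importing vllm.")
```

The check only confirms that the modules can be located. It does not verify that the patched files were copied or that the kernel compiled for the local GPU.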
3. Download the Model
FlashMLA is effective for models that use MLA, such as DeepSeek‑V2‑Lite‑Chat. Download the model weights (or obtain them from ModelScope) and extract them:
import os
dsw_region = os.environ.get("dsw_region")
url_link = {"cn-shanghai": "https://atp-modelzoo-sh.oss-cn-shanghai-internal.aliyuncs.com/release/tutorials/flashmla/DeepSeek_v2_lite_chat.tar"}
path = url_link.get(dsw_region, "https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/DeepSeek_v2_lite_chat.tar")
os.environ['LINK_CHAT'] = path
!wget $LINK_CHAT
!tar -xvf DeepSeek_v2_lite_chat.tar
4. Benchmark Different MLA Implementations
Use FlashMLA’s built‑in benchmark to compare the forward‑pass performance of several MLA backends (torch, flash_mla, flash_infer, flash_mla_triton). The script measures bandwidth (GB/s) across sequence lengths; on the tested hardware, FlashMLA shows roughly a 16% improvement over flash_infer.
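For readers unfamiliar with the metric, a bandwidth figure in GB/s is simply the number of bytes moved divided by the elapsed time. The sketch below illustrates that conversion and how a relative improvement between two backends could be computed. Both helper names are hypothetical, this is not the benchmark script's own code, and the numbers in the example are made up purely to exercise the functions.

```python
def bandwidth_gb_per_s(bytes_moved, elapsed_ms):
    """Convert bytes transferred and elapsed milliseconds to GB/s (1 GB = 1e9 bytes)."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return bytes_moved / 1e9 / (elapsed_ms / 1e3)


def relative_gain_pct(candidate, baseline):
    """Return how much larger candidate is than baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Illustrative values only, not measurements from the article.
    bw = bandwidth_gb_per_s(bytes_moved=2_000_000_000, elapsed_ms=1.0)
    print(f"bandwidth: {bw:.1f} GB/s")
    print(f"gain: {relative_gain_pct(116.0, 100.0):.1f}%")
```

The notebook's benchmark cell follows.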
import sys, os
sys.path.append(os.path.join(os.getcwd(), 'FlashMLA'))
from benchmark.bench_flash_mla import *
import matplotlib.pyplot as plt, pandas as pd
# ... (benchmark code omitted for brevity) ...
plt.title('bandwidth')
plt.xlabel('seqlen')
plt.ylabel('bw (GB/s)')
plt.show()
5. Local Deployment Demo
Run a short script that generates a quicksort implementation using the FlashMLA‑enabled model. The log confirms the backend selection (e.g., [cuda.py:173] Using FlashMLA backend.). On an ecs.gn8v.4xlarge instance the model generates 515 tokens in 16.64 seconds, compared with 527 tokens in 17.97 seconds using the Triton MLA backend.
# Set backend
import os
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHMLA'
# Build LLM
from vllm import LLM, SamplingParams
model_name = "DeepSeek-V2-Lite-Chat"
llm = LLM(model=model_name, tensor_parallel_size=1, max_model_len=8192, trust_remote_code=True, enforce_eager=True, block_size=64)
# Warm‑up and inference
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.3, max_tokens=2048, stop_token_ids=[tokenizer.eos_token_id])
prompt = tokenizer.apply_chat_template([{"role": "user", "content": "Write a piece of quicksort code in C++."}], add_generation_prompt=True)
outputs = llm.generate(prompt_token_ids=[prompt], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
The generated C++ code correctly implements the QuickSort algorithm, demonstrating that FlashMLA can accelerate real‑world generation tasks.
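Because the two demo runs produced different token counts, the timings quoted for this section are easier to compare as throughput. The short calculation below uses only the figures reported above (515 tokens in 16.64 seconds for FlashMLA, 527 tokens in 17.97 seconds for Triton MLA); the function name is illustrative.

```python
def tokens_per_second(tokens, seconds):
    """Return generation throughput in tokens per second."""
    return tokens / seconds


# Figures quoted in the article for the ecs.gn8v.4xlarge demo runs.
flash_mla_tps = tokens_per_second(515, 16.64)
triton_mla_tps = tokens_per_second(527, 17.97)

# Relative throughput gain of FlashMLA over Triton MLA, in percent.
speedup_pct = (flash_mla_tps / triton_mla_tps - 1.0) * 100.0

print(f"FlashMLA:   {flash_mla_tps:.2f} tokens/s")
print(f"Triton MLA: {triton_mla_tps:.2f} tokens/s")
print(f"Relative throughput gain: {speedup_pct:.1f}%")
```

This works out to roughly 30.9 versus 29.3 tokens per second, a gain of about 5.5% in this single run. A one‑off measurement like this is indicative only; repeated runs at longer sequence lengths would give a more reliable picture.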
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
