Deploy Hugging Face Transformers with One Click Using LMDeploy
This article explains how LMDeploy streamlines the deployment of Hugging Face transformer models by adding online conversion, offering an OpenAI‑compatible API server, a Gradio WebUI, and 4‑bit weight‑only quantization with AWQ, providing step‑by‑step commands, code examples, and performance insights.
Introduction
The Hugging Face ecosystem provides easy access to pretrained transformer models via the transformers library, but its inference pipeline lacks KV‑Cache management, causing each dialogue turn to re‑prefill the entire conversation history and reducing throughput. LMDeploy adds engineering optimizations that enable stable, high‑performance inference and a streamlined deployment workflow for Hugging Face models.
One‑Click Online Model Conversion
Before version v0.0.14, LMDeploy required an offline conversion step ( lmdeploy convert) to transform a model into TurboMind’s format. Starting with version v0.1.0a0, LMDeploy can convert models on‑the‑fly, allowing direct loading of Hugging Face checkpoints.
Install the latest release:
pip install 'lmdeploy[all]>=v0.1.0a0'API Server
LMDeploy ships an OpenAI‑compatible RESTful API server that can replace the official OpenAI endpoint.
lmdeploy serve api_server internlm/internlm-chat-20b --model-name internlm-chat-20bAfter launch, Swagger UI is reachable at http://0.0.0.0:23333. The first three endpoints mirror OpenAI’s API; the fourth provides LMDeploy’s interactive mode, which retains conversation history on the server and avoids redundant context decoding.
Web UI
A Gradio‑based Web UI is bundled with LMDeploy for interactive testing. It can be started independently or together with the API server.
lmdeploy serve gradio internlm/internlm-chat-20b --model-name internlm-chat-20bOffline Inference Example (TurboMind)
Loading a model with the TurboMind backend and performing streaming inference:
# load model
from lmdeploy import turbomind as tm
tm_model = tm.TurboMind.from_pretrained('internlm/internlm-chat-20b', model_name='internlm-chat-20b')
generator = tm_model.create_instance()
# process query
query = 'Hello! Today is sunny, it is time to go out'
prompt = tm_model.model.get_prompt(query)
input_ids = tm_model.tokenizer.encode(prompt)
# streaming inference
for outputs in generator.stream_infer(session_id=0, input_ids=[input_ids]):
res, tokens = outputs[0]
response = tm_model.tokenizer.decode(res.tolist())
print(response)4‑Bit Weight‑Only Quantization (AWQ)
Large models often exceed the memory capacity of consumer GPUs (e.g., a 7B model needs ~14 GB). LMDeploy implements the AWQ algorithm for 4‑bit weight‑only quantization, offering a simpler alternative to GPTQ while reducing memory footprint and accelerating inference.
Building a Quantized Model
Quantization consists of two stages: generating calibration parameters and applying them to the model weights.
Generate calibration parameters.
Apply the parameters to quantize the weights.
Commands:
# 1. Generate calibration parameters
lmdeploy lite calibrate \
--model $HF_MODEL \
--calib_dataset 'c4' \
--calib_samples 128 \
--calib_seqlen 2048 \
--work_dir $WORK_DIR
# 2. Quantize model weights
lmdeploy lite auto_awq \
--model $HF_MODEL \
--w_bits 4 \
--w_group_size 128 \
--work_dir $WORK_DIRUsing a Quantized Model
After quantization, the model can be served with the same API Server or Web UI commands, substituting the model identifier with the 4‑bit variant.
# API Server with 4‑bit model
lmdeploy serve api_server internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b
# Web UI with 4‑bit model
lmdeploy serve gradio internlm/internlm-chat-20b-4bit --model-name internlm-chat-20bRepository and Resources
https://github.com/InternLM/lmdeploy
https://github.com/InternLM/lmdeploy#news-
https://huggingface.co/internlm
https://huggingface.co/lmdeploy
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
