Operations 9 min read

Deploy Hugging Face Transformers with One Click Using LMDeploy

This article explains how LMDeploy streamlines the deployment of Hugging Face transformer models by adding online conversion, offering an OpenAI‑compatible API server, a Gradio WebUI, and 4‑bit weight‑only quantization with AWQ, providing step‑by‑step commands, code examples, and performance insights.

Baobao Algorithm Notes
Baobao Algorithm Notes
Baobao Algorithm Notes
Deploy Hugging Face Transformers with One Click Using LMDeploy

Introduction

The Hugging Face ecosystem provides easy access to pretrained transformer models via the transformers library, but its inference pipeline lacks KV‑Cache management, causing each dialogue turn to re‑prefill the entire conversation history and reducing throughput. LMDeploy adds engineering optimizations that enable stable, high‑performance inference and a streamlined deployment workflow for Hugging Face models.

One‑Click Online Model Conversion

Before version v0.0.14, LMDeploy required an offline conversion step ( lmdeploy convert) to transform a model into TurboMind’s format. Starting with version v0.1.0a0, LMDeploy can convert models on‑the‑fly, allowing direct loading of Hugging Face checkpoints.

Install the latest release:

pip install 'lmdeploy[all]>=v0.1.0a0'

API Server

LMDeploy ships an OpenAI‑compatible RESTful API server that can replace the official OpenAI endpoint.

lmdeploy serve api_server internlm/internlm-chat-20b --model-name internlm-chat-20b

After launch, Swagger UI is reachable at http://0.0.0.0:23333. The first three endpoints mirror OpenAI’s API; the fourth provides LMDeploy’s interactive mode, which retains conversation history on the server and avoids redundant context decoding.

Web UI

A Gradio‑based Web UI is bundled with LMDeploy for interactive testing. It can be started independently or together with the API server.

lmdeploy serve gradio internlm/internlm-chat-20b --model-name internlm-chat-20b

Offline Inference Example (TurboMind)

Loading a model with the TurboMind backend and performing streaming inference:

# load model
from lmdeploy import turbomind as tm
tm_model = tm.TurboMind.from_pretrained('internlm/internlm-chat-20b', model_name='internlm-chat-20b')
generator = tm_model.create_instance()

# process query
query = 'Hello! Today is sunny, it is time to go out'
prompt = tm_model.model.get_prompt(query)
input_ids = tm_model.tokenizer.encode(prompt)

# streaming inference
for outputs in generator.stream_infer(session_id=0, input_ids=[input_ids]):
    res, tokens = outputs[0]
    response = tm_model.tokenizer.decode(res.tolist())
    print(response)

4‑Bit Weight‑Only Quantization (AWQ)

Large models often exceed the memory capacity of consumer GPUs (e.g., a 7B model needs ~14 GB). LMDeploy implements the AWQ algorithm for 4‑bit weight‑only quantization, offering a simpler alternative to GPTQ while reducing memory footprint and accelerating inference.

Building a Quantized Model

Quantization consists of two stages: generating calibration parameters and applying them to the model weights.

Generate calibration parameters.

Apply the parameters to quantize the weights.

Commands:

# 1. Generate calibration parameters
lmdeploy lite calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR

# 2. Quantize model weights
lmdeploy lite auto_awq \
  --model $HF_MODEL \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir $WORK_DIR

Using a Quantized Model

After quantization, the model can be served with the same API Server or Web UI commands, substituting the model identifier with the 4‑bit variant.

# API Server with 4‑bit model
lmdeploy serve api_server internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b

# Web UI with 4‑bit model
lmdeploy serve gradio internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b

Repository and Resources

https://github.com/InternLM/lmdeploy

https://github.com/InternLM/lmdeploy#news-

https://huggingface.co/internlm

https://huggingface.co/lmdeploy

LMDeploy architecture diagram
LMDeploy architecture diagram
Quantization workflow diagram
Quantization workflow diagram
Inference performance comparison
Inference performance comparison
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Model DeploymentAI inferenceAPI ServerHugging FaceLMDeployTurboMind
Baobao Algorithm Notes
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.