Deploying the GLM‑4.7‑Flash Quantized Model Locally on Dual RTX 4090 GPUs
This guide walks through downloading the AWQ‑4bit quantized GLM‑4.7‑Flash model, upgrading vLLM, building a custom Docker image, and launching the model on two RTX 4090 GPUs with tuned parameters to avoid OOM, while sharing practical tips and observed performance.
1. Download the model
I chose the AWQ‑4bit quantized version because it supports vLLM, shrinks the original 58 GB checkpoint to 17 GB, and does not noticeably increase hallucinations.
modelscope download --model cyankiwi/GLM-4.7-Flash-AWQ-4bit

Model page: https://modelscope.cn/models/cyankiwi/GLM-4.7-Flash-AWQ-4bit/files
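Before moving on, it is worth confirming the download actually matches the ~17 GB quoted above. A minimal sketch (the `/data/models/...` path is the one used in the Docker run script later in this guide):

```python
from pathlib import Path

def checkpoint_size_gb(model_dir: str) -> float:
    """Sum the sizes of all files under the model directory, in GB."""
    total = sum(f.stat().st_size for f in Path(model_dir).rglob("*") if f.is_file())
    return total / 1e9

# After the download finishes, the AWQ-4bit checkpoint should come to roughly 17 GB:
# print(checkpoint_size_gb("/data/models/GLM-4.7-Flash-AWQ-4bit"))
```

A markedly smaller total usually means an interrupted download; re-running the `modelscope download` command resumes it.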
2. Upgrade vLLM to the nightly build
I did not initially adopt the official method, but the tutorial mentions it, so I tried it. Dependency conflicts and an outdated system environment caused many errors.
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git

CUDA upgrades are covered in a separate tutorial (see the linked article).
The vLLM website (https://vllm.ai/) provides an interactive installer selector.
I chose the vLLM‑Docker approach.
docker pull vllm/vllm-openai:nightly

The nightly image does not yet support transformers 5.x, so I created a custom Dockerfile:
FROM vllm/vllm-openai:nightly
RUN pip install "transformers>=5.0.0rc2"

Then build the image:

docker build -t glm-4.7-custom .

The resulting image (glm-4.7-custom) is used to run the model.
3. Launch the model
I did not try the plain vLLM binary; instead I run it inside Docker.
CUDA_VISIBLE_DEVICES=0,1 vllm serve \
--model /data/models/GLM-4.7-Flash-AWQ-4bit \
--tensor-parallel-size 2 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash

My Docker run script (using two RTX 4090 cards) looks like this:
docker run --rm --runtime=nvidia --gpus '"device=0,1"' \
--name GLM-4.7-Flash -p 3004:8000 -p 5005:8000 \
-v /data/models/GLM-4.7-Flash-AWQ-4bit:/models \
glm-4.7-custom \
--model /models \
--tensor-parallel-size 2 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--max-model-len 10240 \
--max-num-seqs 10 \
--host 0.0.0.0 \
--port 8000

The container starts without errors and I connected it to OpenWebUI.
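The --max-model-len 10240 and --max-num-seqs 10 flags are what keep the KV cache from blowing past the 24 GB per card. The arithmetic behind that choice can be sketched as follows; note the layer count, KV-head count, and head dimension below are illustrative placeholders, not GLM‑4.7‑Flash's actual published config:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                max_model_len: int, max_num_seqs: int,
                bytes_per_elem: int = 2) -> float:
    """Back-of-envelope KV-cache footprint if every sequence filled its window.

    Per token, each layer stores one K and one V vector per KV head
    (hence the factor of 2), at bytes_per_elem bytes each (2 for fp16/bf16).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * max_model_len * max_num_seqs / 1e9

# Illustrative numbers only -- a hypothetical 40-layer model with
# 8 KV heads of dim 128, at the flags used in the run script above:
print(kv_cache_gb(40, 8, 128, max_model_len=10240, max_num_seqs=10))
# → 16.777216
```

The point of the sketch is the scaling: halving --max-model-len or --max-num-seqs halves the worst-case cache, which is the knob to turn first when you hit OOM.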
Observations: the model’s reasoning can exceed 30 seconds, which feels sluggish, but generation speed is excellent.
Memory usage is shown in the following chart:
The model handles internal‑network troubleshooting and even code generation quite well.
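To exercise the endpoint outside OpenWebUI, any OpenAI-compatible client works. A minimal sketch of the request body (host port 3004 and the served model name come from the run script above; whether the server honors extra sampling fields is an assumption worth verifying against your vLLM version):

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# exposed by the container (host port 3004 in the run script above).
payload = {
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    # Per-request sampling knobs; repetition_penalty is an extra field
    # that not every OpenAI-compatible server accepts.
    "temperature": 1.0,
    "top_p": 0.95,
    "repetition_penalty": 1.1,
}
body = json.dumps(payload)
# Send with e.g.:
#   curl -X POST http://localhost:3004/v1/chat/completions \
#        -H "Content-Type: application/json" -d "$body"
```

Passing sampling parameters per request like this avoids baking them into the server launch command.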
If you encounter looping or repetitive outputs, you can add the following generation parameters (I have not needed them personally):
--temp 1.0 --min-p 0.01 --top-p 0.95 --dry-multiplier 1.1

Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
