Deploying GLM‑4.7‑Flash Quantized Model Locally on a Single RTX 4090

This guide walks through downloading the AWQ‑4bit quantized GLM‑4.7‑Flash model, upgrading vLLM, building a custom Docker image, and launching the model on two RTX 4090 GPUs with tuned parameters to avoid OOM, while sharing practical tips and observed performance.


1. Download the model

I chose the AWQ‑4bit quantized version because it supports vLLM, shrinks the original 58 GB checkpoint to 17 GB, and does not noticeably increase hallucinations.

modelscope download --model cyankiwi/GLM-4.7-Flash-AWQ-4bit
Model download page

https://modelscope.cn/models/cyankiwi/GLM-4.7-Flash-AWQ-4bit/files
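That one-liner downloads into ModelScope's default cache directory; since the launch commands below expect the weights under /data/models, it can be cleaner to point the download there directly (the --local_dir flag is available in recent modelscope CLI versions; adjust the path to your own layout):

modelscope download --model cyankiwi/GLM-4.7-Flash-AWQ-4bit --local_dir /data/models/GLM-4.7-Flash-AWQ-4bit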

2. Upgrade vLLM to the nightly build

I did not end up using the official installation method, but the tutorial mentions it, so I gave it a try first.

Dependency conflicts and my outdated system environment produced a string of errors.

pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
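After the install, a quick import check shows which versions are actually being picked up (the exact nightly version strings will differ on your machine):

python -c "import vllm, transformers; print(vllm.__version__, transformers.__version__)"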

CUDA upgrades are covered in a separate tutorial (see the linked article).

The vLLM website (https://vllm.ai/) provides an interactive installer selector.

vLLM installer selector

I chose the vLLM‑Docker approach.

Docker tags
docker pull vllm/vllm-openai:nightly

The nightly image does not yet include transformers 5.x, so I created a custom Dockerfile that installs it:

FROM vllm/vllm-openai:nightly
RUN pip install "transformers>=5.0.0rc2"

Then build the image:

docker build -t glm-4.7-custom .

The resulting image (glm-4.7-custom) is used to run the model.
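To confirm the override took effect, you can run a throwaway container that only prints the transformers version (the base image's entrypoint is the API server, so it has to be overridden; this is just a sanity check, not part of the deployment):

docker run --rm --entrypoint python3 glm-4.7-custom -c "import transformers; print(transformers.__version__)"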

3. Launch the model

I did not try launching vLLM directly on the host; I run it inside Docker instead. For reference, the direct launch command would look like this:

CUDA_VISIBLE_DEVICES=0,1 vllm serve /data/models/GLM-4.7-Flash-AWQ-4bit \
  --tensor-parallel-size 2 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash

My Docker run script (using two RTX 4090 cards) looks like this; note that the volume mount exposes the model directory as /models inside the container, so that is the path passed to --model:

docker run --rm --runtime=nvidia --gpus '"device=0,1"' \
  --name GLM-4.7-Flash -p 3004:8000 -p 5005:8000 \
  -v /data/models/GLM-4.7-Flash-AWQ-4bit:/models \
  glm-4.7-custom \
  --model /models \
  --tensor-parallel-size 2 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 10240 \
  --max-num-seqs 10 \
  --host 0.0.0.0 \
  --port 8000
Docker run output
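Once the container is up, a quick request against the OpenAI-compatible endpoint is an easy way to confirm it is serving (3004 is the host port mapped above; the model name matches --served-model-name):

curl http://localhost:3004/v1/models

curl http://localhost:3004/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "Say hello"}]}'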

The container starts without errors and I connected it to OpenWebUI.

OpenWebUI interface

Observations: the model’s reasoning can exceed 30 seconds, which feels sluggish, but generation speed is excellent.

Generation speed screenshot

Memory usage is shown in the following chart:

GPU memory consumption

The model handles internal‑network troubleshooting and even code generation quite well.

If you encounter looping or repetitive outputs, you can add the following generation parameters (I have not needed them personally):

--temp 1.0 --min-p 0.01 --top-p 0.95 --dry-multiplier 1.1
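Those flags are in llama.cpp style; if you serve through vLLM as above, the rough equivalents are usually passed per request instead. A sketch against the OpenAI-compatible endpoint (min_p and repetition_penalty are vLLM's extra sampling fields; there is no direct counterpart to --dry-multiplier here, so repetition_penalty stands in for it):

curl http://localhost:3004/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "min_p": 0.01,
    "repetition_penalty": 1.1
  }'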
Tags: Docker, vLLM, Local Deployment, GLM-4.7-Flash, AWQ-4bit, Quantized LLM