Deploying the GLM‑4.7‑Flash Quantized Model Locally on Dual RTX 4090 GPUs
This guide walks through downloading the AWQ‑4bit quantized GLM‑4.7‑Flash model, upgrading vLLM, building a custom Docker image, and launching the model on two RTX 4090 GPUs with tuned parameters to avoid OOM, while sharing practical tips and observed performance.
1. Download the model
I chose the AWQ‑4bit quantized version because it supports vLLM, shrinks the original 58 GB checkpoint to 17 GB, and does not noticeably increase hallucinations.
modelscope download --model cyankiwi/GLM-4.7-Flash-AWQ-4bit

Model page: https://modelscope.cn/models/cyankiwi/GLM-4.7-Flash-AWQ-4bit/files
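Before moving on, it is worth confirming the download actually matches the ~17 GB quoted above. A minimal sketch (the `/data/models/...` path is the one used in the Docker run script later in this guide):

```python
from pathlib import Path

def checkpoint_size_gb(model_dir: str) -> float:
    """Sum the sizes of all files under the model directory, in GB."""
    total = sum(f.stat().st_size for f in Path(model_dir).rglob("*") if f.is_file())
    return total / 1e9

# After the download finishes, the AWQ-4bit checkpoint should come to roughly 17 GB:
# print(checkpoint_size_gb("/data/models/GLM-4.7-Flash-AWQ-4bit"))
```

A markedly smaller total usually means an interrupted download; re-running the `modelscope download` command resumes it.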
2. Upgrade vLLM to the nightly build
I did not initially adopt the official method, but the tutorial mentions it, so I tried it. Dependency conflicts and an outdated system environment caused many errors.
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git

CUDA upgrades are covered in a separate tutorial (see the linked article).
The vLLM website (https://vllm.ai/) provides an interactive installer selector.
I chose the vLLM‑Docker approach.
docker pull vllm/vllm-openai:nightly

The nightly image does not yet support transformers 5.x, so I created a custom Dockerfile:
FROM vllm/vllm-openai:nightly
RUN pip install "transformers>=5.0.0rc2"

Then build the image:

docker build -t glm-4.7-custom .

The resulting image (glm-4.7-custom) is used to run the model.
3. Launch the model
I did not try the plain vLLM binary; instead I run it inside Docker.
CUDA_VISIBLE_DEVICES=0,1 vllm serve \
--model /data/models/GLM-4.7-Flash-AWQ-4bit \
--tensor-parallel-size 2 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash

My Docker run script (using two RTX 4090 cards) looks like this:
docker run --rm --runtime=nvidia --gpus '"device=0,1"' \
--name GLM-4.7-Flash -p 3004:8000 -p 5005:8000 \
-v /data/models/GLM-4.7-Flash-AWQ-4bit:/models \
glm-4.7-custom \
--model /models \
--tensor-parallel-size 2 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--max-model-len 10240 \
--max-num-seqs 10 \
--host 0.0.0.0 \
--port 8000

The container starts without errors and I connected it to OpenWebUI.
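The --max-model-len 10240 and --max-num-seqs 10 flags are what keep the KV cache from blowing past the 24 GB per card. The arithmetic behind that choice can be sketched as follows; note the layer count, KV-head count, and head dimension below are illustrative placeholders, not GLM‑4.7‑Flash's actual published config:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                max_model_len: int, max_num_seqs: int,
                bytes_per_elem: int = 2) -> float:
    """Back-of-envelope KV-cache footprint if every sequence filled its window.

    Per token, each layer stores one K and one V vector per KV head
    (hence the factor of 2), at bytes_per_elem bytes each (2 for fp16/bf16).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * max_model_len * max_num_seqs / 1e9

# Illustrative numbers only -- a hypothetical 40-layer model with
# 8 KV heads of dim 128, at the flags used in the run script above:
print(kv_cache_gb(40, 8, 128, max_model_len=10240, max_num_seqs=10))
# → 16.777216
```

The point of the sketch is the scaling: halving --max-model-len or --max-num-seqs halves the worst-case cache, which is the knob to turn first when you hit OOM.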
Observations: the model’s reasoning can exceed 30 seconds, which feels sluggish, but generation speed is excellent.
Memory usage is shown in the following chart:
The model handles internal‑network troubleshooting and even code generation quite well.
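To exercise the endpoint outside OpenWebUI, any OpenAI-compatible client works. A minimal sketch of the request body (host port 3004 and the served model name come from the run script above; whether the server honors extra sampling fields is an assumption worth verifying against your vLLM version):

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# exposed by the container (host port 3004 in the run script above).
payload = {
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    # Per-request sampling knobs; repetition_penalty is an extra field
    # that not every OpenAI-compatible server accepts.
    "temperature": 1.0,
    "top_p": 0.95,
    "repetition_penalty": 1.1,
}
body = json.dumps(payload)
# Send with e.g.:
#   curl -X POST http://localhost:3004/v1/chat/completions \
#        -H "Content-Type: application/json" -d "$body"
```

Passing sampling parameters per request like this avoids baking them into the server launch command.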
If you encounter looping or repetitive outputs, you can add the following generation parameters (I have not needed them personally):
--temp 1.0 --min-p 0.01 --top-p 0.95 --dry-multiplier 1.1

Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
