Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance

The article details a hands‑on test of the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model running on a single RTX 4090 via llama.cpp, showing a steady 46 tokens per second generation speed, a 64K context window, and a step‑by‑step Docker‑based setup while comparing it to GLM‑4.7‑Flash‑AWQ‑4bit and discussing llama.cpp’s limitations for multi‑GPU inference.

Old Zhang's AI Learning

The author experimentally runs the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model on a single RTX 4090 using the llama.cpp GGUF format (Q4_K_M) and reports the full workflow and performance figures.

Model download

The model is fetched from ModelScope with the following command:

pip install modelscope
modelscope download --model Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF Qwen3.5-27B.Q4_K_M.gguf --local_dir .
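As a quick sanity check (my own addition, not part of the author's workflow), the downloaded file can be verified before launching the server: every valid GGUF file begins with the four-byte ASCII magic "GGUF". The `check_gguf` helper below is hypothetical, not part of modelscope or llama.cpp.

```shell
# Sanity check: a valid GGUF file begins with the ASCII magic "GGUF".
# check_gguf is a throwaway helper, not a modelscope or llama.cpp command.
check_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

if check_gguf ./Qwen3.5-27B.Q4_K_M.gguf; then
  echo "GGUF magic OK"
else
  echo "file missing or truncated; re-run the modelscope download" >&2
fi
```

This catches the common failure mode of an interrupted download producing a partial file that llama.cpp then rejects at load time.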

llama.cpp installation

Because the author's internal environment cannot compile llama.cpp successfully, a pre-built Docker image is used instead (note that the launch script below runs the `server-cuda` variant, which Docker will pull automatically if it is absent):

docker pull ghcr.io/ggml-org/llama.cpp:full-cuda

Launch script

The container is started with GPU-only execution, pinned to GPU index 4, mounting the model directory and mapping container port 8000 to host port 8005. The command sets a large context size (-c 65536, i.e. 64K tokens) and offloads up to 99 layers to the GPU (-ngl 99, more than the model has, so every layer runs on the GPU):

docker run --rm --runtime nvidia --gpus "device=4" -v /data/llm-models:/models \
  --name qwen35-27 -p 8005:8000 ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/jackrong/Qwen3.5-27B.Q4_K_M.gguf --port 8000 --host 0.0.0.0 \
  -c 65536 -ngl 99
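Once the container is up, the server can be smoke-tested from the host. This is my own check, not from the article; it assumes the port mapping above (host 8005 → container 8000) and uses llama.cpp's built-in `/health` endpoint and OpenAI-compatible chat API.

```shell
# Smoke test against the running container (host port 8005).
# /health and /v1/chat/completions are endpoints of llama.cpp's HTTP server.
curl -s http://localhost:8005/health

curl -s http://localhost:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```

A JSON response from the second call confirms that the model loaded and that any OpenAI-compatible client can be pointed at the same endpoint.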

The server's built-in web UI is used for front-end interaction; it can also be swapped out for OpenWebUI, which additionally supports MCP.
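For readers who prefer OpenWebUI, a minimal sketch of wiring it to the llama.cpp server might look like the following. This is my assumption, not the author's exact setup; `OPENAI_API_BASE_URL` is OpenWebUI's environment variable for an OpenAI-compatible backend, and `host.docker.internal` lets the container reach the host.

```shell
# Run OpenWebUI and point it at the llama.cpp server on host port 8005.
# The --add-host line makes host.docker.internal resolve on Linux.
docker run -d --name open-webui -p 3000:8080 \
  --add-host host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8005/v1 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

The UI is then reachable at http://localhost:3000 and will list the llama.cpp-served model as a backend.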

Performance and comparison

Average generation speed is about 46 tokens/s on a single RTX 4090.
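To put that figure in practical terms, here is a throwaway calculation (my own, using the reported 46 tokens/s) for how long a response of a given length takes to generate:

```shell
# Estimate generation time for a given output length at 46 tokens/s.
tokens_to_seconds() {
  awk -v n="$1" -v tps=46 'BEGIN { printf "%.1f\n", n / tps }'
}

tokens_to_seconds 1024   # a ~1K-token answer takes roughly 22 seconds
```

So even long answers stay well within interactive latency on a single card.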

The model comfortably handles a 64K context window (128K triggers an out-of-memory error), far larger than the roughly 10K limit of the compared GLM-4.7-Flash-AWQ-4bit model.

Qualitatively, the output is "medium": it completes core tasks but lacks the fine-grained detail of GLM-4.7-Flash.

Concurrency test

Attempts to enable concurrent inference failed. Even after adding flags such as -kvu (unified KV cache), --flash-attn on, -b 1024, and raising the thread count with -t 48, the server still could not handle parallel requests, suggesting the default four-stream setting may be the practical limit.
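A reconstruction of that attempt might look as follows; this is my sketch, since the article does not show the exact command, and the flag spellings assume a recent llama.cpp build (-np requests parallel slots, which the article's flags alone do not set).

```shell
# Relaunch with the concurrency-related flags discussed above, plus -np 4
# to explicitly request four parallel slots (sketch, not the author's command).
docker run --rm --runtime nvidia --gpus "device=4" -v /data/llm-models:/models \
  --name qwen35-27 -p 8005:8000 ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/jackrong/Qwen3.5-27B.Q4_K_M.gguf --port 8000 --host 0.0.0.0 \
  -c 65536 -ngl 99 -np 4 -kvu --flash-attn on -b 1024 -t 48

# Crude concurrency probe: fire four requests at once and wait for all.
for i in 1 2 3 4; do
  curl -s http://localhost:8005/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Count to ten.", "max_tokens": 32}' &
done
wait
```

If requests serialize rather than overlap, the server is effectively running single-stream regardless of the flags.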

Limitations of llama.cpp

The author notes that llama.cpp is not optimized for tensor parallelism or batch inference. It is best suited when part or all of the LLM workload is offloaded to the CPU. For multi‑GPU configurations that require efficient batch processing and tensor parallelism, vLLM is recommended instead.
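For contrast, a minimal vLLM launch with tensor parallelism might look like this. The model identifier and GPU count are hypothetical, and note that vLLM normally loads HF-format (safetensors) weights, so the GGUF file downloaded above would not be reused here.

```shell
# vLLM with 2-way tensor parallelism across two GPUs (sketch; model id is
# illustrative). vllm serve exposes an OpenAI-compatible API on --port.
pip install vllm
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --port 8000
```

This is the configuration class llama.cpp is not built for: sharded weights with batched, continuous scheduling across GPUs.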

Community feedback

Images from the LocalLLaMA community and llama.cpp issue threads illustrate common complaints about the project's lack of multi‑GPU support.

Tags: Docker, LLM inference, RTX 4090, llama.cpp, Claude Opus, Qwen3.5
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
