Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance
The article details a hands‑on test of the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model running on a single RTX 4090 via llama.cpp. It reports a steady 46 tokens per second generation speed and a 64K context window, walks through a Docker‑based setup step by step, compares the model with GLM‑4.7‑Flash‑AWQ‑4bit, and discusses llama.cpp's limitations for multi‑GPU inference.
The author experimentally runs the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model on a single RTX 4090 using the llama.cpp GGUF format (Q4_K_M) and reports the full workflow and performance figures.
Model download
The model is fetched from ModelScope with the following command:
pip install modelscope
modelscope download --model Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF Qwen3.5-27B.Q4_K_M.gguf --local_dir .
llama.cpp installation
Because the author’s internal environment cannot compile llama.cpp successfully, a pre‑built Docker image is used:
docker pull ghcr.io/ggml-org/llama.cpp:full-cuda
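Before launching, one optional sanity check (not covered in the article) is to confirm that the NVIDIA runtime actually exposes the GPUs to the container; nvidia-smi is injected by the NVIDIA Container Toolkit, and --entrypoint bypasses the image's default launcher:
# nvidia-smi comes from the host driver via the NVIDIA Container Toolkit
docker run --rm --runtime nvidia --gpus all --entrypoint nvidia-smi \
ghcr.io/ggml-org/llama.cpp:full-cuda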
Launch script
The container is pinned to a single GPU (--gpus "device=4"), mounts the model directory, and maps container port 8000 to host port 8005. The command sets a large context size (-c 65536) and offloads up to 99 layers to the GPU (-ngl 99), i.e. the whole model:
docker run --rm --runtime nvidia --gpus "device=4" -v /data/llm-models:/models \
--name qwen35-27 -p 8005:8000 ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/jackrong/Qwen3.5-27B.Q4_K_M.gguf --port 8000 --host 0.0.0.0 \
-c 65536 -ngl 99
The built-in UI is used for front-end interaction; it can be swapped for OpenWebUI, which also supports MCP.
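Besides the built-in UI, the llama.cpp server also exposes an OpenAI-compatible API, so any OpenAI-style client can point at it. A minimal smoke test against the host port mapped above (8005) might look like this; the prompt is just a placeholder:
# OpenAI-compatible chat endpoint served by llama.cpp on the mapped host port
curl http://localhost:8005/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Summarize KV caching in one line."}], "max_tokens": 128}'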
Performance and comparison
Average generation speed is about 46 tokens/s on a single RTX 4090 (a quick way to verify this from the server's own timing stats is sketched below).
The model comfortably handles up to a 64K context window (128K triggers OOM), which is far larger than the 10K limit of the compared GLM‑4.7‑Flash‑AWQ‑4bit model.
Qualitatively, the output is "medium" – it completes core tasks but lacks the fine‑grained detail of GLM‑4.7‑Flash.
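One way to check the throughput figure yourself (a sketch, not part of the original write-up) is to hit the server's native /completion endpoint, whose JSON response includes a timings block with a predicted_per_second field; jq is assumed to be installed:
# Generate ~256 tokens and print the measured decode speed in tokens/s
curl -s http://localhost:8005/completion \
-d '{"prompt": "Explain what a KV cache is.", "n_predict": 256}' \
| jq '.timings.predicted_per_second'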
Concurrency test
Attempts to enable concurrent inference failed. Even after adding flags such as -kvu, -flash-attn on, -b 1024, and increasing threads to -t 48, the server still could not handle parallel requests, suggesting the default four‑stream setting may be the practical limit.
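For reference (not something the article confirms as a fix), llama-server's parallel decoding is controlled explicitly by -np/--parallel, which sets the number of request slots, alongside continuous batching via --cont-batching; the context window is shared across slots, so 65536 tokens with four slots leaves roughly 16K per request. A variation on the launch command above would be the usual starting point:
# Same launch as before, but explicitly requesting 4 parallel slots
# (each slot gets about 65536 / 4 = 16384 tokens of context)
docker run --rm --runtime nvidia --gpus "device=4" -v /data/llm-models:/models \
--name qwen35-27 -p 8005:8000 ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/jackrong/Qwen3.5-27B.Q4_K_M.gguf --port 8000 --host 0.0.0.0 \
-c 65536 -ngl 99 -np 4 --cont-batching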
Limitations of llama.cpp
The author notes that llama.cpp is not optimized for tensor parallelism or batch inference. It is best suited when part or all of the LLM workload is offloaded to the CPU. For multi‑GPU configurations that require efficient batch processing and tensor parallelism, vLLM is recommended instead.
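For comparison, a minimal tensor-parallel vLLM launch (a sketch assuming the original safetensors weights rather than the GGUF file; the path is a placeholder) typically looks like:
# Serve the model across two GPUs with tensor parallelism (path is illustrative)
vllm serve /path/to/qwen3.5-27b-distilled \
--tensor-parallel-size 2 --max-model-len 65536 --port 8000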
Community feedback
Images from the LocalLLaMA community and llama.cpp issue threads illustrate common complaints about the project's lack of multi‑GPU support.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.