How to Turn Thinking Mode On or Off for Qwen3.5 Models in Ollama, LM Studio, llama.cpp, and vLLM
This guide shows step‑by‑step how to enable or disable the thinking mode of Qwen3.5 series large language models across Ollama, LM Studio (GGUF and MLX), llama.cpp, and vLLM/SGLang using command‑line flags, custom model YAML files, and API parameters.
Qwen3.5 series large language models can be run with or without a "thinking" phase that outputs intermediate reasoning before the final answer. The article collects the most common ways to toggle this feature for four popular runtimes.
Ollama
Ollama provides the simplest switch: add --think=false to the run command to disable thinking, or omit it to keep thinking enabled.
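Beyond the CLI flag, the toggle is also exposed over Ollama's REST API as a top-level think field on chat requests. The sketch below only builds and prints the request body; the model tag and the commented-out curl target (Ollama's default port, 11434) are assumptions to adapt to your own setup.

```shell
# JSON body that disables thinking for a single chat turn.
# The model tag is a placeholder; use whatever tag you pulled locally.
BODY='{"model":"qwen3.5:0.8b","messages":[{"role":"user","content":"Hello"}],"think":false,"stream":false}'

# To actually send it to a locally running Ollama server, uncomment:
# curl -s http://localhost:11434/api/chat -d "$BODY"
echo "$BODY"
```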
Command-line example:
ollama run qwen3.5:0.8b --think=false

LM Studio
LM Studio defines virtual models through a model.yaml file, which can expose a custom field named enableThinking. Setting its default value to false creates a "no-think" variant. The example below wraps an MLX build, but the same file works for GGUF models.
model: mlx-community/Qwen3.5-0.8B-MLX-4bit-no-think
base: mlx-community/Qwen3.5-0.8B-MLX-4bit
metadataOverrides:
  reasoning: false
customFields:
  - key: enableThinking
    displayName: "Enable Thinking"
    description: "Whether to allow thinking output before the final answer"
    type: boolean
    defaultValue: false
    effects:
      - type: setJinjaVariable
        variable: enable_thinking

After adding the file, LM Studio lists two models (the original and the virtual "no-think" one), and the custom field appears in the UI, allowing the user to toggle thinking on or off.
When the switch is on, the model shows a long reasoning output; when off, it returns the answer instantly.
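Because the "no-think" variant is listed as a separate model, it can also be selected by name through LM Studio's OpenAI-compatible local server. The sketch below only builds and prints the request body; the model identifier and the commented-out curl target (LM Studio's default port, 1234) are assumptions to check against your install.

```shell
# Request the no-think variant by its model identifier (a placeholder;
# copy the exact identifier LM Studio shows for the virtual model).
BODY='{"model":"qwen3.5-0.8b-mlx-4bit-no-think","messages":[{"role":"user","content":"Hello"}]}'

# To actually send it to LM Studio's local server, uncomment:
# curl -s http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d "$BODY"
echo "$BODY"
```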
llama.cpp
The author has not tested this runtime, but according to an Unsloth tutorial the default mode is non-thinking. The commands for the two modes are:
# Non‑Thinking mode (default, recommended for daily use)
./build/bin/llama-server \
-m ./models/Qwen3.5-9B-Q4_K_M.gguf \
--ctx-size 16384 \
--port 8080 \
--n-gpu-layers 35
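Some recent llama-server builds also accept a per-request chat_template_kwargs field in the request body, mirroring the --chat-template-kwargs startup flag; treat this as an assumption to verify against your build. The sketch below only builds and prints the request body; the commented-out curl target matches the port 8080 used in the server commands.

```shell
# Request thinking on a single call, overriding the template default.
BODY='{"messages":[{"role":"user","content":"Hello"}],"chat_template_kwargs":{"enable_thinking":true}}'

# To actually send it to the running llama-server, uncomment:
# curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$BODY"
echo "$BODY"
```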
# Thinking mode
./build/bin/llama-server \
-m ./models/Qwen3.5-9B-Q4_K_M.gguf \
--ctx-size 16384 \
--port 8080 \
--n-gpu-layers 35 \
--chat-template-kwargs '{"enable_thinking":true}'

vLLM / SGLang
During deployment the author could not find a runtime flag to disable thinking. The only available method is to set the flag when calling the API:
{
"chat_template_kwargs": {"enable_thinking": false}
}

The article concludes with a brief thank-you note.
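That fragment goes into the body of an ordinary chat-completions request. The sketch below only builds and prints such a body; the served model name and the commented-out curl target (vLLM's default port, 8000) are assumptions about the deployment.

```shell
# Full request body with thinking disabled via chat_template_kwargs.
# The model name is a placeholder for whatever name vLLM is serving under.
BODY='{"model":"Qwen3.5-9B","messages":[{"role":"user","content":"Hello"}],"chat_template_kwargs":{"enable_thinking":false}}'

# To actually send it to the vLLM OpenAI-compatible server, uncomment:
# curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "$BODY"
echo "$BODY"
```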
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.