How Qwen3 Controls Hybrid Reasoning with the enable_thinking Parameter

This article explains how Qwen3 implements hybrid (fast/slow) reasoning by using the enable_thinking flag in the tokenizer's apply_chat_template method, detailing the underlying Jinja2 chat template, example prompts, the effect of toggling the flag, and design considerations for future autonomous thinking control.

Architect

May 14, 2025

Hybrid Reasoning in Qwen3

Hybrid reasoning models that combine a slow, chain‑of‑thought (CoT) stage with a fast response stage are becoming common. Qwen3 is an open‑source example that implements this hybrid behavior through a simple tokenizer flag.

enable_thinking Parameter

Qwen3's chat template reads a boolean variable, enable_thinking, which is passed as a keyword argument to tokenizer.apply_chat_template(). When enable_thinking=True (the default), the model generates a <think> … </think> block before the final answer. Setting the flag to False suppresses the thinking stage, and the model replies directly.

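from transformers import AutoTokenizer

# Any Qwen3 checkpoint ships the same chat template; this one matches the references.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")
messages = [{"role": "user", "content": "Tell me about large language models."}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return a string instead of token ids
    add_generation_prompt=True,  # append the open assistant turn
    enable_thinking=False        # True is the default value
)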

Chat Template Construction

The apply_chat_template() function renders the conversation in ChatML format, inserting the special tokens <|im_start|> and <|im_end|> to delimit each role's turn. For example, given the raw messages:

# Input messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about large language models."},
    {"role": "assistant", "content": "Sure, large language models are …"}
]

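Rendered as ChatML, the messages take the following form. With add_generation_prompt=True, the template also appends a trailing <|im_start|>assistant line (not shown) to cue the next reply: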

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Tell me about large language models.<|im_end|>
<|im_start|>assistant
Sure, large language models are …<|im_end|>
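The ChatML layout above can be reproduced with a few lines of plain Python. This is a minimal sketch, not the real template: the function name render_chatml is hypothetical, and it ignores the tool-call and reasoning-content handling that the actual Qwen3 Jinja2 template performs.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages as ChatML text (simplified sketch)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an assistant turn open so the model writes the next reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about large language models."},
]
prompt = render_chatml(messages, add_generation_prompt=True)
print(prompt)
```

Only the last line of the prompt changes with add_generation_prompt; the earlier turns are rendered identically either way.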

Effect of enable_thinking=False

When the flag is disabled and the conversation ends with a user turn, the template inserts an empty thinking block immediately after the assistant start token:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Tell me about large language models.<|im_end|>
<|im_start|>assistant
<think>

</think>

Jinja2 Template Snippet

{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
{%- endif %}

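This snippet is the only place where enable_thinking influences the prompt. When the flag is true or not passed at all, the template adds nothing after the assistant start token, and the model opens its own <think> … </think> block. When the flag is false, the template pre-fills an empty block, so the model treats the thinking stage as already finished. That single difference toggles the hybrid reasoning mode.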
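The same branch can be mirrored in Python to make the three cases (flag true, flag missing, flag false) explicit. This is a sketch for illustration only; generation_prompt and the two constants are hypothetical names, and None stands in for an undefined Jinja2 variable.

```python
ASSISTANT_START = "<|im_start|>assistant\n"
EMPTY_THINK = "<think>\n\n</think>\n\n"


def generation_prompt(add_generation_prompt=True, enable_thinking=None):
    """Mimic the generation-prompt branch of the chat template."""
    if not add_generation_prompt:
        return ""
    prompt = ASSISTANT_START
    # Only an explicit False pre-fills the empty block; True and "not passed"
    # (None here) both leave the assistant turn open for the model.
    if enable_thinking is False:
        prompt += EMPTY_THINK
    return prompt


print(repr(generation_prompt(enable_thinking=True)))
print(repr(generation_prompt(enable_thinking=False)))
```

The identity check against False reflects the template's "is defined and ... is false" condition: a missing flag behaves like True.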

Switching Between Modes

Default behavior (enable_thinking=True) leaves the assistant turn open, so the model generates its own <think> … </think> segment and performs a CoT step before answering.

Setting enable_thinking=False injects an empty <think></think> segment, signalling that the thinking stage is finished and the model should produce a fast reply.
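Because the thinking segment appears in the generated text only in thinking mode, downstream code usually needs to separate it from the final answer. The helper below is a minimal text-based sketch; split_thinking is a hypothetical name, and it assumes the decoded output keeps the <think> and </think> markers as plain text.

```python
def split_thinking(generated):
    """Split decoded model output into (thinking, answer)."""
    head, sep, tail = generated.partition("</think>")
    if not sep:
        # No closing tag: the model answered directly (non-thinking mode).
        return "", generated.strip()
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()


sample = "<think>\nLLMs are neural networks.\n</think>\n\nThey are large models."
thinking, answer = split_thinking(sample)
print("thinking:", thinking)
print("answer:", answer)
```

With enable_thinking=False the empty block sits in the prompt rather than in the output, so the helper returns an empty thinking string and the whole text as the answer.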

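Soft Switches: /think and /no_think

Beyond the flag, Qwen3 supports per‑turn control through the soft switches /think and /no_think. These are ordinary text instructions rather than special tokens. Placing one in a system or user message toggles the thinking mode for the subsequent generation, and in a multi‑turn conversation the model follows the most recent switch. The soft switches apply only while enable_thinking=True; once the flag is False, the pre‑filled empty block keeps the model in non‑thinking mode.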
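The precedence between the flag and the in-message switches can be written down as a small function. This is only a sketch of the expected behavior from the client's point of view, under the assumption that the most recent switch in a system or user message wins; the model itself follows the switches because it was trained to, not because of any such code. The name effective_thinking is hypothetical.

```python
def effective_thinking(messages, enable_thinking=True):
    """Predict whether the next reply should contain a thinking stage."""
    if not enable_thinking:
        # The hard flag wins: the empty <think> block is already pre-filled.
        return False
    mode = True  # thinking is the default
    for message in messages:
        if message["role"] not in ("system", "user"):
            continue
        content = message["content"]
        on = content.rfind("/think")
        off = content.rfind("/no_think")
        if on == -1 and off == -1:
            continue  # no switch in this message, keep the previous mode
        mode = on > off  # the later switch within the message wins
    return mode


conversation = [
    {"role": "user", "content": "Quick question /no_think"},
    {"role": "assistant", "content": "Answer"},
    {"role": "user", "content": "Now reason carefully /think"},
]
print(effective_thinking(conversation))
```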

Training Pipeline

Mode switching works because of how Qwen3 is post‑trained. In Stage 3 of the post‑training pipeline, long chain‑of‑thought data is mixed with standard instruction‑tuning data, and the model is fine‑tuned on the mixture. This joint fine‑tuning integrates the non‑thinking mode into the thinking model: the same weights learn both to generate a thinking block followed by an answer and to answer directly, which ensures a smooth combination of reasoning and quick‑response capabilities.

Future Directions

Further reinforcement‑learning (RL) fine‑tuning could allow the model to decide autonomously when to activate the thinking stage, potentially leading to emergent behavior where the model judges task difficulty and self‑selects the appropriate inference mode.

References:

Official blog: https://qwenlm.github.io/zh/blog/qwen3/#%E5%BC%80%E5%A7%8B%E4%BD%BF%E7%94%A8-qwen3

Tokenizer config (Jinja2 template): https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/tokenizer_config.json
