ICML 2026: Teaching Large Models to Think and Speak – Turning “When to Speak” into a Learnable Strategy

The paper “When to Think, When to Speak” introduces Side‑by‑Side Interleaved Reasoning, a learnable disclosure policy that lets LLMs alternate between internal thinking and user‑visible answer fragments, reducing content latency while preserving or improving accuracy on math and scientific QA benchmarks.

Machine Heart

May 18, 2026

Current streaming LLM interfaces expose every generated token to the user, so what the model generates and what the user sees are tightly coupled. The result is a “silence tax”: letting the model reason longer before it answers yields more reliable answers but leaves the user waiting, while answering early risks premature commitments. This coupling hurts tasks that require lengthy intermediate reasoning, such as math, science, and code.

The authors propose Side‑by‑Side (SxS) Interleaved Reasoning, which turns the decision of “when to disclose” into a learnable policy. Within a single autoregressive context the model can emit two kinds of tokens: think tokens, which continue internal reasoning and are not shown to the user, and speak tokens, which reveal answer fragments that are already supported by the current reasoning prefix. This creates a controllable visibility stream without changing the underlying architecture.
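To make the two token kinds concrete, here is a minimal Python sketch of how a front end might split one interleaved stream into a hidden channel and a user‑visible channel. The marker strings `<think>` and `<speak>` and the names `Streams` and `route_tokens` are hypothetical, chosen for illustration; this summary does not give the paper's actual token vocabulary or decoding code.

```python
from dataclasses import dataclass, field
from typing import Iterable, List

# Hypothetical control markers (assumption): the real token names are not given here.
THINK = "<think>"
SPEAK = "<speak>"


@dataclass
class Streams:
    hidden: List[str] = field(default_factory=list)   # internal reasoning, never shown
    visible: List[str] = field(default_factory=list)  # answer fragments shown to the user


def route_tokens(tokens: Iterable[str], start_mode: str = THINK) -> Streams:
    """Split a single autoregressive token stream into hidden and visible parts."""
    mode = start_mode
    out = Streams()
    for tok in tokens:
        if tok in (THINK, SPEAK):
            # A control marker switches the channel and is not itself emitted.
            mode = tok
            continue
        if mode == SPEAK:
            out.visible.append(tok)
        else:
            out.hidden.append(tok)
    return out


demo = [
    THINK, "2x=6", "so", "x=3",
    SPEAK, "x", "=", "3.",
    THINK, "check:", "2*3=6",
    SPEAK, "Verified.",
]
streams = route_tokens(demo)
print(" ".join(streams.visible))  # prints: x = 3. Verified.
```

The point of the sketch is that both channels come from the same context, so the model's later reasoning can still condition on what it has already said.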

The training pipeline has three stages. First, the authors construct entailment‑aligned interleaved trajectories by splitting standard prompt → reasoning → response triples into fragments and labeling which answer prefixes are entailed by the current reasoning prefix. Second, supervised fine‑tuning (SFT) teaches the model the think / speak format. Third, reinforcement learning with the GRPO algorithm recovers the accuracy lost to the new token distribution, rewarding correct final answers while encouraging early, supported disclosures.
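The third stage can be sketched roughly as follows. GRPO scores each sampled trajectory relative to the other samples drawn for the same prompt, using the group mean and standard deviation of their rewards. The reward shape below (`sxs_reward`, its arguments, and the `early_weight` coefficient) is an assumption made for illustration only; the summary says the reward favors correct final answers and early, supported disclosures but does not give the formula.

```python
from statistics import mean, pstdev
from typing import List


def sxs_reward(
    correct: bool,
    supported_fraction: float,
    first_speak_pos: float,
    early_weight: float = 0.2,
) -> float:
    """Hypothetical scalar reward for one sampled trajectory (not the paper's formula).

    correct:            whether the final answer matches the reference.
    supported_fraction: share of spoken fragments judged entailed by prior reasoning, in [0, 1].
    first_speak_pos:    position of the first visible fragment as a fraction of the
                        trajectory length, in [0, 1] (0 means it spoke immediately).
    """
    if not 0.0 <= supported_fraction <= 1.0 or not 0.0 <= first_speak_pos <= 1.0:
        raise ValueError("fractions must lie in [0, 1]")
    base = 1.0 if correct else 0.0
    # The bonus only pays out for disclosures that are both early and supported.
    early_bonus = early_weight * supported_fraction * (1.0 - first_speak_pos)
    return base + early_bonus


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled trajectories for the same prompt.
group = [
    sxs_reward(True, 1.0, 0.1),   # correct, speaks early, fully supported
    sxs_reward(True, 1.0, 0.9),   # correct, but speaks late
    sxs_reward(False, 0.5, 0.1),  # wrong, early but only half supported
    sxs_reward(False, 0.0, 1.0),  # wrong, nothing supported
]
print(group_relative_advantages(group))
```

Keeping the disclosure bonus small relative to the correctness term is one way to avoid rewarding early but wrong answers; the weighting actually used in the paper may differ.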

Experiments on two models from the Qwen3 family, the MoE Qwen3‑30B‑A3B and the dense Qwen3‑4B, cover the AIME‑25 math benchmark and the cross‑domain scientific QA set GPQA‑Diamond. Besides final accuracy, the authors report Average Inter‑Response Wait (AIRW), a token‑level proxy for content latency. Notable results include:

Qwen3‑4B SxS RL Final: 80.0 % accuracy on AIME‑25 (vs 73.8 % for standard CoT) and AIRW reduced from 21,316 to 8,519 tokens.

GPQA‑Diamond: 49.3 % accuracy (vs 19.0 %) and AIRW reduced from 16,338 to 7,738 tokens.
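To put the AIRW figures in perspective, the sketch below does two things. First, it computes a wait metric under one plausible reading of AIRW: the mean number of hidden tokens generated before each visible fragment. That definition is an assumption, since this summary only describes AIRW as a token‑level proxy, and the function names and toy segment lengths are illustrative. Second, it computes the relative reductions implied by the numbers reported above.

```python
from typing import Sequence, Tuple


def airw(segments: Sequence[Tuple[str, int]]) -> float:
    """Assumed AIRW reading: mean count of hidden tokens preceding each visible segment.

    segments: ordered (kind, n_tokens) pairs, where kind is "think" or "speak".
    """
    waits = []
    pending = 0  # hidden tokens generated since the last visible segment
    for kind, n_tokens in segments:
        if kind == "think":
            pending += n_tokens
        elif kind == "speak":
            waits.append(pending)
            pending = 0
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    if not waits:
        raise ValueError("trajectory has no visible segment")
    return sum(waits) / len(waits)


def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Toy trajectories (made-up lengths): same total reasoning, different disclosure.
standard_cot = [("think", 9000), ("speak", 300)]
interleaved = [("think", 3000), ("speak", 100),
               ("think", 4000), ("speak", 100),
               ("think", 2000), ("speak", 100)]
print(airw(standard_cot))  # 9000.0
print(airw(interleaved))   # 3000.0

# Relative reductions implied by the AIRW figures reported above.
print(round(relative_reduction(21_316, 8_519), 3))  # 0.6   (AIME-25)
print(round(relative_reduction(16_338, 7_738), 3))  # 0.526 (GPQA-Diamond)
```

On the reported figures, that is roughly a 60 % drop in AIRW on AIME‑25 and roughly a 53 % drop on GPQA‑Diamond.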

Additional analyses on LiveCodeBench and KOR‑Bench show the same trend: SxS does not always maximize raw accuracy, but it consistently yields better post‑training behavior and lower content latency, especially for smaller models.

The approach’s practical value lies in turning a UI‑level engineering problem into a model‑level learning problem. Products can offer earlier, trustworthy updates without architectural changes, and training pipelines gain a new objective—learning when to speak responsibly.

Limitations are acknowledged. AIRW is a token‑level proxy and does not capture real wall‑clock latency, which depends on batching, network, and front‑end rendering. SFT‑only interleaved models suffer noticeable accuracy drops, so the RL recovery stage is essential and adds training cost. Finer‑grained disclosure improves perceived responsiveness but incurs higher training overhead, so choosing a disclosure granularity remains a Pareto trade‑off between accuracy and content delay.
