ICML 2026: From Single‑Threaded Thinking to Native Parallel Reasoning in Agents

The paper introduces Native Parallel Reasoner (NPR), a framework that lets language agents generate and maintain multiple reasoning paths using a three‑stage self‑distillation and parallel reinforcement‑learning training paradigm, achieving up to 4.6× speedup and significant accuracy gains across eight reasoning benchmarks.

Machine Heart

May 18, 2026

Background and Motivation

Large language models have excelled at producing long, fluent text, but traditional chain‑of‑thought reasoning struggles with tasks that require simultaneous exploration of multiple solution paths, self‑reflection, and aggregation. Sequential generation is inefficient, prone to early‑stage bias, and limited by serial computation.

Native Parallel Reasoner (NPR)

The BIGAI NLCo team proposes NPR, a native parallel reasoning engine that enables agents to spawn and maintain several candidate reasoning paths in a single inference pass, performing a "branch + aggregate" operation to synthesize the optimal answer.

Key Innovation

Beyond engineering tricks, NPR introduces a three‑stage training paradigm that combines self‑distillation with parallel reinforcement learning, paired with a dedicated parallel reasoning engine. Together they turn parallel inference from an external strategy into an intrinsic model capability.

Three‑Stage Training Paradigm

Stage 1: Format‑following RL (NPR‑ZERO) – Using DAPO‑style reinforcement learning, the model learns to output a structured parallel format (e.g., <guideline>, <plan>, <step>, <takeaway>) without any external parallel examples.

Stage 2: Self‑Distillation & Parallel Warm‑up (NPR‑BETA) – Rejection sampling with strict filters (outcome correctness and structured parallelism) selects high‑quality trajectories from NPR‑ZERO. The selected trajectories are used for cold‑start parallel SFT, which introduces Parallel‑Aware Attention Masks and Parallel Positional Encoding so that the KV‑Cache can be reused across branches.

Stage 3: Native‑Parallel RL (PAPO) – Parallel‑Aware Policy Optimization (PAPO) directly optimizes branch‑selection policies on the NPR‑Engine, preserving gradients for special tokens and abandoning importance sampling to maintain stable on‑policy updates.
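The structured format from Stage 1 and the strict filter from Stage 2 can be illustrated with a small Python sketch. The summary names the tags but not their nesting rules, so the compliance rule below (at least two <plan> blocks, one <step> per plan, exactly one <takeaway>) is an assumption. The helper names is_schema_compliant, boxed_answer, and rejection_sample are hypothetical, and the answer check assumes the final answer is wrapped in \boxed{...}.

```python
import re

TAGS = ("guideline", "plan", "step", "takeaway")


def blocks(text, tag):
    """Return the bodies of all well-formed <tag>...</tag> blocks."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)


def is_schema_compliant(text, min_branches=2):
    """Check the assumed parallel schema (not the paper's exact rule)."""
    # Every opening tag needs a matching closing tag.
    for tag in TAGS:
        if text.count(f"<{tag}>") != text.count(f"</{tag}>"):
            return False
    plans = blocks(text, "plan")
    steps = blocks(text, "step")
    takeaways = blocks(text, "takeaway")
    # Assumed rule: several plans, one step per plan, a single takeaway.
    return (
        len(plans) >= min_branches
        and len(steps) == len(plans)
        and len(takeaways) == 1
    )


def boxed_answer(text):
    """Return the content of the last \\boxed{...}, or None if absent."""
    found = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return found[-1].strip() if found else None


def rejection_sample(trajectories, reference):
    """Keep only trajectories that are both correct and schema-compliant."""
    return [
        t for t in trajectories
        if is_schema_compliant(t) and boxed_answer(t) == reference
    ]
```

A trajectory with a single <plan> is rejected even when its answer is correct, which mirrors the requirement that kept data be both correct and genuinely parallel.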

Technical Details

Self‑Distillation & Strict Filtering – Only trajectories satisfying both outcome correctness and schema compliance are kept for further training, reducing noise and ensuring learnable parallel data.

Parallel Attention Mask & Positional Encoding – Enables multiple reasoning paths in a single forward pass while sharing KV‑Cache, preventing redundant computation of common prefixes.
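To make the masking idea concrete, here is a minimal sketch of one way such a mask and position scheme could be built. It is not the paper's implementation. It assumes each token carries a segment id (0 for shared tokens such as the common prefix or the final aggregation, k > 0 for branch k), that branch ids are unique across the sequence, and that every branch restarts its positions right after the shared prefix. The function name parallel_mask_and_positions is hypothetical.

```python
def parallel_mask_and_positions(segments):
    """Build an attention mask and position ids for one packed sequence.

    segments[i] == 0 marks a shared token, k > 0 marks a token of branch k.
    mask[i][j] is True when query token i may attend to key token j.
    """
    n = len(segments)
    # Causal attention, restricted so that branches never see each other.
    # Shared tokens are visible to everyone and can see everything earlier.
    mask = [
        [
            j <= i
            and (
                segments[i] == segments[j]
                or segments[i] == 0
                or segments[j] == 0
            )
            for j in range(n)
        ]
        for i in range(n)
    ]

    positions = [0] * n
    next_shared = 0     # next position id for a shared token
    branch_start = 0    # position where the current parallel block begins
    branch_len = {}     # tokens seen so far in each branch of the block
    for i, seg in enumerate(segments):
        if seg == 0:
            if branch_len:
                # Leaving a parallel block: resume after the longest branch.
                next_shared = branch_start + max(branch_len.values())
                branch_len = {}
            positions[i] = next_shared
            next_shared += 1
        else:
            if not branch_len:
                branch_start = next_shared
            positions[i] = branch_start + branch_len.get(seg, 0)
            branch_len[seg] = branch_len.get(seg, 0) + 1
    return mask, positions
```

Because every branch attends only to the shared prefix and to itself, the prefix keys and values need to be computed once and can be reused by all branches in the same forward pass.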

Parallel‑Aware Policy Optimization (PAPO) – Introduces parallel rollout, structured filtering, batch‑level advantage normalization, and gradient preservation for special tokens, addressing the failure of vanilla PPO/DAPO under parallel semantics.
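The batch‑level normalization and the on‑policy update described above can be sketched as follows. This is one literal reading rather than the paper's exact formula: rollouts that fail the structural filter get zero advantage, the remaining rewards are normalized with the mean and standard deviation of the whole batch, and the loss uses no importance ratio or clipping. The function names and the choice of a batch‑mean baseline are assumptions.

```python
from statistics import fmean, pstdev


def batch_normalized_advantages(rewards, valid, eps=1e-6):
    """Normalize rewards over the whole batch, skipping filtered rollouts.

    rewards: one scalar reward per rollout.
    valid:   True if the rollout passed the structural filter.
    """
    kept = [r for r, v in zip(rewards, valid) if v]
    if not kept:
        return [0.0] * len(rewards)
    mean, std = fmean(kept), pstdev(kept)
    # Filtered rollouts contribute no gradient (advantage 0).
    return [
        (r - mean) / (std + eps) if v else 0.0
        for r, v in zip(rewards, valid)
    ]


def on_policy_loss(token_logprobs, advantages):
    """Token-averaged policy-gradient loss with no importance ratio.

    token_logprobs: per rollout, the log-probabilities of its sampled tokens
                    under the current policy.
    advantages:     one advantage per rollout, shared by all of its tokens.
    """
    total_tokens = sum(len(lp) for lp in token_logprobs)
    if total_tokens == 0:
        return 0.0
    weighted = sum(a * sum(lp) for lp, a in zip(token_logprobs, advantages))
    return -weighted / total_tokens
```

In a real trainer the log‑probabilities would be differentiable tensors; plain floats are used here only to keep the arithmetic visible.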

Engineering Improvements (NPR‑Engine)

Budget‑aware KV‑cache recycling to avoid double‑free errors.

Branch‑aware token budgeting that accumulates token usage across active branches.

Format pre‑check and lightweight invariance enforcement to guarantee deterministic branch expansion.
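Branch‑aware token budgeting, as listed above, can be illustrated with a small counter. This is a hypothetical sketch, not the NPR‑Engine code. The class name BranchTokenBudget and its methods are invented for illustration, and the only behavior taken from the text is that usage is summed across all branches rather than measured on the longest one.

```python
class BranchTokenBudget:
    """Track token usage across a shared prefix and several branches."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.shared = 0          # tokens outside any branch
        self.per_branch = {}     # branch id -> tokens generated so far

    def add_shared(self, n=1):
        self.shared += n

    def open_branch(self, branch_id):
        self.per_branch[branch_id] = 0

    def add(self, branch_id, n=1):
        self.per_branch[branch_id] += n

    @property
    def used(self):
        # Sum over every branch, not just the longest one.
        return self.shared + sum(self.per_branch.values())

    def can_generate(self, n=1):
        """True if n more tokens still fit in the budget."""
        return self.used + n <= self.max_tokens
```

With a budget of 10, a 3‑token prefix and two branches of 3 tokens each leave room for exactly one more token, whereas a counter that only looked at the longest path would report 6 tokens used and overrun the limit.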

Experiments and Results

NPR was evaluated on eight reasoning benchmarks (AIME24, AIME25, HMMT25, OlympiadBench, Minerva‑Math, ZebraLogic, AMC23, MATH500). Key findings:

Replacing Multiverse data with self‑extracted NPR‑BETA data raised average scores from 50.1 to 59.0 (+8.9).

Parallel SFT (NPR‑BETA) improved accuracy over sequential SFT by 0.8–5.8 points across datasets.

Parallel RL added further gains, pushing average from 62.0 to 65.0 (+3.0).

Parallel trigger rate reached 100% on all datasets, whereas Multiverse showed large variance.

Speedup ranged from 2.9× on easy tasks (AMC23) to 4.6× on hard tasks (AIME25), demonstrating that deeper exploration benefits more from parallelism.

Overall, NPR consistently outperformed Multiverse (1.3–2.4× faster) and autoregressive baselines, with no evidence of pseudo‑parallel behavior.

Case Study

For geometry problems, NPR generates multiple independent <plan> statements (algebraic, numeric, geometric), expands each via <step>, and merges results in <takeaway>, discarding inconsistent branches and producing a boxed answer.

Conclusion

The proposed framework offers a simple, scalable way to build native parallel reasoners that learn parallel planning and aggregation without external teacher models. Experiments across eight benchmarks show significant accuracy improvements, inference acceleration, and robust parallel trigger rates, indicating that native parallel reasoning is a promising direction for more general and extensible AI agents.

References

Wei et al. (2022) – Chain of Thought Prompting; Dean et al. (2004) – MapReduce; Snell et al. (2025); Yang et al. (2025) – Multiverse; Zhao et al. (2025) – Absolute Zero; Yu et al. (2025) – DAPO; Gilks et al. (2018); Sutton et al. (1999); Zheng et al. (2024); Schulman et al. (2017).
