Why Reasoning and Tool-Use Clash in Agentic RL—and How DART Solves It
Recent studies reveal that in Agentic RL, jointly training reasoning and tool-use on shared parameters creates a persistent negative interaction, with gradients nearly orthogonal, limiting performance; a disentangled tuning approach (DART) using separate LoRA adapters isolates the two abilities and restores gains across benchmarks.
Agentic Reinforcement Learning (RL) has transformed large language models (LLMs) from single-turn question answerers into agents that repeatedly alternate between reasoning and external tool usage. A prevailing assumption is that reasoning and tool-use can be jointly optimized in a shared parameter space, yielding synergistic gains.
Empirical analysis, however, demonstrates a systematic negative interaction between these abilities. When a model learns both "how to reason" and "how to invoke tools" using shared parameters, performance on one task often degrades as the other improves, forming a seesaw effect. This phenomenon is consistently reproduced across multiple datasets (e.g., NQ, HotpotQA) and model scales, indicating that it is not an isolated anomaly.
Token‑level gradient analysis reveals that gradients from reasoning tokens and tool‑use tokens are almost orthogonal, with an angle close to 90°. This orthogonality implies that the two objectives pursue fundamentally different optima in parameter space, and joint updates converge to a compromised direction that is suboptimal for both.
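To make this measurement concrete, here is a minimal, self-contained PyTorch sketch of the procedure: compute the loss over each token group separately, backpropagate, and compare the flattened gradients. The toy model, random data, and the masks are placeholders standing in for the real model and token-type annotations; this is not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy stand-in for a causal LM so the demo runs end to end."""
    def __init__(self, vocab=100, d=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):
        return self.head(self.emb(ids))  # (batch, seq, vocab)

def token_group_grad(model, input_ids, labels, group_mask):
    """Flattened gradient of the LM loss restricted to one token group
    (the usual causal label shift is omitted for brevity)."""
    model.zero_grad()
    logits = model(input_ids)
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    )
    (per_token * group_mask.view(-1).float()).sum().backward()
    return torch.cat(
        [p.grad.flatten() for p in model.parameters() if p.grad is not None]
    )

torch.manual_seed(0)
model = TinyLM()
input_ids = torch.randint(0, 100, (2, 16))
labels = torch.randint(0, 100, (2, 16))
# Hypothetical masks: first half of each sequence = reasoning tokens,
# second half = tool-call tokens.
reasoning_mask = torch.zeros(2, 16)
reasoning_mask[:, :8] = 1
tool_mask = 1 - reasoning_mask

g_reason = token_group_grad(model, input_ids, labels, reasoning_mask)
g_tool = token_group_grad(model, input_ids, labels, tool_mask)
cos = F.cosine_similarity(g_reason, g_tool, dim=0).item()
print(f"cosine similarity: {cos:+.4f}  (near 0 means an angle near 90 degrees)")
```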
Q1: Why does gradient orthogonality arise?
In high‑dimensional spaces, random vectors are nearly orthogonal with high probability. Reasoning and tool‑use tokens originate from distinct data distributions and objectives, producing gradient directions that resemble random vectors. Consequently, their near‑orthogonal relationship is a geometric norm rather than a surprising outlier.
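This concentration effect is easy to check numerically. The snippet below samples pairs of Gaussian vectors and shows their cosine similarity shrinking toward zero, i.e., the angle approaching 90°, as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 1_000, 1_000_000):
    u, v = rng.standard_normal(d), rng.standard_normal(d)
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    print(f"dim={d:>9,}: cosine = {cos:+.4f}")
# Typical |cosine| is on the order of 1/sqrt(d), so the angle between
# independent random vectors concentrates around 90 degrees as d grows.
```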
Q2: Why doesn’t pre‑training exhibit the same conflict?
During pre‑training, all downstream tasks share the core language‑modeling objective, aligning gradients toward common low‑level linguistic improvements (lexical, syntactic, semantic). In contrast, post‑training for Agentic RL separates objectives: reasoning tokens aim to construct coherent thought chains, while tool tokens aim to generate accurate API calls and control‑flow decisions. The divergent goals generate distinct gradient subspaces, making orthogonal conflicts more likely.
Diagnostic Framework: LEAS
The authors introduce LEAS (Linear Effect Attribution System) to quantify interaction effects. By decomposing model capabilities into binary variables and adding interaction terms, they construct multiple model variants and solve a linear system to obtain interaction coefficients. Negative coefficients indicate interference between abilities.
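As a minimal illustration of the mechanics (not the paper's exact design), consider two binary ability indicators and four model variants. The four accuracy values below are invented placeholders, and the resulting linear system solves exactly for the main effects and the interaction coefficient:

```python
import numpy as np

# Binary indicators: (reasoning on?, tool-use on?) for four model variants.
variants = np.array([
    [0, 0],
    [1, 0],
    [0, 1],
    [1, 1],
])
# Placeholder accuracies for illustration only (not the paper's numbers).
scores = np.array([0.20, 0.35, 0.35, 0.42])

# Design matrix: intercept, reasoning, tool-use, reasoning x tool-use interaction.
r, t = variants[:, 0], variants[:, 1]
X = np.column_stack([np.ones(4), r, t, r * t])

beta = np.linalg.solve(X, scores)
print(f"interaction coefficient: {beta[3]:+.3f}")  # negative -> interference
```

Here the combined variant gains less than the sum of the individual effects (interaction = -0.08), which is exactly the interference signature a negative LEAS coefficient encodes.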
LEAS experiments confirm that, across tool‑enhanced QA benchmarks, the interaction term between reasoning and tool‑use is negative for almost every question, overturning the belief that shared parameters inherently produce synergy.
Proposed Solution: DART
DART (Disentangled Action‑Reasoning Tuning) addresses the conflict by freezing the original backbone and attaching two independent LoRA adapters—one for reasoning, one for tool use. A token‑level router directs gradients from reasoning tokens to the reasoning LoRA and tool tokens to the tool LoRA, achieving explicit gradient isolation.
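A minimal PyTorch sketch of this routing idea is below. It assumes a per-token binary mask separating tool tokens from reasoning tokens is available, e.g., derived from the agent's output template; the class and its details are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """Frozen base linear layer plus two LoRA adapters routed per token.

    Illustrative sketch: reasoning tokens update only the reasoning
    adapter, tool tokens only the tool adapter; the backbone is frozen.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # the shared backbone stays frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.scale = alpha / rank
        # One (A, B) pair per ability; B starts at zero so each delta starts at 0.
        self.A_reason = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_reason = nn.Parameter(torch.zeros(d_out, rank))
        self.A_tool = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_tool = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x, tool_mask):
        # x: (batch, seq, d_in); tool_mask: (batch, seq), 1 marks a tool token.
        delta_reason = (x @ self.A_reason.T) @ self.B_reason.T
        delta_tool = (x @ self.A_tool.T) @ self.B_tool.T
        m = tool_mask.unsqueeze(-1).float()
        # Each token is routed to exactly one adapter, so gradients never mix.
        return self.base(x) + self.scale * ((1 - m) * delta_reason + m * delta_tool)

# Usage on dummy data: the last 4 positions of each sequence are "tool tokens".
layer = DualLoRALinear(nn.Linear(64, 64))
x = torch.randn(2, 10, 64)
tool_mask = torch.zeros(2, 10)
tool_mask[:, 6:] = 1
out = layer(x, tool_mask)  # (2, 10, 64)
```

Because the mask zeroes the inactive branch, backpropagation from a reasoning token contributes nothing to the tool adapter and vice versa, realizing the explicit gradient isolation described above while the frozen backbone remains shared.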
This design diverges from traditional multi‑task learning approaches that rely on loss weighting or gradient projection. Instead of seeking a compromise subspace, DART allocates separate low‑rank subspaces for each ability, allowing independent convergence while preserving a single‑model architecture.
Empirical results show that DART yields stable, significant improvements on multiple tool-augmented QA benchmarks. For a 3B-parameter model, DART outperforms the Search-R1-GRPO baseline by over 6% in average exact match and achieves a nearly 30% relative gain on multi-hop reasoning tasks. Moreover, when retrieval results are held constant, DART still surpasses jointly trained models, indicating that the gains stem from unhindered reasoning capability rather than better retrieval.
Compared with a 2-Agent system, in which separate models handle reasoning and tool decisions, DART retains most of the performance advantage while avoiding the substantial engineering overhead of duplicated models (e.g., increased memory, context switching, KV-cache reconstruction). This makes DART especially valuable for real-world deployment.
Broader Implications
The work highlights a principle often overlooked in agent system design: not all capabilities benefit from shared parameter training. When gradient conflicts are systematic, disentangling abilities in parameter space can be more effective than complex reward shaping or gradient correction. DART also redefines LoRA’s role from a mere efficient fine‑tuning tool to a modular mechanism for capability isolation.
Overall, the study provides a new perspective on performance bottlenecks in Agentic RL, suggesting that structural conflicts between abilities—not model size or reward design—may limit progress, and that explicit parameter‑space disentanglement offers a promising path forward.
Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning
Link: https://arxiv.org/abs/2602.00994