The Bitter Lesson of Building Agentic RL in Terminal Environments
This article recounts the challenges of moving from single‑step RL with verifiable rewards to multi‑step agentic reinforcement learning in terminal environments, detailing infrastructure design, asynchronous pipelines, data quality checks, masking strategies, curriculum training, chunk‑based optimization, and practical lessons learned from large‑scale experiments.
Motivation
Traditional RL with verifiable rewards (RLVR) treats each answer generation as a single‑step bandit problem. Agentic RL requires a model to act over multiple steps, observe a changing terminal environment, and assign credit across long horizons under sparse delayed rewards.
Environment Manager
We built an environment manager inside the ROLL framework that cleanly separates three components:
ROLL – training framework that drives the rollout loop (reset → decision → execution → termination) and stores trajectory data.
iFlow CLI – agent framework that maintains session state, tool‑call history and provides an API for the model.
ROCK – sandbox manager that creates sandbox sessions, runs commands, uploads files, and handles lifecycle events.
Two operating modes are supported:
Roll‑Managed Mode: ROLL handles context construction and interacts with the sandbox via tool‑call interfaces (components: TrajEnvManagerTB, TerminalBenchEnv, SandboxManager, IFlowCLITool).
CLI‑Native Mode: The iFlow CLI owns all context, sessions and history; ROLL only invokes the CLI through a lightweight ModelProxy Service that provides asynchronous queue‑based messaging.
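For concreteness, here is a minimal sketch of the Roll‑Managed rollout loop (reset → decision → execution → termination). The method names (reset, decide, execute, compute_reward) and the max_steps cap are illustrative assumptions, not the actual ROLL or TerminalBenchEnv API.

def run_rollout(env, policy, max_steps=50):
    # Hypothetical loop; the real component interfaces live in TrajEnvManagerTB / TerminalBenchEnv.
    trajectory = []
    observation = env.reset()                          # fresh sandbox session plus task instruction
    for _ in range(max_steps):
        action = policy.decide(observation)             # model proposes the next tool call or command
        observation, done, info = env.execute(action)   # sandbox runs it and returns the new state
        trajectory.append((action, observation, info))
        if done:                                        # task solved, failed, or environment terminated
            break
    reward = env.compute_reward()                       # tests are uploaded and run only at final evaluation
    return trajectory, reward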
Asynchronous Training Pipeline
Agentic rollouts exhibit long‑tail latency: a few rollouts take much longer than the rest because of very long generations or slow environment interactions. To avoid straggler bottlenecks we implement a fully asynchronous pipeline:
Environment‑level asynchronous rollout – decouples LLM generation, environment interaction and reward computation.
Redundant parallel environments – increase the number and size of environment groups so that fail‑slow or fail‑stop environments cannot stall a rollout batch.
Asynchronous training – rollout and gradient computation run on separate devices.
Train‑rollout reuse – time‑division multiplexing shares GPU resources between inference and training.
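A rough sketch of the environment‑level asynchronous rollout, assuming an asyncio‑based collector and a hypothetical run_rollout_async helper; the real ROLL scheduler is more elaborate, but the idea is that finished trajectories are consumed as they arrive, so long‑tail rollouts never block the batch.

import asyncio

async def rollout_worker(env, policy, out_queue):
    # One worker per environment; slow environments simply finish later (or never).
    trajectory, reward = await run_rollout_async(env, policy)    # hypothetical async rollout helper
    await out_queue.put((trajectory, reward))

async def collect_batch(envs, policy, batch_size):
    out_queue = asyncio.Queue()
    tasks = [asyncio.create_task(rollout_worker(e, policy, out_queue)) for e in envs]
    batch = [await out_queue.get() for _ in range(batch_size)]   # take whichever rollouts finish first
    for t in tasks:
        t.cancel()   # redundant environments: leftover stragglers are dropped, not waited on
    return batch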
Data Quality and Filtering
Training instances come from large‑scale synthetic generation and expert‑written tasks. Early synthetic data contained ~40% false positives (incomplete or incorrect test scripts). We mitigate this with an LLM‑as‑judge module that reviews each instruction–test pair and discards high‑risk samples.
Before adding an instance to the RL pool we perform:
Ground‑truth validation : discard if the golden solution cannot pass all tests.
No‑op validation : discard if the task can be solved without any meaningful action.
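A minimal sketch of these two checks, assuming a sandbox object with reset/apply/run_tests methods and an instance record carrying a golden solution and tests (both are illustrative, not the actual schema):

def admit_instance(instance, sandbox):
    # Ground-truth validation: the golden solution must pass every test.
    sandbox.reset(instance)
    sandbox.apply(instance.golden_solution)
    if not sandbox.run_tests(instance.tests).all_passed:
        return False
    # No-op validation: the tests must NOT already pass without any meaningful action.
    sandbox.reset(instance)
    if sandbox.run_tests(instance.tests).all_passed:
        return False
    return True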
Environment Cleanliness and Augmentation
Even tiny artifacts (temporary files, cached packages) can leak information. We therefore clean intermediate files before each rollout and upload test files only during final evaluation. To improve robustness we deliberately randomize the initial environment (different package versions, mirrors, configuration) and sometimes perturb or break the environment so the agent must diagnose and recover.
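A sketch of this pre‑rollout preparation step; the sandbox API, mirror names, and 10% perturbation probability below are illustrative assumptions:

import random

def prepare_environment(sandbox, instance, rng=random):
    # Clean intermediate artifacts so nothing about the solution leaks into the rollout.
    sandbox.run("rm -rf /tmp/* ~/.cache/pip")
    # Test files are uploaded only at final evaluation, never here.
    # Randomize the starting state to improve robustness.
    sandbox.set_pip_mirror(rng.choice(["mirror-a", "mirror-b", "default"]))
    sandbox.install(instance.base_package, version=rng.choice(instance.candidate_versions))
    # Occasionally break the environment so the agent must diagnose and recover.
    if rng.random() < 0.1:
        sandbox.run(f"pip uninstall -y {instance.base_package}")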
Mask & Filter Strategy
Terminal environments produce transient network failures, sandbox start‑up failures, and tool timeouts. We handle the resulting harmful samples with two mechanisms:
Masking – for unrecoverable or large‑scale errors (e.g., env_init_failed, sandbox_unavailable) we replace the rollout with a placeholder whose gradients are zeroed out.
Filtering – for recoverable errors (e.g., tool timeout) we drop the sample, capping the global filter ratio at 50% to avoid over‑filtering.
def handle_rollout_with_mask(rollout, failure_type):
    # Unrecoverable or large-scale failures: keep the sample's shape but zero out every learning signal.
    if failure_type in {"env_init_failed", "sandbox_unavailable", "env_reset_failed", "reward_calculation_failed"}:
        placeholder = create_placeholder_rollout()
        placeholder.response_mask[:] = 0
        placeholder.advantages[:] = 0
        placeholder.rewards[:] = 0
        placeholder.meta["masked"] = True
        return placeholder
    return rollout

class GroupFilterTB:
    def __init__(self, config, env_manager_config, mode: str):
        self.global_filter_stats = {"total": 0, "filtered": 0}

    def filter(self, group_id, episode_id, group):
        # Recoverable failures (e.g., tool timeouts): drop the group, but never let the
        # global drop ratio exceed 50% to avoid over-filtering.
        self.global_filter_stats["total"] += 1
        should_drop = any(d.meta_info.get("drop_flag", False) for d in group)
        if not should_drop:
            return False
        ratio = self.global_filter_stats["filtered"] / max(self.global_filter_stats["total"], 1)
        if ratio >= 0.5:
            return False
        self.global_filter_stats["filtered"] += 1
        return True

Curriculum‑Style Training
We start with only positive‑sample trajectories to learn a stable policy manifold. Once a small high‑quality expert dataset is available we gradually introduce negative trajectories to improve generalization.
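One way to implement this schedule is a step‑dependent negative‑sample ratio; the warm‑up length, ramp length, and 0.3 cap below are assumed hyperparameters, not values from our runs:

def negative_sample_ratio(step, warmup_steps=2000, ramp_steps=4000, max_ratio=0.3):
    # Positive-only phase: learn a stable policy manifold first.
    if step < warmup_steps:
        return 0.0
    # Then gradually introduce negative trajectories to improve generalization.
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return max_ratio * progress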
Chunked MDP and Interaction‑Perceptive Agentic Policy Optimization (IPA)
Instead of token‑level optimization we treat each interaction chunk (from one tool call to the next) as a semantic action unit. IPA computes returns and importance sampling at the chunk level, masks entire chunks when the policy deviates, and mixes imitation learning with RL to handle difficult tasks.
Chunk‑level return estimation provides more stable gradients for long horizons.
Chunk masking aligns with coarse‑grained reward structures.
Chunk re‑sampling and mixed IL+RL broaden learning on hard instances.
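The sketch below illustrates the chunk‑level idea in PyTorch: all tokens between consecutive tool calls share one importance ratio, and chunks whose ratio drifts too far from the behavior policy are masked entirely. It is an illustration of the mechanism, not the actual IPA implementation, and the clipping bounds are assumptions.

import torch

def chunk_importance_ratios(logp_new, logp_old, chunk_ids):
    # logp_new, logp_old: per-token log-probs, shape [T]; chunk_ids: LongTensor chunk index per token, shape [T].
    num_chunks = int(chunk_ids.max().item()) + 1
    delta = torch.zeros(num_chunks).scatter_add_(0, chunk_ids, logp_new - logp_old)
    ratios = torch.exp(delta)          # one importance ratio per chunk
    return ratios[chunk_ids]           # broadcast back to token level for the loss

def mask_deviating_chunks(token_ratios, low=0.5, high=2.0):
    # Mask entire chunks whose policy ratio has drifted too far.
    return ((token_ratios > low) & (token_ratios < high)).float()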
Why Agentic RL Is Hard
Long‑tail distribution of extremely long failure trajectories that dominate gradients.
Shallow policies that exploit shortcuts rather than truly solving the task.
Noisy failures caused by environment randomness or system‑level interference.
We address these with selective trajectory masking, token masking, trajectory‑level reweighting, retry‑loop penalties, and other lightweight behavior‑shaping rewards.
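As one example of such behavior shaping, a retry‑loop penalty can be as simple as checking whether the last few tool calls are identical; the window size and penalty magnitude here are assumptions:

def retry_loop_penalty(tool_calls, window=3, penalty=0.05):
    # tool_calls: chronological list of (tool name, arguments) strings for one trajectory.
    recent = tool_calls[-window:]
    if len(recent) == window and len(set(recent)) == 1:
        return -penalty    # the agent is stuck repeating the same failing action
    return 0.0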
Crash Is Normal – Resuming Training
When a few extreme failure trajectories dominate updates (e.g., responses >20k tokens) we mask them and reduce their weight. If negative samples become dominant later we globally re‑weight them to keep their contribution below a threshold.
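A sketch of both stabilizers; the 20k‑token cutoff comes from the text, while the trajectory fields and the 0.3 negative‑weight cap are assumptions for illustration:

def stabilize_batch(trajectories, max_tokens=20000, neg_weight_cap=0.3):
    # Each trajectory is assumed to carry num_response_tokens, reward, and a loss weight.
    for traj in trajectories:
        if traj.num_response_tokens > max_tokens:
            traj.weight = 0.0                          # mask extreme failure trajectories outright
    neg = [t for t in trajectories if t.reward <= 0 and t.weight > 0]
    pos = [t for t in trajectories if t.reward > 0 and t.weight > 0]
    neg_share = len(neg) / max(len(neg) + len(pos), 1)
    if neg_share > neg_weight_cap:
        scale = neg_weight_cap / neg_share             # globally down-weight dominant negatives
        for t in neg:
            t.weight *= scale
    return trajectories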
Fine‑Grained Monitoring
We continuously track per‑task success rates, tool success/failure rates, repeated tool‑call loops, command frequencies, etc. Sudden spikes trigger rollbacks or instance removal.
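A minimal monitoring sketch: per‑task success rates over a sliding window, flagging tasks whose recent rate collapses relative to the window average. The window size and threshold are illustrative.

from collections import defaultdict, deque

class TaskMonitor:
    def __init__(self, window=100, drop_threshold=0.3):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.drop_threshold = drop_threshold

    def record(self, task_id, success: bool):
        self.history[task_id].append(1.0 if success else 0.0)

    def flagged_tasks(self):
        # Flag tasks whose recent success rate collapsed; these trigger rollback or instance removal.
        flagged = []
        for task_id, results in self.history.items():
            if len(results) == results.maxlen:
                recent = sum(list(results)[-20:]) / 20
                overall = sum(results) / len(results)
                if overall - recent > self.drop_threshold:
                    flagged.append(task_id)
        return flagged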
Parallel Function Calls
Models such as Claude‑Sonnet‑4.5 frequently issue multiple parallel “check” calls (e.g., pwd, ls, cat, python -V, pip list, read_file, search) before taking action, demonstrating the value of a pre‑action information‑gathering phase.
Common Failure Modes
Unproductive loops where the agent repeats the same failing strategy.
Timeouts caused by poor perception of long‑running commands.
Hallucinations, inappropriate tool selection, and constraint violations.
Future Directions
Agentic RL in terminal environments is essentially a partially observable MDP with long credit‑assignment horizons. Future work should explore:
More complex long‑horizon tasks and richer agentic behavior patterns.
Closed‑loop optimization involving agents, environments, and humans (online RL / human‑in‑the‑loop).
Open, evolving sandbox environments and reward signals that leverage intermediate tool feedback.
Stronger infrastructure for high‑concurrency, low‑latency sandbox execution and scalable training frameworks.
References
RL training framework: https://github.com/alibaba/ROLL
Sandbox manager: https://github.com/alibaba/ROCK
Agent framework: https://github.com/iflow-ai/iflow-cli
Benchmarks: https://github.com/alibaba/terminal-bench-pro