How TACO Lets CLI Agents Self‑Evolve to Drop Useless Context
TACO is a plug-and-play, training-free framework that lets terminal-based autonomous agents automatically learn compression rules that filter low-value terminal output while preserving critical decision cues. The result is higher task success rates and better token efficiency across multiple terminal-related benchmarks.
Problem
Long-running CLI agents accumulate large terminal outputs (error messages, file paths, test names, build targets, dependency versions). As this output grows, the context becomes "dirty": the actionable signals needed for subsequent decisions are buried under low-value noise.
TACO Framework
TACO (Terminal Agent Compression) is a training‑free, plug‑and‑play rule engine that observes raw terminal output and automatically derives compression rules. It operates in three tightly coupled stages:
1. Terminal Output Compression
At each interaction step the agent executes a command, receives the raw output, and applies the current task's active rules. Outputs containing explicit errors, failures, or diagnostic information are preserved; repetitive low-value streams (installation progress, repeated test logs) are filtered out.
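A minimal Python sketch of this stage, assuming a hypothetical CompressionRule type and compress helper (the names and regexes are illustrative, not TACO's actual API):

```python
import re
from dataclasses import dataclass

@dataclass
class CompressionRule:
    name: str
    trigger: re.Pattern   # matches outputs this rule is responsible for
    keep: re.Pattern      # lines that must survive compression (errors, statuses)
    active: bool = True

def compress(output: str, rules: list[CompressionRule]) -> str:
    """Apply the first matching active rule; otherwise pass the output through."""
    for rule in rules:
        if rule.active and rule.trigger.search(output):
            kept = [line for line in output.splitlines() if rule.keep.search(line)]
            return "\n".join(kept) or "[compressed: no diagnostic lines]"
    return output

# Example of a rule the engine might hold for pip installs (illustrative only).
pip_rule = CompressionRule(
    name="pip install",
    trigger=re.compile(r"^(Collecting|Downloading|Installing collected packages)", re.M),
    keep=re.compile(r"error|warning|Successfully installed", re.I),
)
```

The first matching active rule decides which lines survive; when nothing matches, the raw output passes through untouched, which is exactly the situation the next stage handles.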
2. Intra‑Task Rule Set Evolution
When no active rule can handle an output, TACO synthesises a new rule from the observed pattern and adds it to the task’s active rule set. Signals of over‑compression (e.g., the agent re‑requests the full output or repeats a command) trigger de‑activation of the offending rule and generation of a more conservative alternative.
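Continuing the sketch above, the intra-task loop might look roughly like this. synthesise_rule is a trivial stand-in for the LLM call TACO uses to derive a rule from an unhandled output pattern, and the over-compression handler is an assumption about how the described recovery behaviour could be wired up:

```python
def synthesise_rule(output: str) -> CompressionRule:
    """Stand-in for the LLM step that derives a rule from an unhandled output."""
    head = output.splitlines()[0][:40] if output else ""
    return CompressionRule(
        name=f"auto:{head}",
        trigger=re.compile(re.escape(head)),
        keep=re.compile(r"error|fail|warning|exception", re.I),
    )

def process_output(output: str, active_rules: list[CompressionRule]) -> str:
    # No active rule handles this output -> synthesise one from the observed pattern.
    if not any(r.active and r.trigger.search(output) for r in active_rules):
        active_rules.append(synthesise_rule(output))
    return compress(output, active_rules)

def on_overcompression(rule: CompressionRule, active_rules: list[CompressionRule]) -> None:
    # Triggered when the agent re-requests the full output or repeats a command:
    # retire the offending rule and add a more conservative variant that keeps more lines.
    rule.active = False
    active_rules.append(CompressionRule(
        name=rule.name + " (conservative)",
        trigger=rule.trigger,
        keep=re.compile(rule.keep.pattern + r"|\S", re.I),
    ))
```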
3. Global Rule Pool Evolution
Effective rules discovered within a task are written back to a global rule pool. At the start of a new task TACO retrieves relevant rules from this pool to initialise the active set, enabling knowledge accumulation across diverse workflows.
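A rough sketch of the pool, still using the toy types above; the keyword-overlap retrieval is an assumption, not the paper's actual relevance criterion:

```python
class GlobalRulePool:
    """Toy stand-in for TACO's global rule pool."""

    def __init__(self) -> None:
        self.rules: dict[str, CompressionRule] = {}

    def write_back(self, task_rules: list[CompressionRule]) -> None:
        # Persist rules that were still active (i.e. effective) at the end of the task.
        for rule in task_rules:
            if rule.active:
                self.rules[rule.name] = rule

    def retrieve(self, task_description: str, k: int = 5) -> list[CompressionRule]:
        # Naive keyword-overlap relevance; the real retrieval strategy may differ.
        words = set(task_description.lower().split())
        def score(rule: CompressionRule) -> int:
            return len(words & set(rule.name.lower().split()))
        return sorted(self.rules.values(), key=score, reverse=True)[:k]

# At the start of a new task:  active_rules = pool.retrieve("install r-base and run tests")
# At the end of the task:      pool.write_back(active_rules)
```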
Static Compression Baselines
Seed Rules: a few manually crafted rules for high-output commands such as pip install, apt-get, and git clone (a sketch of such a static set appears after this list).
High-Quality Rules: an expanded, manually curated rule set covering a broader range of commands, still static.
LLM Summarize: prompting a large language model to summarise the terminal output directly.
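For contrast, the seed-rule baseline amounts to a fixed, hand-written set of rules in the same shape as the ones above, never updated during or across tasks (the rules below are illustrative guesses, not the paper's exact set):

```python
# Static baseline: hand-written seed rules, never evolved (illustrative only).
SEED_RULES = [
    pip_rule,  # the pip-install rule sketched earlier
    CompressionRule(
        name="apt-get install",
        trigger=re.compile(r"^Get:\d+ ", re.M),
        keep=re.compile(r"^(E:|W:|Err:|Setting up )", re.M),
    ),
    CompressionRule(
        name="git clone",
        trigger=re.compile(r"^Cloning into ", re.M),
        keep=re.compile(r"fatal:|error:|warning:", re.I),
    ),
]
```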
Experimental Evaluation
TACO was evaluated on TerminalBench 1.0, TerminalBench 2.0, and additional terminal‑related benchmarks (SWE‑Bench Lite, CompileBench, DevEval, CRUST‑Bench). Across all settings TACO consistently improved task success rates while reducing token consumption. Inserting TACO into Terminus‑2 yielded stable gains for strong models such as Qwen3‑Coder‑480B, DeepSeek‑V3.2, and MiniMax‑M2.5.
When token budgets were fixed, TACO still achieved higher accuracy than the baselines, indicating that the improvement stems from higher information density rather than more interaction steps.
Convergence of the self‑evolving process was measured with Retention – the overlap of the top‑K rules between consecutive evolution rounds. Retention values above 90 % correlated with a sharp drop in the rolling standard deviation of task accuracy, showing that a stable rule pool aligns with stable performance.
Case Studies
Installation log compression: In the apt-get install -y r-base task the raw output exceeded 10 000 characters. TACO evolved a rule that reduced the output to 73 characters, keeping only the installation status and error signals.
SQLite with gcov: The make output contained massive file-copy lists and long compile commands. TACO removed the copy lists but retained the coverage-related flags (-fprofile-arcs, -ftest-coverage) essential for downstream decisions.
Binary reverse-engineering (vulnerable-secret): For objdump output, TACO filtered repetitive hex-dump lines while preserving call instructions, symbol labels, and key addresses needed for control-flow analysis.
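As an illustration of this last case, a rule of the kind TACO might evolve for objdump output could look like the following; the regexes are guesses at the behaviour described above, not a reproduction of the learned rule:

```python
# Drop raw hex-dump lines; keep call instructions and symbol labels (with their
# addresses), which are what control-flow analysis actually needs.
objdump_rule = CompressionRule(
    name="objdump disassembly",
    trigger=re.compile(r"Disassembly of section", re.M),
    keep=re.compile(r"\bcall\b|<[^>]+>:", re.M),
)
```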
Analysis of Token vs. Accuracy
Static methods reduced token cost but showed unstable performance. The LLM‑Summarize baseline achieved the lowest token cost but a noticeable drop in accuracy. TACO’s token cost was higher than LLM‑Summarize but yielded the highest accuracy and the smallest variance, demonstrating that effective compression requires preserving critical decision‑making signals rather than aggressive shortening.
Cross‑Benchmark Generalisation
On SWE‑Bench Lite, DevEval, and CRUST‑Bench TACO improved accuracy while lowering total token consumption; on CompileBench accuracy remained unchanged but token usage dropped markedly. This indicates that the learned rules capture reusable compression patterns across heterogeneous terminal workflows.
Rule‑Stability as Convergence Signal
Instead of using test‑set accuracy (which would leak evaluation data), TACO monitors the stability of the Global Rule Pool. Retention is computed as the proportion of top‑K rules that remain unchanged between successive rounds. When Retention stabilises above 90 %, the rolling standard deviation of task accuracy also declines, providing a practical stopping criterion for self‑evolution.
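A minimal sketch of the metric, assuming the top-K rules are identified by name (how rules are ranked to select the top K is the paper's own criterion and not reproduced here):

```python
def retention(prev_top_k: list[str], curr_top_k: list[str]) -> float:
    """Fraction of the current round's top-K rules that also appeared in the previous round's top-K."""
    if not curr_top_k:
        return 0.0
    return len(set(prev_top_k) & set(curr_top_k)) / len(curr_top_k)

# Example: 9 of the top-10 rules unchanged between rounds -> retention == 0.9
```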
Conclusion
TACO provides a training‑free, self‑evolving observation compression solution that enables CLI agents to automatically learn which terminal outputs can be safely discarded and which must be retained. By focusing on high‑value signals rather than raw token count, TACO improves both success rates and token efficiency across a wide range of benchmarks, demonstrating the practicality of self‑evolving rule‑based compression for long‑running autonomous agents.
For full details see the arXiv paper (arXiv:2604.19572), the Hugging Face Daily Paper entry, and the open‑source implementation at https://github.com/multimodal-art-projection/TACO.