How TACO Lets CLI Agents Self‑Evolve to Drop Useless Context
TACO is a plug-and-play, training-free framework that lets terminal-based autonomous agents automatically learn compression rules that filter low-value terminal output while preserving critical decision cues. The result is higher task success rates and better token efficiency across multiple terminal-related benchmarks.
Problem
Long-running CLI agents accumulate large terminal outputs (error messages, file paths, test names, build targets, dependency versions). As this output grows, the context becomes "dirty": the actionable signals needed for subsequent decisions are buried under low-value noise.
TACO Framework
TACO (Terminal Agent Compression) is a training‑free, plug‑and‑play rule engine that observes raw terminal output and automatically derives compression rules. It operates in three tightly coupled stages:
1. Terminal Output Compression
At each interaction step the agent executes a command, receives the raw output, and applies the current task's active rules. Outputs containing explicit errors, failures, or diagnostic information are preserved; repetitive low-value streams (installation progress, repeated test logs) are filtered out.
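A minimal Python sketch of this stage, assuming a hypothetical CompressionRule type and compress helper (the names and regexes are illustrative, not TACO's actual API):

```python
import re
from dataclasses import dataclass

@dataclass
class CompressionRule:
    name: str
    trigger: re.Pattern   # matches outputs this rule is responsible for
    keep: re.Pattern      # lines that must survive compression (errors, statuses)
    active: bool = True

def compress(output: str, rules: list[CompressionRule]) -> str:
    """Apply the first matching active rule; otherwise pass the output through."""
    for rule in rules:
        if rule.active and rule.trigger.search(output):
            kept = [line for line in output.splitlines() if rule.keep.search(line)]
            return "\n".join(kept) or "[compressed: no diagnostic lines]"
    return output

# Example of a rule the engine might hold for pip installs (illustrative only).
pip_rule = CompressionRule(
    name="pip install",
    trigger=re.compile(r"^(Collecting|Downloading|Installing collected packages)", re.M),
    keep=re.compile(r"error|warning|Successfully installed", re.I),
)
```

The first matching active rule decides which lines survive; when nothing matches, the raw output passes through untouched, which is exactly the situation the next stage handles.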
2. Intra‑Task Rule Set Evolution
When no active rule can handle an output, TACO synthesises a new rule from the observed pattern and adds it to the task’s active rule set. Signals of over‑compression (e.g., the agent re‑requests the full output or repeats a command) trigger de‑activation of the offending rule and generation of a more conservative alternative.
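Continuing the sketch above, the intra-task loop might look roughly like this. synthesise_rule is a trivial stand-in for the LLM call TACO uses to derive a rule from an unhandled output pattern, and the over-compression handler is an assumption about how the described recovery behaviour could be wired up:

```python
def synthesise_rule(output: str) -> CompressionRule:
    """Stand-in for the LLM step that derives a rule from an unhandled output."""
    head = output.splitlines()[0][:40] if output else ""
    return CompressionRule(
        name=f"auto:{head}",
        trigger=re.compile(re.escape(head)),
        keep=re.compile(r"error|fail|warning|exception", re.I),
    )

def process_output(output: str, active_rules: list[CompressionRule]) -> str:
    # No active rule handles this output -> synthesise one from the observed pattern.
    if not any(r.active and r.trigger.search(output) for r in active_rules):
        active_rules.append(synthesise_rule(output))
    return compress(output, active_rules)

def on_overcompression(rule: CompressionRule, active_rules: list[CompressionRule]) -> None:
    # Triggered when the agent re-requests the full output or repeats a command:
    # retire the offending rule and add a more conservative variant that keeps more lines.
    rule.active = False
    active_rules.append(CompressionRule(
        name=rule.name + " (conservative)",
        trigger=rule.trigger,
        keep=re.compile(rule.keep.pattern + r"|\S", re.I),
    ))
```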
3. Global Rule Pool Evolution
Effective rules discovered within a task are written back to a global rule pool. At the start of a new task TACO retrieves relevant rules from this pool to initialise the active set, enabling knowledge accumulation across diverse workflows.
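A rough sketch of the pool, still using the toy types above; the keyword-overlap retrieval is an assumption, not the paper's actual relevance criterion:

```python
class GlobalRulePool:
    """Toy stand-in for TACO's global rule pool."""

    def __init__(self) -> None:
        self.rules: dict[str, CompressionRule] = {}

    def write_back(self, task_rules: list[CompressionRule]) -> None:
        # Persist rules that were still active (i.e. effective) at the end of the task.
        for rule in task_rules:
            if rule.active:
                self.rules[rule.name] = rule

    def retrieve(self, task_description: str, k: int = 5) -> list[CompressionRule]:
        # Naive keyword-overlap relevance; the real retrieval strategy may differ.
        words = set(task_description.lower().split())
        def score(rule: CompressionRule) -> int:
            return len(words & set(rule.name.lower().split()))
        return sorted(self.rules.values(), key=score, reverse=True)[:k]

# At the start of a new task:  active_rules = pool.retrieve("install r-base and run tests")
# At the end of the task:      pool.write_back(active_rules)
```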
Static Compression Baselines
Seed Rules: a few manually crafted rules for high-output commands such as pip install, apt-get, and git clone (a sketch of such a static set appears after this list).
High-Quality Rules: an expanded, manually curated rule set covering a broader range of commands, still static.
LLM Summarize: prompting a large language model to summarise the terminal output directly.
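For contrast, the seed-rule baseline amounts to a fixed, hand-written set of rules in the same shape as the ones above, never updated during or across tasks (the rules below are illustrative guesses, not the paper's exact set):

```python
# Static baseline: hand-written seed rules, never evolved (illustrative only).
SEED_RULES = [
    pip_rule,  # the pip-install rule sketched earlier
    CompressionRule(
        name="apt-get install",
        trigger=re.compile(r"^Get:\d+ ", re.M),
        keep=re.compile(r"^(E:|W:|Err:|Setting up )", re.M),
    ),
    CompressionRule(
        name="git clone",
        trigger=re.compile(r"^Cloning into ", re.M),
        keep=re.compile(r"fatal:|error:|warning:", re.I),
    ),
]
```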
Experimental Evaluation
TACO was evaluated on TerminalBench 1.0, TerminalBench 2.0, and additional terminal‑related benchmarks (SWE‑Bench Lite, CompileBench, DevEval, CRUST‑Bench). Across all settings TACO consistently improved task success rates while reducing token consumption. Inserting TACO into Terminus‑2 yielded stable gains for strong models such as Qwen3‑Coder‑480B, DeepSeek‑V3.2, and MiniMax‑M2.5.
When token budgets were fixed, TACO still achieved higher accuracy than the baselines, indicating that the improvement stems from higher information density rather than more interaction steps.
Convergence of the self‑evolving process was measured with Retention – the overlap of the top‑K rules between consecutive evolution rounds. Retention values above 90 % correlated with a sharp drop in the rolling standard deviation of task accuracy, showing that a stable rule pool aligns with stable performance.
Case Studies
Installation log compression: In the apt-get install -y r-base task the raw output exceeded 10 000 characters. TACO evolved a rule that reduced the output to 73 characters, keeping only the installation status and error signals.
SQLite with gcov: The make output contained massive file-copy lists and long compile commands. TACO removed the copy lists but retained the coverage-related flags (-fprofile-arcs, -ftest-coverage) essential for downstream decisions.
Binary reverse-engineering (vulnerable-secret): For objdump output, TACO filtered repetitive hex-dump lines while preserving call instructions, symbol labels, and key addresses needed for control-flow analysis.
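As an illustration of this last case, a rule of the kind TACO might evolve for objdump output could look like the following; the regexes are guesses at the behaviour described above, not a reproduction of the learned rule:

```python
# Drop raw hex-dump lines; keep call instructions and symbol labels (with their
# addresses), which are what control-flow analysis actually needs.
objdump_rule = CompressionRule(
    name="objdump disassembly",
    trigger=re.compile(r"Disassembly of section", re.M),
    keep=re.compile(r"\bcall\b|<[^>]+>:", re.M),
)
```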
Analysis of Token vs. Accuracy
Static methods reduced token cost but showed unstable performance. The LLM‑Summarize baseline achieved the lowest token cost but a noticeable drop in accuracy. TACO’s token cost was higher than LLM‑Summarize but yielded the highest accuracy and the smallest variance, demonstrating that effective compression requires preserving critical decision‑making signals rather than aggressive shortening.
Cross‑Benchmark Generalisation
On SWE‑Bench Lite, DevEval, and CRUST‑Bench TACO improved accuracy while lowering total token consumption; on CompileBench accuracy remained unchanged but token usage dropped markedly. This indicates that the learned rules capture reusable compression patterns across heterogeneous terminal workflows.
Rule‑Stability as Convergence Signal
Instead of using test‑set accuracy (which would leak evaluation data), TACO monitors the stability of the Global Rule Pool. Retention is computed as the proportion of top‑K rules that remain unchanged between successive rounds. When Retention stabilises above 90 %, the rolling standard deviation of task accuracy also declines, providing a practical stopping criterion for self‑evolution.
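A minimal sketch of the metric, assuming the top-K rules are identified by name (how rules are ranked to select the top K is the paper's own criterion and not reproduced here):

```python
def retention(prev_top_k: list[str], curr_top_k: list[str]) -> float:
    """Fraction of the current round's top-K rules that also appeared in the previous round's top-K."""
    if not curr_top_k:
        return 0.0
    return len(set(prev_top_k) & set(curr_top_k)) / len(curr_top_k)

# Example: 9 of the top-10 rules unchanged between rounds -> retention == 0.9
```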
Conclusion
TACO provides a training‑free, self‑evolving observation compression solution that enables CLI agents to automatically learn which terminal outputs can be safely discarded and which must be retained. By focusing on high‑value signals rather than raw token count, TACO improves both success rates and token efficiency across a wide range of benchmarks, demonstrating the practicality of self‑evolving rule‑based compression for long‑running autonomous agents.
For full details see the arXiv paper (arXiv:2604.19572), the Hugging Face Daily Paper entry, and the open‑source implementation at https://github.com/multimodal-art-projection/TACO.