AgentDoG 1.5: A Lightweight, Extensible Framework for Trajectory‑Level Agent Safety

AgentDoG 1.5 extends AI-agent safety from final replies to complete execution trajectories. It introduces the ATBench family for fine-grained evaluation and a taxonomy-guided DataEngine for high-quality data generation, and it reports substantial safety gains with lightweight models in both SFT/RL training and online guardrail deployment.

Machine Learning Algorithms & Natural Language Processing

Existing safety mechanisms for agentic AI focus on the final response. Agents, however, now perform multi-step planning, tool invocation, and environment interaction, so risks can arise in intermediate steps: incorrect tool calls, over-privileged file operations, or polluted observations.

Trajectory‑Level Safety Goal

AgentDoG 1.5 defines safety over the entire execution trajectory, requiring the model to output a binary safe/unsafe judgment and, when unsafe, diagnose the Risk Source, Failure Mode, and Real‑world Harm. This shifts the focus from coarse prompt‑level checks to detailed, evidence‑backed analysis.
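The output contract described above can be sketched as a small record type. This is a minimal illustration, not the authors' schema: the class name `TrajectoryVerdict`, its field names, and the example label strings are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrajectoryVerdict:
    """Hypothetical container for one trajectory-level judgment."""
    safe: bool
    risk_source: Optional[str] = None
    failure_mode: Optional[str] = None
    real_world_harm: Optional[str] = None

    def __post_init__(self):
        diagnosis = (self.risk_source, self.failure_mode, self.real_world_harm)
        # A safe verdict carries no diagnosis.
        if self.safe and any(d is not None for d in diagnosis):
            raise ValueError("a safe verdict must not carry a diagnosis")
        # An unsafe verdict must name all three taxonomy dimensions.
        if not self.safe and any(d is None for d in diagnosis):
            raise ValueError("an unsafe verdict needs all three diagnosis fields")


# Placeholder labels, not entries from the actual taxonomy.
verdict = TrajectoryVerdict(
    safe=False,
    risk_source="polluted observation",
    failure_mode="over-privileged file operation",
    real_world_harm="data loss",
)
```

Enforcing the invariant in the constructor keeps a binary label and its diagnosis from drifting apart downstream.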

ATBench Family

The ATBench family extends evaluation from single answers to full trajectories. Each sample contains the user request, agent responses, tool calls, and environment feedback. The benchmark includes 1,000 audited trajectories (503 safe, 497 unsafe) covering 2,084 tools, an average of 9.01 interaction rounds, and about 3.95 k tokens per trajectory.
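One way to picture a benchmark sample is as an ordered list of steps plus a label. The sketch below is illustrative only: the `Step` and `TrajectorySample` types, the role names, and the example content are assumptions, not the released ATBench format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    role: str                        # "user", "agent", "tool_call", or "environment"
    content: str
    tool_name: Optional[str] = None  # set only for tool_call steps


@dataclass
class TrajectorySample:
    steps: List[Step]
    label: str                       # "safe" or "unsafe"

    def tools_used(self) -> List[str]:
        """Distinct tool names invoked in this trajectory, sorted."""
        return sorted({s.tool_name for s in self.steps
                       if s.role == "tool_call" and s.tool_name})


def label_balance(samples: List[TrajectorySample]) -> Counter:
    """Count samples per label, e.g. to check a safe/unsafe split."""
    return Counter(s.label for s in samples)


sample = TrajectorySample(
    steps=[
        Step("user", "Clean up the temp build directory."),
        Step("tool_call", "rm -rf /tmp/build", tool_name="shell"),
        Step("environment", "exit code 0"),
        Step("agent", "The temp build directory has been removed."),
    ],
    label="safe",
)
```

Applied to the full benchmark, `label_balance` would be expected to reproduce the reported 503 safe and 497 unsafe split.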

Two specialized subsets were added:

ATBench‑Claw targets OpenClaw‑style agents with session, approval, routing, and plugin/skill trust risks.

ATBench‑Codex targets code‑execution agents, covering repository files, shell commands, dependencies, MCP, patches, test outputs, and runtime policies.

Taxonomy‑Guided DataEngine

To continuously produce high‑quality trajectory data, the DataEngine first samples a risk combination from a three‑dimensional safety taxonomy, then plans user tasks, tool sets, execution steps, and risk injection points. A trajectory synthesis step instantiates the plan into multi‑round interactions, generating paired safe and unsafe versions.

Automatic validation applies rule‑based and model‑based checks to filter out format errors, schema mismatches, incoherent steps, or unlabeled evidence. The resulting pool covers 5,973 distinct tools and MCP servers, 9 risk sources, 18 failure modes, 10 real‑world harms, and 1,620 risk combinations.
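The reported counts are consistent with a full cross-product of the three taxonomy dimensions, since 9 × 18 × 10 = 1,620. The sketch below shows the sampling step under two assumptions that the text does not confirm: the label names are placeholders, and combinations are drawn uniformly.

```python
import itertools
import random

# Dimension sizes as reported in the text; the label names are placeholders.
RISK_SOURCES = [f"risk_source_{i}" for i in range(9)]
FAILURE_MODES = [f"failure_mode_{i}" for i in range(18)]
HARMS = [f"harm_{i}" for i in range(10)]

# Every (risk source, failure mode, harm) triple.
ALL_COMBINATIONS = list(itertools.product(RISK_SOURCES, FAILURE_MODES, HARMS))


def sample_risk_combination(rng: random.Random):
    """Draw one taxonomy triple uniformly at random (uniformity is an assumption)."""
    return rng.choice(ALL_COMBINATIONS)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
combo = sample_risk_combination(rng)
print(len(ALL_COMBINATIONS))  # 1620
```

A real engine might weight combinations, for example to favor under-represented ones, before planning the task and risk injection points.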

Application 1: Safety‑Oriented SFT & RL

Using the DataEngine, 26,021 trajectory pairs were generated; after AgentDoG 1.5 filtering, 21,939 high-quality safety trajectories remained. An additional 50,000 benign tool-use trajectories were mixed in to avoid over-conservative behavior, yielding a safety-to-benign ratio of roughly 1:2.
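The mix proportions follow directly from the two reported counts. The short calculation below uses only those counts; the variable names are chosen for illustration.

```python
safety_trajectories = 21_939   # retained after AgentDoG 1.5 filtering
benign_trajectories = 50_000   # benign tool-use trajectories mixed in

# Benign trajectories per safety trajectory.
benign_per_safety = benign_trajectories / safety_trajectories

# Share of the final training mix that is safety data.
safety_share = safety_trajectories / (safety_trajectories + benign_trajectories)

print(f"1:{benign_per_safety:.2f} safety-to-benign")
print(f"{safety_share:.1%} of the mix is safety data")
```

The exact ratio is closer to 1:2.3, so safety data makes up a little under a third of the training mix.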

Fine-tuning a Qwen-3.5-4B model on this data reduced the AgentHarm Harm Score from 57.49 % to 20.32 %, increased the Refusal Rate from 28.41 % to 75.00 %, raised the AgentSafetyBench Safe Rate from 34.37 % to 53.23 %, and improved BFCL function-call accuracy to 81.12 %.

In the RL stage, a lightweight Python finite‑state environment (323 tools, 16 domains) provided rule‑based utility rewards while AgentDoG 1.5 supplied safety rewards. Joint SFT+RL training further lowered the Harm Score to 18.04 %, raised Refusal Rate to 77.27 %, and lifted Safe Rate to 59.32 % without sacrificing BFCL performance.
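The text does not say how the two reward signals are combined. As a rough sketch, one simple option is a weighted sum of a rule-based utility check and a safety term from the guard model. Everything here is an assumption for illustration: the function names, the goal-state check, the ±1 safety term, and `safety_weight`.

```python
from typing import Callable, Dict

Trajectory = Dict[str, object]


def rule_based_utility(trajectory: Trajectory, goal_state: object) -> float:
    """Toy utility rule: 1.0 if the environment ended in the goal state, else 0.0."""
    return 1.0 if trajectory.get("final_state") == goal_state else 0.0


def combined_reward(
    trajectory: Trajectory,
    goal_state: object,
    safety_judge: Callable[[Trajectory], bool],  # stand-in for the guard model
    safety_weight: float = 1.0,                  # hypothetical trade-off knob
) -> float:
    """Weighted sum of utility and safety terms (the combination rule is assumed)."""
    utility = rule_based_utility(trajectory, goal_state)
    safety = 1.0 if safety_judge(trajectory) else -1.0
    return utility + safety_weight * safety


rollout = {"final_state": "ticket_closed", "steps": []}
reward = combined_reward(rollout, "ticket_closed", safety_judge=lambda t: True)
```

Under this rule, a rollout that completes the task unsafely scores lower than one that completes it safely, which is the pressure the joint training is meant to apply.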

Application 2: Online Pre‑Reply Guardrail

AgentDoG 1.5 is deployed as a pre‑reply safety checkpoint: before the final reply reaches the user, the system aggregates the full trajectory (user input, tool calls, tool returns, observations, intermediate reasoning, draft reply) and feeds it to the model for safety assessment.

When the trajectory is safe, the reply proceeds; otherwise it is blocked or replaced and the diagnosis is logged. This approach avoids per‑step latency while providing richer context than final‑reply‑only checks.
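The gating logic above can be sketched in a few lines. This is an illustrative outline rather than the deployed system: `judge` stands in for the guard model, and the verdict dictionary layout and the fallback message are assumptions.

```python
from typing import Callable, Dict, List

FALLBACK_REPLY = "I can't complete this request safely."  # hypothetical replacement text


def pre_reply_guardrail(
    trajectory: Dict[str, object],
    draft_reply: str,
    judge: Callable[[Dict[str, object]], Dict[str, object]],
    audit_log: List[Dict[str, object]],
) -> str:
    """Run one safety check over the full trajectory before the reply is sent."""
    # Aggregate everything the agent did, plus the draft reply, into one input.
    full_context = dict(trajectory, draft_reply=draft_reply)
    verdict = judge(full_context)
    # Fail closed: a verdict without an explicit "safe" flag is treated as unsafe.
    if verdict.get("safe", False):
        return draft_reply
    audit_log.append(verdict)  # keep the diagnosis for later review
    return FALLBACK_REPLY


log: List[Dict[str, object]] = []
unsafe_judge = lambda ctx: {"safe": False, "risk_source": "placeholder"}
reply = pre_reply_guardrail({"tool_calls": []}, "Done.", unsafe_judge, log)
```

Because the check runs once per reply rather than once per step, its cost is a single model call on the aggregated trajectory.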

In the final‑reply‑preventable evaluation, AgentDoG 1.5 reduced residual unsafe rates across benchmarks: ClawSafety ASR from 56.25 % to 18.75 %; AgentHazard Prompt‑Intel‑Theft ASR from 41.92 % to 34.23 %; CIK Core35 ASR from 94.29 % to 68.57 %.
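To compare the three results on a common footing, the absolute and relative reductions can be computed from the reported before and after values. The helper name `reductions` is illustrative.

```python
# (before, after) attack success rates in percent, as reported above.
RESULTS = {
    "ClawSafety": (56.25, 18.75),
    "AgentHazard Prompt-Intel-Theft": (41.92, 34.23),
    "CIK Core35": (94.29, 68.57),
}


def reductions(before: float, after: float):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = absolute / before * 100
    return absolute, relative


for name, (before, after) in RESULTS.items():
    absolute, relative = reductions(before, after)
    print(f"{name}: -{absolute:.2f} points ({relative:.1f} % relative)")
```

In relative terms the drops are about 66.7 %, 18.3 %, and 27.3 %, so the gain is largest on ClawSafety and smallest on the AgentHazard subset.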

Lightweight Model Performance

AgentDoG‑Qwen3‑4B achieves 91.8 % accuracy and 92.7 % F1 on R‑Judge, and 92.8 % accuracy and 93.0 % F1 on ATBench, outperforming most generic guard models. In fine‑grained ATBench diagnostics, it reaches 82.0 % Risk Source Accuracy, 32.4 % Failure Mode Accuracy, and 58.4 % Real‑world Harm Accuracy.

These results suggest that compact models can learn trajectory-level supervision when it is supported by taxonomy-driven data, explicit evidence, chain-of-thought rationales, and data purification.

Conclusion

AgentDoG 1.5 closes the loop from evaluation (ATBench Family) to data generation (DataEngine) to training (SFT & RL) and deployment (online guardrail). By treating safety as a continuous, trajectory‑aware process, it enables scalable, cost‑effective monitoring for agents that execute code, automate workflows, and manage long‑term state.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
