How Hermes Turns Runtime Agent Executions into a Closed‑Loop Training Pipeline
The article explains how Hermes structures the runtime execution of agents—capturing tool calls, context changes, results, and rewards—so that these trajectories can be evaluated, fine‑tuned, and fed into reinforcement‑learning loops, creating a continuous improvement cycle.
From Runtime to a Training Loop
Hermes treats an Agent execution as data worth preserving. It records which tools are invoked, how the context evolves, whether the result is complete, and how the reward is computed, enabling the trace to enter evaluation, fine‑tuning, and reinforcement‑learning pipelines.
Four‑Layer Training Loop
Cron: proactively triggers tasks.
Trajectory: saves the multi-turn rollout.
Environment: defines the task and scoring criteria.
Atropos/Tinker: feeds scored rollouts into SFT or RL training.
1. Reproducible Task Execution
A task entering the training loop must be a reproducible object containing:
Input (user question, dataset item, scheduled prompt, or external event).
Available tools.
Execution environment (sandbox, working directory, etc.).
Process record (model replies, tool calls, tool results, failures, retries).
Result status (completed, interrupted, etc.).
Evaluation signal (reward, metadata).
Hermes separates these concerns: the runtime completes the task, Trajectory records the process, Environment defines the problem and reward, and Batch Runner generates samples at scale.
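To make the separation concrete, here is a minimal sketch of what a reproducible task record could look like as a data structure. The field names are illustrative assumptions, not Hermes' actual schema:

from dataclasses import dataclass, field
from typing import Any

# Illustrative sketch only -- field names are assumptions, not Hermes' schema.
@dataclass
class TaskRecord:
    input: str                      # user question, dataset item, scheduled prompt, or event
    tools: list[str] = field(default_factory=list)               # tools available to the agent
    environment: dict[str, Any] = field(default_factory=dict)    # sandbox, working directory, etc.
    process: list[dict[str, Any]] = field(default_factory=list)  # replies, tool calls, failures, retries
    status: str = "pending"         # completed, interrupted, etc.
    reward: float | None = None     # evaluation signal, filled in after scoring
    metadata: dict[str, Any] = field(default_factory=dict)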
2. Cron as a Continuous Task Source
Cron converts natural‑language times, intervals, or cron expressions into future tasks. Each due job starts a fresh AIAgent session with specified skills, work directory, and delivery channel, isolating the task from previous chat context.
cronjob(
    action="create",
    schedule="every 6h",
    skills=["repo-auditor"],
    workdir="/home/team/project",
    prompt="Check unmerged PRs, CI status, and the causes of test failures; output an engineering risk summary.",
    deliver="wecom"
)

Failures (e.g., credential expiry, API limits) are recorded with a status field and routed to appropriate queues (see the routing sketch after the list). Status values include:
success
retryable_failure
terminal_failure
needs_human_review
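As a sketch, routing might look like the following; the queue names and dispatch function are assumptions for illustration, not Hermes internals:

# Hypothetical routing sketch; the queues are illustrative, not Hermes internals.
RETRY_QUEUE, DEAD_LETTER, REVIEW_QUEUE = [], [], []

def route_job(job: dict) -> None:
    # Dispatch a finished cron job by its recorded status field.
    status = job.get("status")
    if status == "success":
        pass                              # trajectory proceeds toward training
    elif status == "retryable_failure":
        RETRY_QUEUE.append(job)           # e.g., API rate limit: retry later
    elif status == "terminal_failure":
        DEAD_LETTER.append(job)           # e.g., expired credentials: do not retry
    elif status == "needs_human_review":
        REVIEW_QUEUE.append(job)          # ambiguous outcome: escalate to a person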
3. Trajectory Saves Tool‑Call Process
Hermes stores rollouts in a ShareGPT‑compatible JSONL format, preserving system, human, assistant, and tool messages together with timestamps, model name, completion flag, metadata, and tool statistics.
{
  "conversations": [
    {"from": "human", "value": "Investigate why the project's tests are failing"},
    {"from": "assistant", "value": "<tool_call>... terminal pytest ...</tool_call>"},
    {"from": "tool", "value": "<tool_response>test output...</tool_response>"},
    {"from": "assistant", "value": "The failure is caused by a missing column in the database migration."}
  ],
  "completed": true,
  "reward": 0.7,
  "metadata": {"source": "cron", "status": "success"},
  "tool_stats": {"terminal": {"count": 1, "success": 1, "failure": 0}}
}

Batch Runner generates many such trajectories, records tool-usage statistics, discards samples that contain no reasoning, and filters out hallucinated tool names to protect dataset quality.
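A sketch of that quality gate, assuming the ShareGPT JSONL shape shown above (the known-tool registry and helper names are illustrative):

import json

KNOWN_TOOLS = {"terminal"}  # assumed registry of valid tool names

def keep_sample(sample: dict) -> bool:
    # Illustrative filter: drop incomplete, reasoning-free, or hallucinated-tool samples.
    turns = sample.get("conversations", [])
    if not any(t["from"] == "assistant" for t in turns):
        return False                       # no assistant reasoning at all
    if any(name not in KNOWN_TOOLS for name in sample.get("tool_stats", {})):
        return False                       # hallucinated tool name
    return sample.get("completed", False)  # discard interrupted rollouts

def filter_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [s for line in f if keep_sample(s := json.loads(line))]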
4. Environment Provides Scorable Boundaries
Environment abstracts the task, dataset, prompt construction, tool set, sandbox backend, and reward function. It must answer:
What is the next item?
How to turn it into a user message?
Where to run tools?
How to score results?
Example implementation:
class MyEnv(HermesAgentBaseEnv):
    name = "repo-fix-env"

    async def get_next_item(self):
        # Pull the next task from the dataset.
        return self.dataset.next()

    def format_prompt(self, item):
        # Turn a dataset item into the user message for the rollout.
        return item["issue_description"]

    async def compute_reward(self, item, result, ctx):
        # Score the rollout by running the test suite in the sandbox.
        test = ctx.terminal("pytest -q")
        coverage = parse_coverage(ctx.terminal("coverage report"))
        if test["exit_code"] != 0:
            return {"reward": 0.0, "metadata": {"tests": "failed"}}
        # Passing tests earn a base reward; coverage scales the remainder.
        return {
            "reward": 0.6 + 0.4 * coverage["line_rate"],
            "metadata": {"tests": "passed", "coverage": coverage["line_rate"]}
        }

Environment supports three modes:
evaluate: run benchmarks and compute metrics.
process: generate scored JSONL for SFT data.
serve: expose an API that executes rollouts, computes rewards, and returns scored trajectories to Atropos.
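How these modes might be invoked is sketched below; the run method and its arguments are assumptions made for illustration, not the documented Hermes API:

# Hypothetical invocation; the method name and arguments are assumptions.
env = MyEnv()
metrics = env.run(mode="evaluate")            # benchmark pass: metrics only
env.run(mode="process", output="sft.jsonl")   # write scored JSONL for SFT
env.run(mode="serve", port=8000)              # rollout + reward API for Atropos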
5. Atropos and Tinker Close the Online RL Loop
In serve mode, the data flow is:
Atropos requests one or more items from Environment.
Environment formats each item into a prompt and runs a rollout in the Hermes runtime.
Environment computes a reward (e.g., test pass/fail, coverage, rule‑based score) and attaches metadata.
Environment returns the trajectory, reward, done flag, and metadata to Atropos.
Atropos groups rollouts of the same task, calculates advantage signals, and passes them to Tinker.
Tinker performs LoRA training, sampling, and policy updates using algorithms such as GRPO or PPO.
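The loop can be summarized in a schematic sketch; every name below is a placeholder for the real Atropos/Tinker interfaces (which are async in practice), not an actual API:

# Schematic only -- names are placeholders; the real interfaces are async.
def online_rl_step(env, atropos, tinker, group_size=8):
    item = env.get_next_item()                        # 1. request an item
    rollouts = []
    for _ in range(group_size):                       # several rollouts of the same task
        prompt = env.format_prompt(item)              # 2. item -> prompt
        trajectory = env.rollout(prompt)              #    run in the Hermes runtime
        scored = env.compute_reward(item, trajectory.result, trajectory.ctx)  # 3. score
        rollouts.append({"trajectory": trajectory, "done": True, **scored})   # 4. return
    advantages = atropos.compute_advantages(rollouts) # 5. group-relative advantages
    tinker.update_policy(rollouts, advantages)        # 6. GRPO/PPO LoRA update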
6. Reward Design and Sample Filtering Determine Data Quality
Reward functions should reflect the true task goal. Three reward shapes are used:
Binary: e.g., test pass/fail, file exists/absent. Risk: sparse signal.
Continuous: e.g., pass rate, coverage, error reduction, retrieval hit rate. Risk: may over-emphasize intermediate metrics.
Composite: combines multiple signals for engineering, research, or long-term automation tasks. Risk: complex weighting, requires manual calibration (a sketch follows below).
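A minimal sketch of the composite shape; the signals and weights are illustrative assumptions that would need calibration:

def composite_reward(tests_passed: bool, coverage: float, lint_errors: int) -> float:
    # Illustrative weighting only; real weights require manual calibration.
    if not tests_passed:
        return 0.0                                   # hard gate: failing tests dominate
    score = 0.5                                      # base credit for passing tests
    score += 0.4 * coverage                          # continuous signal in [0, 1]
    score += 0.1 * max(0.0, 1.0 - lint_errors / 10)  # small bonus for clean code
    return min(score, 1.0)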
Before trajectories enter training, at least four filters are applied:
Discard incomplete or interrupted samples unless the goal is failure‑recovery training.
Validate tool names, parameters, and output formats to remove hallucinated calls.
Retain failure reasons but keep them separate from successful samples.
When compressing long trajectories, preserve the initial task, key tool calls, and final outcome (a compression sketch follows below).
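A minimal compression sketch consistent with that last filter; the recency heuristic for choosing "key" tool calls is an assumption:

def compress_trajectory(turns: list[dict], max_tool_turns: int = 5) -> list[dict]:
    # Keep the initial task, recent tool exchanges, and the final outcome.
    head, tail = turns[:1], turns[-1:]
    middle = [t for t in turns[1:-1] if t["from"] in ("assistant", "tool")]
    return head + middle[-max_tool_turns:] + tail    # recency heuristic (assumption)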
7. Cron vs. Environment as Data Sources
Cron provides realistic, noisy production data (e.g., periodic repository audits, alert summaries). Its advantages are natural task distribution and long‑term relevance; its drawback is that scoring may be indirect.
Environment offers a controlled, well‑scored experimental setup. Its advantages are clear task boundaries and explicit rewards; its drawback is potential divergence from real‑world workflows.
A robust pipeline first runs Cron for a week to collect real tasks, filters high‑quality successes and recoverable failures, then rewrites those trajectories as Environment items with explicit rewards for process or serve experiments.
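A sketch of that rewriting step, reusing the JSONL shape above; the Environment item schema is an illustrative assumption:

def cron_trajectory_to_env_item(sample: dict) -> dict | None:
    # Turn a high-quality Cron rollout into an Environment dataset item.
    if sample.get("metadata", {}).get("source") != "cron":
        return None
    turns = sample["conversations"]
    return {
        "issue_description": turns[0]["value"],     # original task prompt
        "reference_outcome": turns[-1]["value"],    # final assistant answer
        "reward_hint": sample.get("reward"),        # prior score, if available
    }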
Conclusion
Cron gives Hermes a continuous task entry point; fresh sessions and recursion guards keep sample boundaries clear.
Trajectory records multi‑turn dialogues, tool calls, tool responses, and completion status as ShareGPT JSONL.
Environment packages tasks, tools, sandboxes, and reward functions into a unified, scoreable interface.
Batch Runner and process mode generate offline SFT data; serve mode connects to Atropos/Tinker for online RL.
Data quality hinges on reward design, sample filtering, failure‑sample handling, and trajectory compression; quantity alone is insufficient.