How Hermes Turns Runtime Agent Executions into a Closed‑Loop Training Pipeline
The article explains how Hermes structures the runtime execution of agents—capturing tool calls, context changes, results, and rewards—so that these trajectories can be evaluated, fine‑tuned, and fed into reinforcement‑learning loops, creating a continuous improvement cycle.
From Runtime to a Training Loop
Hermes treats an Agent execution as data worth preserving. It records which tools are invoked, how the context evolves, whether the result is complete, and how the reward is computed, enabling the trace to enter evaluation, fine‑tuning, and reinforcement‑learning pipelines.
Four‑Layer Training Loop
Cron: proactively triggers tasks.
Trajectory: saves the multi-turn rollout.
Environment: defines the task and scoring criteria.
Atropos/Tinker: feeds scored rollouts into SFT or RL training.
1. Reproducible Task Execution
A task entering the training loop must be a reproducible object containing:
Input (user question, dataset item, scheduled prompt, or external event).
Available tools.
Execution environment (sandbox, working directory, etc.).
Process record (model replies, tool calls, tool results, failures, retries).
Result status (completed, interrupted, etc.).
Evaluation signal (reward, metadata).
Hermes separates these concerns: the runtime completes the task, Trajectory records the process, Environment defines the problem and reward, and Batch Runner generates samples at scale.
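To make the separation concrete, here is a minimal sketch of what a reproducible task record could look like as a data structure. The field names are illustrative assumptions, not Hermes' actual schema:

from dataclasses import dataclass, field
from typing import Any

# Illustrative sketch only -- field names are assumptions, not Hermes' schema.
@dataclass
class TaskRecord:
    input: str                      # user question, dataset item, scheduled prompt, or event
    tools: list[str] = field(default_factory=list)               # tools available to the agent
    environment: dict[str, Any] = field(default_factory=dict)    # sandbox, working directory, etc.
    process: list[dict[str, Any]] = field(default_factory=list)  # replies, tool calls, failures, retries
    status: str = "pending"         # completed, interrupted, etc.
    reward: float | None = None     # evaluation signal, filled in after scoring
    metadata: dict[str, Any] = field(default_factory=dict)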
2. Cron as a Continuous Task Source
Cron converts natural‑language times, intervals, or cron expressions into future tasks. Each due job starts a fresh AIAgent session with specified skills, work directory, and delivery channel, isolating the task from previous chat context.
cronjob(
    action="create",
    schedule="every 6h",
    skills=["repo-auditor"],
    workdir="/home/team/project",
    prompt="Check unmerged PRs, CI status, and the causes of test failures; output an engineering risk summary.",
    deliver="wecom"
)

Failures (e.g., credential expiry, API limits) are recorded with a status field and routed to appropriate queues (see the routing sketch after the list). Status values include:
success
retryable_failure
terminal_failure
needs_human_review
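As a sketch, routing might look like the following; the queue names and dispatch function are assumptions for illustration, not Hermes internals:

# Hypothetical routing sketch; the queues are illustrative, not Hermes internals.
RETRY_QUEUE, DEAD_LETTER, REVIEW_QUEUE = [], [], []

def route_job(job: dict) -> None:
    # Dispatch a finished cron job by its recorded status field.
    status = job.get("status")
    if status == "success":
        pass                              # trajectory proceeds toward training
    elif status == "retryable_failure":
        RETRY_QUEUE.append(job)           # e.g., API rate limit: retry later
    elif status == "terminal_failure":
        DEAD_LETTER.append(job)           # e.g., expired credentials: do not retry
    elif status == "needs_human_review":
        REVIEW_QUEUE.append(job)          # ambiguous outcome: escalate to a person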
3. Trajectory Saves Tool‑Call Process
Hermes stores rollouts in a ShareGPT‑compatible JSONL format, preserving system, human, assistant, and tool messages together with timestamps, model name, completion flag, metadata, and tool statistics.
{
  "conversations": [
    {"from": "human", "value": "Investigate why the project's tests are failing"},
    {"from": "assistant", "value": "<tool_call>... terminal pytest ...</tool_call>"},
    {"from": "tool", "value": "<tool_response>test output...</tool_response>"},
    {"from": "assistant", "value": "The failure is caused by a missing column in the database migration."}
  ],
  "completed": true,
  "reward": 0.7,
  "metadata": {"source": "cron", "status": "success"},
  "tool_stats": {"terminal": {"count": 1, "success": 1, "failure": 0}}
}

Batch Runner generates many such trajectories, records tool-usage statistics, discards samples that contain no reasoning, and filters out hallucinated tool names to protect dataset quality.
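A sketch of that quality gate, assuming the ShareGPT JSONL shape shown above (the known-tool registry and helper names are illustrative):

import json

KNOWN_TOOLS = {"terminal"}  # assumed registry of valid tool names

def keep_sample(sample: dict) -> bool:
    # Illustrative filter: drop incomplete, reasoning-free, or hallucinated-tool samples.
    turns = sample.get("conversations", [])
    if not any(t["from"] == "assistant" for t in turns):
        return False                       # no assistant reasoning at all
    if any(name not in KNOWN_TOOLS for name in sample.get("tool_stats", {})):
        return False                       # hallucinated tool name
    return sample.get("completed", False)  # discard interrupted rollouts

def filter_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [s for line in f if keep_sample(s := json.loads(line))]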
4. Environment Provides Scorable Boundaries
Environment abstracts the task, dataset, prompt construction, tool set, sandbox backend, and reward function. It must answer:
What is the next item?
How to turn it into a user message?
Where to run tools?
How to score results?
Example implementation:
class MyEnv(HermesAgentBaseEnv):
    name = "repo-fix-env"

    async def get_next_item(self):
        # Pull the next task from the dataset.
        return self.dataset.next()

    def format_prompt(self, item):
        # Turn a dataset item into the user message for the rollout.
        return item["issue_description"]

    async def compute_reward(self, item, result, ctx):
        # Score the rollout by running the test suite in the sandbox.
        test = ctx.terminal("pytest -q")
        coverage = parse_coverage(ctx.terminal("coverage report"))
        if test["exit_code"] != 0:
            return {"reward": 0.0, "metadata": {"tests": "failed"}}
        # Passing tests earn a base reward; coverage scales the remainder.
        return {
            "reward": 0.6 + 0.4 * coverage["line_rate"],
            "metadata": {"tests": "passed", "coverage": coverage["line_rate"]}
        }

Environment supports three modes:
evaluate: run benchmarks and compute metrics.
process: generate scored JSONL for SFT data.
serve: expose an API that executes rollouts, computes rewards, and returns scored trajectories to Atropos.
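How these modes might be invoked is sketched below; the run method and its arguments are assumptions made for illustration, not the documented Hermes API:

# Hypothetical invocation; the method name and arguments are assumptions.
env = MyEnv()
metrics = env.run(mode="evaluate")            # benchmark pass: metrics only
env.run(mode="process", output="sft.jsonl")   # write scored JSONL for SFT
env.run(mode="serve", port=8000)              # rollout + reward API for Atropos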
5. Atropos and Tinker Close the Online RL Loop
In serve mode, the data flow is:
Atropos requests one or more items from Environment.
Environment formats each item into a prompt and runs a rollout in the Hermes runtime.
Environment computes a reward (e.g., test pass/fail, coverage, rule‑based score) and attaches metadata.
Environment returns the trajectory, reward, done flag, and metadata to Atropos.
Atropos groups rollouts of the same task, calculates advantage signals, and passes them to Tinker.
Tinker performs LoRA training, sampling, and policy updates using algorithms such as GRPO or PPO.
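The loop can be summarized in a schematic sketch; every name below is a placeholder for the real Atropos/Tinker interfaces (which are async in practice), not an actual API:

# Schematic only -- names are placeholders; the real interfaces are async.
def online_rl_step(env, atropos, tinker, group_size=8):
    item = env.get_next_item()                        # 1. request an item
    rollouts = []
    for _ in range(group_size):                       # several rollouts of the same task
        prompt = env.format_prompt(item)              # 2. item -> prompt
        trajectory = env.rollout(prompt)              #    run in the Hermes runtime
        scored = env.compute_reward(item, trajectory.result, trajectory.ctx)  # 3. score
        rollouts.append({"trajectory": trajectory, "done": True, **scored})   # 4. return
    advantages = atropos.compute_advantages(rollouts) # 5. group-relative advantages
    tinker.update_policy(rollouts, advantages)        # 6. GRPO/PPO LoRA update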
6. Reward Design and Sample Filtering Determine Data Quality
Reward functions should reflect the true task goal. Three reward shapes are used:
Binary: e.g., test pass/fail, file exists/absent. Risk: sparse signal.
Continuous: e.g., pass rate, coverage, error reduction, retrieval hit rate. Risk: may over-emphasize intermediate metrics.
Composite: combines multiple signals for engineering, research, or long-term automation tasks. Risk: complex weighting, requires manual calibration (a sketch follows below).
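A minimal sketch of the composite shape; the signals and weights are illustrative assumptions that would need calibration:

def composite_reward(tests_passed: bool, coverage: float, lint_errors: int) -> float:
    # Illustrative weighting only; real weights require manual calibration.
    if not tests_passed:
        return 0.0                                   # hard gate: failing tests dominate
    score = 0.5                                      # base credit for passing tests
    score += 0.4 * coverage                          # continuous signal in [0, 1]
    score += 0.1 * max(0.0, 1.0 - lint_errors / 10)  # small bonus for clean code
    return min(score, 1.0)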
Before trajectories enter training, at least four filters are applied:
Discard incomplete or interrupted samples unless the goal is failure‑recovery training.
Validate tool names, parameters, and output formats to remove hallucinated calls.
Retain failure reasons but keep them separate from successful samples.
When compressing long trajectories, preserve the initial task, key tool calls, and final outcome (a compression sketch follows below).
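A minimal compression sketch consistent with that last filter; the recency heuristic for choosing "key" tool calls is an assumption:

def compress_trajectory(turns: list[dict], max_tool_turns: int = 5) -> list[dict]:
    # Keep the initial task, recent tool exchanges, and the final outcome.
    head, tail = turns[:1], turns[-1:]
    middle = [t for t in turns[1:-1] if t["from"] in ("assistant", "tool")]
    return head + middle[-max_tool_turns:] + tail    # recency heuristic (assumption)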
7. Cron vs. Environment as Data Sources
Cron provides realistic, noisy production data (e.g., periodic repository audits, alert summaries). Its advantages are natural task distribution and long‑term relevance; its drawback is that scoring may be indirect.
Environment offers a controlled, well‑scored experimental setup. Its advantages are clear task boundaries and explicit rewards; its drawback is potential divergence from real‑world workflows.
A robust pipeline first runs Cron for a week to collect real tasks, filters high‑quality successes and recoverable failures, then rewrites those trajectories as Environment items with explicit rewards for process or serve experiments.
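A sketch of that rewriting step, reusing the JSONL shape above; the Environment item schema is an illustrative assumption:

def cron_trajectory_to_env_item(sample: dict) -> dict | None:
    # Turn a high-quality Cron rollout into an Environment dataset item.
    if sample.get("metadata", {}).get("source") != "cron":
        return None
    turns = sample["conversations"]
    return {
        "issue_description": turns[0]["value"],     # original task prompt
        "reference_outcome": turns[-1]["value"],    # final assistant answer
        "reward_hint": sample.get("reward"),        # prior score, if available
    }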
Conclusion
Cron gives Hermes a continuous task entry point; fresh sessions and recursion guards keep sample boundaries clear.
Trajectory records multi‑turn dialogues, tool calls, tool responses, and completion status as ShareGPT JSONL.
Environment packages tasks, tools, sandboxes, and reward functions into a unified, scoreable interface.
Batch Runner and process mode generate offline SFT data; serve mode connects to Atropos/Tinker for online RL.
Data quality hinges on reward design, sample filtering, failure‑sample handling, and trajectory compression; quantity alone is insufficient.