Building an AI Workbench: Practical Agentic Engineering with Plans, Context, and Verification

The article distills insights from Matt Van Horn and John Kim on Agentic Engineering, proposing a five‑layer AI workbench (plan, context, execution, verification, governance), controlled parallelism, context engineering, reusable Skills, Hooks, Subagents, permission models, review templates, and a one‑week team experiment to embed engineering habits into AI‑driven workflows.

Architect

Impressions

Reading the two recent posts – Matt Van Horn’s "Every Agentic Engineering Hack I Know" and John Kim’s "Claude Code 50 Tips" – creates a vivid picture of an AI workbench where the left side holds the plan, the right side holds the context, followed by testing, logs, permissions, and rollback. The components themselves (plan.md, CLAUDE.md, hooks, subagents, worktrees) are not new, but when assembled into an Agentic Engineering workflow they give ordinary engineering habits a concrete, tool‑driven form.

plan.md, Plan mode, Dynamic Workflows: appear as documentation but actually encode goals, boundaries, and stop conditions for complex tasks.

CLAUDE.md, historical plans, meeting notes, search tools, notebooks: value lies in providing the right information at the right time, not just piling up context.

Skills, hooks, subagents, plugins: act as process assets, feedback mechanisms, context isolation, and capability packaging.

Running multiple agents is less about opening 4‑6 windows and more about ensuring each window has a clean workspace, clear tasks, verifiable results, and merge rules.

For architects, the most useful artifact is the layered workbench – plan, context, execution, verification, and governance.

Parallelism

Many people’s first reaction to Agentic Engineering is "parallelism" – opening several terminals for planning, coding, research, and bug fixing. Matt’s daily routine indeed looks like this, with separate cmux sessions for /ce-plan, /ce-work, a last30days community scan, and a bug‑fix session. John Kim describes a similar approach with multiple Claude Code instances, iTerm splits, Git worktrees, and notifications.

However, copying this directly into a team setting often hits three problems:

Multiple agents editing the same workspace cause file overwrites.

Shared context leads to stale errors mixing with new goals.

Output volume grows faster than review, testing, merging, and rollback can keep up.

Therefore, I place "multiple agents" later in the workflow and focus first on "controllable parallelism" – clearly defining task decomposition, workspace isolation, result validation, conflict resolution, and rollback strategies.
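Workspace isolation can be sketched as one Git worktree and one branch per task. This is a minimal sketch, not Matt's or John's actual setup: the task names, the `agent/` branch prefix, and the `../wt` directory are hypothetical. It only builds the `git worktree add` commands, so they can be reviewed before anyone runs them.

```python
from pathlib import Path


def worktree_commands(tasks, base_dir="../wt", base_branch="main"):
    """Return one `git worktree add` command (as an argv list) per task."""
    commands = []
    seen = set()
    for task in tasks:
        # Normalise the task name into a slug usable in branch and directory names.
        slug = "-".join(task.lower().split())
        if slug in seen:
            # Two agents must never share a workspace.
            raise ValueError(f"duplicate task: {task!r}")
        seen.add(slug)
        path = (Path(base_dir) / slug).as_posix()
        # Shape: git worktree add -b <new-branch> <path> <start-point>
        commands.append(
            ["git", "worktree", "add", "-b", f"agent/{slug}", path, base_branch]
        )
    return commands


for cmd in worktree_commands(["Fix login bug", "Research rate limits"]):
    print(" ".join(cmd))
```

Rejecting duplicate slugs up front is the point of the sketch: the overwrite problem is prevented when tasks are decomposed, not discovered at merge time.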

Plan

Matt’s workflow always starts by generating a plan.md. He treats everything – product ideas, GitHub issues, terminal errors, screenshots, designs, Slack discussions – as raw material for a plan and feeds it through /ce-plan. The plan is not just a document; it provides a stable anchor for later execution, verification, and hand‑off.

A useful plan should answer these questions:

What problem does the task solve?

What is explicitly out of scope?

Which files, links, logs, or screenshots are needed?

Which modules might be affected?

Why was the implementation path chosen?

What are the acceptance criteria?

When can the work stop?

Where does the next iteration resume if context is interrupted?

In practice I treat plan.md as a lightweight task contract rather than a full project plan. An example skeleton:

# plan.md

## Goal
What problem this task aims to solve.

## Out of Scope
Explicitly list what will not be changed.

## Background & Input
Issue link, screenshots, error logs, user feedback, related URLs.

## Affected Files
List candidate files; accuracy can improve over time.

## Implementation Steps
Break into verifiable small steps.

## Acceptance Criteria
What result counts as completed.

## Validation Commands
Test, build, lint, screenshots, log checks.

## Current Status
Done, pending, blocked, next steps.

If a more formal contract is needed, I add two sections:

## Evidence Requirements
- Which commands must run
- Whether screenshots or logs are required
- Which failures are acceptable and which require a stop

## Merge Conditions
- Who reviews
- What PR description must contain
- Which risks need manual confirmation

This file is read by the Agent, the reviewer, the next session, and your future self.
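Because the plan acts as a contract, it can be checked mechanically before an agent starts work. The following is a minimal sketch under stated assumptions: the list of required sections is my own selection from the skeleton above, and `missing_sections` is a hypothetical helper, not part of any tool mentioned here.

```python
import re

# Assumption: these headings from the skeleton are treated as mandatory.
REQUIRED_SECTIONS = [
    "Goal",
    "Out of Scope",
    "Affected Files",
    "Acceptance Criteria",
    "Validation Commands",
]


def missing_sections(plan_text, required=REQUIRED_SECTIONS):
    """Return required `## ` headings that are absent or have no body text."""
    sections = {}
    current = None
    for line in plan_text.splitlines():
        # Only level-2 headings open a section; "# plan.md" is ignored.
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return [
        name
        for name in required
        if not any(body_line.strip() for body_line in sections.get(name, []))
    ]


plan = """# plan.md

## Goal
Fix the login timeout.

## Out of Scope

## Acceptance Criteria
Login succeeds after a session refresh.
"""
print(missing_sections(plan))
```

In this example the check reports "Out of Scope" (present but empty), "Affected Files", and "Validation Commands", which is the kind of gap worth closing before execution begins.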

Context

Both Matt and John emphasize context, but with different intensities. Matt injects raw transcripts, community scans, and historical plans directly into the agent. John warns that CLAUDE.md can bloat and that token usage must be monitored; he also stresses cleaning the context after a task.

The key insight is that more context is not always better – it must be placed correctly.

I split context into three tiers:

Immediate Window Context : current goal, constraints, recent errors, key diffs – used for the next round of reasoning.

File‑Based Context : plan.md, CLAUDE.md, runtime instructions, design records – read multiple times during the task.

External Retrieval : full meeting transcripts, incident reports, community discussions, old PRs – fetched only when needed.

This layering prevents the model from being flooded with irrelevant data while still giving it access to the information it truly needs.
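The three tiers can be sketched as a selection rule. This is an illustration only: `ContextItem` and `select_context` are hypothetical names, and character count stands in for token count as a rough proxy.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    tier: str  # "immediate", "file", or "external"
    text: str


def select_context(items, budget_chars, requested=()):
    """Pick items for the next turn: immediate first, then files, then requested external."""
    order = {"immediate": 0, "file": 1, "external": 2}
    chosen, used = [], 0
    # sorted() is stable, so items keep their given order within a tier.
    for item in sorted(items, key=lambda i: order[i.tier]):
        # External material is fetched only when explicitly requested by name.
        if item.tier == "external" and item.name not in requested:
            continue
        if used + len(item.text) > budget_chars:
            continue  # skip anything that would overflow the budget
        chosen.append(item.name)
        used += len(item.text)
    return chosen


items = [
    ContextItem("old-pr", "external", "x" * 50),
    ContextItem("plan.md", "file", "y" * 40),
    ContextItem("goal", "immediate", "z" * 20),
]
print(select_context(items, budget_chars=120))
print(select_context(items, budget_chars=120, requested=("old-pr",)))
```

The old PR enters the window only in the second call, when it is requested by name; otherwise it stays in external storage no matter how much budget is left.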

Experience (Skills)

Many newcomers treat Skills as "advanced prompts", but in practice a Skill is a reusable, executable process. Matt suggests turning any repeated action (more than twice) into a Skill. Kaxil Naik’s long‑form article reinforces this, and Addy Osmani describes Skills as process assets rather than mere documentation.

A minimal Skill example for API change verification:

---
name: api-change-verification
description: When modifying an API, DTO, permission logic, or call chain, verify compatibility and regression risk.
---
# API Change Verification
## Read
- OpenAPI schema
- Relevant controller / route files
- List of callers
- Most recent incident report
## Checks
1. Run contract tests.
2. Run affected module unit tests.
3. Verify caller field compatibility.
4. If permissions are involved, run permission regression suite.
## Output Evidence
- Which APIs changed
- Commands executed
- Passed / failed results
- Remaining manual risk assessment
## Gotchas
- HTTP 200 does not guarantee business success; inspect response body codes.
- Legacy clients may still send old payloads.

Hooks are event‑driven extensions that intervene at specific points. Claude Code defines events such as PreToolUse, PostToolUse, Stop, SubagentStart, and PreCompact. Two practical hook categories:

Block dangerous actions (deletions, migrations, production config changes, sensitive directory access).

After tool execution, collect key logs or enforce validation steps.
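The first category can be sketched as the decision logic of a PreToolUse hook. This sketch rests on my reading of the Claude Code hooks documentation, which should be checked against the current docs: the hook receives a JSON payload on stdin containing `tool_name` and `tool_input`, and exit code 2 blocks the tool call while stderr is fed back to the agent. The deny-list patterns are hypothetical examples.

```python
import re

# Hypothetical deny-list; a real team would tune these patterns.
DANGEROUS_PATTERNS = [
    r"\brm\s+-(?=\w*r)(?=\w*f)\w+",  # rm -rf, rm -fr and similar flag orders
    r"\bgit\s+push\b.*--force",
    r"\bdrop\s+table\b",
]


def decide(payload):
    """Return (exit_code, message) for a PreToolUse payload dict."""
    if payload.get("tool_name") != "Bash":
        return 0, ""  # this sketch only inspects shell commands
    command = payload.get("tool_input", {}).get("command", "")
    for pattern in DANGEROUS_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return 2, f"Blocked by policy: command matches {pattern!r}"
    return 0, ""


# Assumed wiring inside the actual hook script:
#   code, message = decide(json.load(sys.stdin))
#   print(message, file=sys.stderr)
#   sys.exit(code)
print(decide({"tool_name": "Bash", "tool_input": {"command": "rm -rf build"}}))
print(decide({"tool_name": "Bash", "tool_input": {"command": "pytest -q"}}))
```

Keeping the decision in a pure function makes the block list testable without starting an agent session. Pattern matching of this kind is easy to bypass, so it complements the permission model rather than replacing it.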

Comparison of mechanisms:

Skill : describes a process, checklist, and output evidence; not for hard blocking.

Hook : intercepts commands, adds logs, reminds about validation; not for complex reasoning.

Subagent : isolates search, audit, and review context; suited for independent tasks.

Plugin : packages team capabilities; not a replacement for governance.

Isolation (Subagents)

Subagents are often described as "multiple roles" (reviewer, researcher, security, test agents). The real value is isolation: a subtask that generates large, low‑density context (searching logs, comparing files) runs in its own context, and in its own worktree or branch if it edits files, returning only high‑density results (conclusion, evidence, risk, next steps) to the main session.

Typical candidates for subagents:

Bulk retrieval.

Independent review.

Security audit.

Large repository scans.

Read‑only data aggregation.

Narrow‑scope tasks with tight permissions.

Tasks that require frequent negotiation or share a lot of intermediate state are better kept in the main agent.
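The "high‑density result" can be given a fixed shape so the main session never receives raw search output. A minimal sketch, assuming a hypothetical `SubagentReport` structure with the four fields named above; the cap of three items per field is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class SubagentReport:
    conclusion: str
    evidence: list = field(default_factory=list)
    risks: list = field(default_factory=list)
    next_steps: list = field(default_factory=list)


def to_summary(report, max_items=3):
    """Render a compact, fixed-shape summary for the main session."""
    lines = [f"Conclusion: {report.conclusion}"]
    for label, items in (
        ("Evidence", report.evidence),
        ("Risk", report.risks),
        ("Next", report.next_steps),
    ):
        for item in items[:max_items]:
            lines.append(f"{label}: {item}")
        if len(items) > max_items:
            # Overflow is counted rather than copied into the main context.
            lines.append(f"{label}: (+{len(items) - max_items} more, available on request)")
    return "\n".join(lines)


report = SubagentReport(
    conclusion="Timeout is caused by a retry loop",
    evidence=["log line A", "log line B", "log line C", "log line D"],
    risks=["fix may change retry semantics"],
)
print(to_summary(report))
```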

Permissions

When moving from a personal workbench to a team, I adopt a three‑layer permission model:

Read‑only layer : Agents may read issues, logs, docs, monitoring data, test results, and meeting notes, but cannot write to production systems.

Semi‑automatic layer : Agents can generate drafts, commands, PRs, change requests, and reply suggestions, but a human must confirm execution.

Controlled write layer : Low‑risk, rollback‑able, auditable actions (e.g., creating temporary branches, updating draft docs, running local scripts) may be executed automatically.

Specific capability policies (default action first, caveats after):

Read code & docs – allowed (watch for private repos).

Run tests & builds – allowed (with timeout and resource limits).

Write local files – allowed if confined to a branch or worktree.

Open PR – human approval required; PR description must include verification evidence.

Access Slack / email – read‑only by default to avoid leaks and prompt injection.

Change production config – prohibited; must go through approval and audit.

Use logged‑in browser sessions – high risk; only with isolated accounts and explicit allow‑list.

For any integration with real services (Slack, email, browsers), I first write a threat model and then the automation.
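The capability policies above can be written down as data with a default‑deny lookup. This is a sketch under assumptions: the capability names are hypothetical labels, and where the list leaves a case open (writing outside a worktree, an allow‑listed browser session, unknown capabilities) the outcome chosen here is my own conservative guess.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK_HUMAN = "ask_human"
    DENY = "deny"


# Hypothetical capability labels mapped to the defaults listed above.
BASE_POLICY = {
    "read_code": Decision.ALLOW,
    "run_tests": Decision.ALLOW,
    "open_pr": Decision.ASK_HUMAN,
    "read_slack": Decision.ALLOW,
    "change_production_config": Decision.DENY,
}


def check(capability, in_worktree=False, allow_listed=False):
    """Return the Decision for a capability request."""
    if capability == "write_local_files":
        # Allowed only inside a branch or worktree; asking otherwise is an assumption.
        return Decision.ALLOW if in_worktree else Decision.ASK_HUMAN
    if capability == "use_logged_in_browser":
        # High risk: denied unless allow-listed, and then a human confirms (assumption).
        return Decision.ASK_HUMAN if allow_listed else Decision.DENY
    # Anything not listed is denied (assumption: default-deny).
    return BASE_POLICY.get(capability, Decision.DENY)


print(check("write_local_files", in_worktree=True))
print(check("send_email"))
```

Default‑deny means a newly connected integration does nothing until someone adds it to the table, which fits the rule of writing the threat model first.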

Review

Kaxil Naik notes that code generation is cheap, but review becomes the bottleneck. Agents can produce PRs that look plausible while silently failing. Adding verification evidence to the PR eases review. A concise PR template that captures the essential information:

## Goal
What problem this PR solves.

## Scope
Which modules are changed; what is explicitly untouched.

## Verification Evidence
- Commands run
- Results
- Screenshots / logs

## Risks
- Uncovered areas
- Items needing manual confirmation

Without this evidence, a PR merely shifts the work from "write code" to "guess correctness".
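One way to keep the evidence honest is to generate the PR body from recorded command results rather than from the agent's own summary. A minimal sketch, assuming a hypothetical `render_pr_description` helper that takes `(command, exit_code)` pairs; how those pairs get recorded is out of scope here.

```python
def render_pr_description(goal, scope, runs, risks):
    """Build a PR body from recorded runs; `runs` is a list of (command, exit_code)."""
    if not runs:
        # Refuse to produce a PR body that has nothing to verify against.
        raise ValueError("no verification evidence recorded")
    lines = ["## Goal", goal, "", "## Scope", scope, "", "## Verification Evidence"]
    for command, exit_code in runs:
        status = "passed" if exit_code == 0 else f"FAILED (exit {exit_code})"
        lines.append(f"- `{command}`: {status}")
    lines += ["", "## Risks"]
    lines += [f"- {risk}" for risk in risks] or ["- None identified"]
    return "\n".join(lines)


body = render_pr_description(
    goal="Fix login timeout",
    scope="auth module only; session storage untouched",
    runs=[("pytest tests/auth", 0), ("ruff check .", 1)],
    risks=[],
)
print(body)
```

A failed command stays visible in the description instead of being dropped, so the reviewer sees what was skipped or broken before reading the diff.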

One‑Week Experiment

To try the workbench in a team, I start small and run a seven‑day pilot on a medium‑complexity bug fix.

Day 1 – Choose a task : not trivial, low risk, reproducible, with clear verification steps.

Day 2 – Write a minimal plan.md : goal, out‑of‑scope, involved files, acceptance criteria, validation command.

Day 3 – Add a concise CLAUDE.md : only the domain and validation needed for this task.

Day 4 – Execute in an isolated environment : separate branch or worktree, leave evidence at each step, note any skipped validation.

Day 5 – Conduct a thorough manual review : examine diff, tests, and agent explanations; flag silent failures or over‑modifications.

Day 6 – Record gotchas : write down one or two critical failure points and turn them into verification commands or hooks.

Day 7 – Evaluate value : answer four questions – does the agent reduce repetitive work, is verification clearer, is review cost lower, can the experience be reused?

Optional quantitative metrics:

Rework count – how many times the reviewer asked the agent to redo work.

Missing verification – errors caught only by humans.

Context pollution – whether the agent referenced stale goals or wrong files.

Merge latency – time from PR creation to mergeable state.

Experience capture – number of new gotchas, Skills, or hooks added.
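Three of these metrics reduce to simple arithmetic over per‑PR records. A sketch assuming a hypothetical `PilotPR` record kept by hand during the pilot; the field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PilotPR:
    opened_at: datetime
    mergeable_at: datetime
    rework_requests: int      # times the reviewer asked the agent to redo work
    human_only_catches: int   # errors caught only by humans


def summarize(prs):
    """Aggregate rework count, missing verification, and mean merge latency in hours."""
    if not prs:
        raise ValueError("no PRs recorded")
    latencies = [
        (pr.mergeable_at - pr.opened_at).total_seconds() / 3600 for pr in prs
    ]
    return {
        "rework_count": sum(pr.rework_requests for pr in prs),
        "missing_verification": sum(pr.human_only_catches for pr in prs),
        "mean_merge_latency_hours": sum(latencies) / len(latencies),
    }


prs = [
    PilotPR(datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 15), 2, 1),
    PilotPR(datetime(2025, 1, 7, 10), datetime(2025, 1, 7, 12), 0, 0),
]
print(summarize(prs))
```

Context pollution and experience capture are left out because they need human judgment and are better recorded as notes than computed.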

If the pilot cannot answer these questions, I hold off on adding MCP, Subagents, or plugins. A stable loop of plan, verification, and review is enough to keep going.

Five Core Actions

Plan file : always create a lightweight plan.md that records goal, boundaries, acceptance, and validation.

CLAUDE.md : capture project domain, constraints, and how to prove completion.

Verification Skill : write a Skill for the most error‑prone checks before automating.

Hooks : start with simple pre‑action hooks that block dangerous commands and post‑action hooks that collect logs.

Subagents : isolate tasks that can run independently and return high‑density results.

Conclusion

Agentic Engineering turns everyday engineering habits into structured, auditable artifacts. When the workbench clearly records plans, provides the right context, runs verifiable steps, enforces permission boundaries, and captures failure experience, agents become reliable collaborators rather than unpredictable black boxes. The real bottleneck shifts from writing code to defining boundaries, evidence, responsibility, and ongoing maintenance.

References

Matt Van Horn, "Every Agentic Engineering Hack I Know" (June 2026) – https://x.com/mvanhorn/status/2061877533885473181

John Kim, "How I use Claude Code" (Meta Staff Engineer) – https://www.youtube.com/watch?v=mZzhfPle9QU

Kaxil Naik, "I Haven't Written a Line of Code in 4 Months" – https://x.com/kaxil/status/2037503513350005134

Simon Willison, "Context engineering" – https://simonwillison.net/2025/jun/27/context-engineering/

Addy Osmani, "Agent Skills" – https://www.oreilly.com/radar/agent-skills/

Claude Code Docs – Skills – https://code.claude.com/docs/en/skills

Claude Code Docs – Hooks – https://code.claude.com/docs/en/hooks

Claude Code Docs – Subagents – https://code.claude.com/docs/en/sub-agents
