How to Make AI Agents Reliable: Skillify’s 10‑Step Continuous Improvement Process

Agent systems often repeat the same failures, like missing historical calendar data or miscalculating time zones, but Garry Tan’s Skillify framework turns each error into a testable skill with a ten‑step checklist—including contracts, deterministic scripts, unit and integration tests, LLM evals, resolver checks, DRY audits, smoke tests, and knowledge‑base filing—to make agents structurally unable to repeat mistakes.


Why Agents Keep Failing

Agents often repeat the same mistakes. A calendar query for a ten‑year‑old business trip fails twice before succeeding: the live calendar API rejects the request, a noisy email search returns nothing, and only then does a local grep find the answer in milliseconds. The agent never checks the local knowledge base first, a classic case of sensor failure: the agent lacks awareness of its own existing data.

In another instance the agent tells a user the next meeting is in 28 minutes when it is actually 88 minutes away, because the model performs a UTC‑to‑PT conversion in its latent reasoning space instead of running the deterministic context-now.mjs script that already contains the correct calculation.
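The gap between 28 and 88 minutes is exactly one hour, the signature of a botched time‑zone conversion. A minimal sketch of the deterministic alternative (the function name is assumed; the real context-now.mjs does more):

```javascript
// Compute "minutes until next meeting" with Date arithmetic instead of
// letting the model convert UTC to PT in its head. Both timestamps are
// ISO 8601 strings in UTC.
function minutesUntil(nextEventUtcIso, nowUtcIso) {
  const diffMs = new Date(nextEventUtcIso) - new Date(nowUtcIso);
  return Math.round(diffMs / 60000);
}
```

Because both inputs carry an explicit UTC offset, there is no conversion step for the model to get wrong.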

Both failures share a pattern: the agent has a deterministic tool available (a script) but chooses to reason with the model instead, leading to “vibes‑based” reliability that degrades as prompts become more complex.

Skillify: Turning Failures into Permanent Constraints

Garry Tan’s solution, called skillify, is a ten‑step checklist that converts each failure into a structured, testable skill so the same error cannot recur.

Step 1 – SKILL.md (Contract)

A SKILL.md file defines the skill’s name, description, and trigger rule. For the calendar‑recall case:

name: calendar-recall
description: "Brain‑first historical calendar lookup."
ALWAYS use this before any live API for any event
not in the future or the last 48 hours.

The hard rule enforces that historical queries go to the local knowledge base first.
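The trigger rule is simple enough to encode as a predicate. A hypothetical sketch (function and variable names assumed, not from the article):

```javascript
// Encode the SKILL.md hard rule: use the local brain first for any event
// that is neither in the future nor within the last 48 hours.
function shouldUseBrainFirst(eventIso, nowIso) {
  const event = new Date(eventIso).getTime();
  const now = new Date(nowIso).getTime();
  const fortyEightHoursMs = 48 * 3600 * 1000;
  const isFuture = event > now;
  const isRecent = now - event <= fortyEightHoursMs;
  return !isFuture && !isRecent;
}
```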

Step 2 – Deterministic Script

The skill points the agent to a deterministic script, e.g. scripts/calendar-recall.mjs, which greps the local index of 3,146 calendar files and returns results in under 100 ms with zero LLM calls.

$ node scripts/calendar-recall.mjs "Singapore"
Found 2 matching day(s):
── 2016-05-07 ──
Flight to Singapore, Mandarin Oriental check‑in
── 2016-05-08 ──
Lunch with investors at Fullerton Hotel
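The core of such a script can be sketched over an in‑memory index (the real script greps ~3,146 calendar files on disk; file loading and the entry shape shown here are assumptions):

```javascript
// Deterministic keyword lookup over a pre-loaded calendar index.
// Each entry: { date: 'YYYY-MM-DD', events: [string, ...] }.
function calendarRecall(index, keyword) {
  const needle = keyword.toLowerCase();
  return index.filter(day =>
    day.events.some(e => e.toLowerCase().includes(needle))
  );
}
```

A plain substring filter like this is trivially fast and makes zero LLM calls, which is the whole point.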

Step 3 – Unit Tests

Each deterministic function (e.g. parseEventLine, eventMatchesKeyword, searchKeyword, formatJson) gets a vitest unit test against fixture data. Bugs such as Unicode‑character loss, leap‑year date handling, and a missing attendees array are caught early.
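As one hedged example, a parseEventLine of the kind named above might look like this (the line format and implementation are assumed; the article's tests use vitest, shown here with plain assertions):

```javascript
// Parse "HH:MM Title [attendee1, attendee2]" lines. Always returning an
// attendees array guards against the "missing attendees array" bug class.
function parseEventLine(line) {
  const match = line.match(/^(\d{2}:\d{2})\s+(.*?)(?:\s+\[(.*)\])?$/);
  if (!match) return null;
  const [, time, title, attendees] = match;
  return {
    time,
    title,
    attendees: attendees ? attendees.split(',').map(s => s.trim()) : [],
  };
}
```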

Step 4 – Integration Tests

Integration tests run the script against real data (live calendar cache, real JSON from context-now.mjs) to catch format errors, missing time‑zone fields, Windows line endings, and midnight‑spanning events.
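A validation helper for those failure classes could be sketched as follows (field names and the exact checks are assumptions, not the article's code):

```javascript
// Check one raw JSON event record for the failure classes listed above:
// Windows line endings, a missing time-zone field, and events whose end
// time sorts before their start time (i.e. they span midnight).
function validateEvent(raw) {
  const problems = [];
  if (raw.includes('\r\n')) problems.push('windows-line-endings');
  const event = JSON.parse(raw.replace(/\r\n/g, '\n'));
  if (!event.timeZone) problems.push('missing-time-zone');
  if (event.start && event.end && event.end < event.start) {
    problems.push('midnight-spanning-event');
  }
  return problems;
}
```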

Step 5 – LLM Evaluations

Some outputs need qualitative judgment (e.g., “Is this calendar summary useful?”). An LLM‑as‑judge evaluates the model’s answer against a rubric, ensuring the agent does not rely on mental math when a script can provide the answer.
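A minimal harness for this might look like the sketch below. All names are assumed; in practice `judge` wraps a model call, while here it is any function scoring an answer 0–1 against one rubric question:

```javascript
// LLM-as-judge harness sketch: score an answer against each rubric
// criterion and pass only if every criterion clears its threshold.
function runEval(judge, answer, rubric) {
  const scores = rubric.map(c => ({ ...c, score: judge(answer, c.question) }));
  return { pass: scores.every(s => s.score >= s.threshold), scores };
}
```

Keeping the judge pluggable means the same harness works with a stub in CI and a real model in production evals.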

Step 6 – Resolver Trigger (Routing Table)

The resolver maps intent patterns to skills. For example:

Trigger pattern "historical calendar" → skill calendar-recall (high priority)

Trigger pattern "what time is" → skill context-now (high priority)
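The routing table above can be sketched as a priority‑ordered list of pattern‑to‑skill entries (the patterns and skill names come from the article; the data structure is an assumption):

```javascript
// Resolver sketch: first matching route wins, so order encodes priority.
const routes = [
  { pattern: /historical calendar|find my .* trip/i, skill: 'calendar-recall' },
  { pattern: /what time is|next meeting/i, skill: 'context-now' },
];

function resolve(intent) {
  const route = routes.find(r => r.pattern.test(intent));
  return route ? route.skill : null;
}
```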

Step 7 – Resolver Evaluation

Resolver eval tests whether the correct skill fires. It catches false negatives (skill not triggered) and false positives (wrong skill chosen). Sample test cases include intents like "who is Pedro Franceschi" → brain-ops and "find my 2016 trip" → calendar-recall.
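One way to sketch such an eval (names assumed): run labelled intents through the resolver and separate "nothing fired" from "wrong skill fired":

```javascript
// Resolver eval sketch: given a resolve(intent) -> skill|null function
// and labelled cases, bucket failures into false negatives (no skill
// fired) and wrong-skill picks (a different skill fired).
function evalResolver(resolve, cases) {
  const failures = { falseNegative: [], wrongSkill: [] };
  for (const { intent, expected } of cases) {
    const got = resolve(intent);
    if (got === expected) continue;
    (got === null ? failures.falseNegative : failures.wrongSkill)
      .push({ intent, expected, got });
  }
  return failures;
}
```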

Step 8 – Check‑Resolvable & DRY Audit

A weekly audit walks the chain AGENTS.md → SKILL.md → script/cron to find unreachable skills. In Garry’s system, 6 of 40 skills were dark (15 % of capability). The audit also ensures no overlapping trigger patterns and that each skill’s script is callable.
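The reachability half of the audit can be sketched over an in‑memory model of the chain (the data shapes are assumptions; the real audit walks the actual files):

```javascript
// A skill is "dark" if AGENTS.md never references it, or if it has no
// callable script. agentsMdSkillRefs: skill names referenced from
// AGENTS.md; skills: [{ name, script }, ...].
function findDarkSkills(agentsMdSkillRefs, skills) {
  const referenced = new Set(agentsMdSkillRefs);
  return skills
    .filter(s => !referenced.has(s.name) || !s.script)
    .map(s => s.name);
}
```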

Step 9 – End‑to‑End Smoke Test

Smoke tests verify the whole pipeline: asking “When do I go to Singapore?” must run calendar-recall.mjs and return the correctly formatted answer; asking “When is my next meeting?” must run context-now.mjs instead of performing mental conversion.

Step 10 – Brain Filing Rules

When a skill writes to the knowledge base, filing rules dictate the target directory (people/, companies/, civic/). Garry found 10 of 13 brain‑writing skills were mis‑filed before adding explicit rules, after which mis‑filing dropped to zero.
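A filing rule of this kind reduces to a lookup that fails loudly on unknown entity types (the directories are from the article; the type names and failure behavior are assumptions):

```javascript
// Map an entry's entity type to its target brain directory.
// Throwing on unknown types forces a new explicit rule instead of a
// silent mis-filing.
const filingRules = { person: 'people/', company: 'companies/', civic: 'civic/' };

function fileTarget(entityType) {
  const dir = filingRules[entityType];
  if (!dir) throw new Error(`no filing rule for entity type: ${entityType}`);
  return dir;
}
```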

Why Skillify Works

Mechanism 1: The model’s latent reasoning writes the deterministic scripts, and those scripts in turn constrain the latent space—a bootstrapping loop that uses the model’s intelligence to enforce its own limits.

Mechanism 2: It shifts reliability from “vibes‑based” (hope the model remembers) to “structurally impossible” (the architecture prevents the error).

Mechanism 3: Verifiability at every layer (unit, integration, LLM eval, resolver eval, smoke test) makes improvement measurable and repeatable.

Key Trade‑offs

Flexibility vs. Determinism: Transactional tasks (e.g., calendar look‑ups) are fully deterministic; creative tasks retain some judgment space.

Skill Count vs. System Complexity: Too many skills cause cognitive overload and routing ambiguity; DRY audits keep the skill set lean.

Skill Lifecycle Management: Skills decay via context‑rot, model evolution, or business drift; regular audits and automated fixes (gbrain doctor --fix) are needed to prune obsolete constraints.

Key Insights

Agent reliability is an engineering discipline, not a model‑capacity issue.

Skillify turns probabilistic fixes (prompt tweaks) into deterministic, test‑driven constraints.

The latent‑deterministic bootstrapping loop is a powerful but under‑used design pattern.

Resolver engineering (routing, de‑duplication, reachability) is the final mile of reliability.

Long‑term success requires continuous audit to combat skill decay and technical debt.

Enjoy!

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

AI agents · testing · reliability engineering · Continuous Improvement · LLM evaluation · skillify
Written by Fighter's World

Live in the future, then build what's missing