A Practical Guide to Evaluating Agent Skills

This article explains why many Agent Skills are released without testing, defines measurable success criteria, and presents a lightweight evaluation framework—including prompt set creation, deterministic checks, optional LLM‑based qualitative checks, and best‑practice recommendations—demonstrated by improving a Gemini Interactions API skill from 66.7% to 100% pass rate.

1. What are Agent Skills?

Agent Skills are folders containing instructions, scripts, and resources that extend an agent's capabilities without retraining. They follow a progressive disclosure model and must contain at least a SKILL.md file. The SKILL.md consists of Frontmatter (YAML name/description) and a Body (Markdown guide); the folder may also hold optional Resources (scripts/, examples/, references/). A minimal layout is sketched below.
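As an illustration (the skill name and description here are hypothetical, not taken from a published skill), a skill folder and the top of its SKILL.md might look like this:

my-skill/
  SKILL.md        # required: YAML frontmatter + Markdown body
  scripts/        # optional helper scripts
  examples/       # optional worked examples
  references/     # optional background documents

---
name: gemini-interactions-api
description: Use when writing code that calls the Gemini API, so the agent picks the current SDK, model ID, and interactions.create() pattern.
---
(Body: Markdown guidance the agent loads only when the skill triggers.)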

2. Define Success Criteria before writing a Skill

Success is expressed in measurable terms: Result (does the skill produce usable output such as compiled code, a rendered image, or a valid API response), Style & Instructions (correct SDK import, current model ID, naming conventions, required format), and Efficiency (tokens, time, retries). For the Gemini Interactions API skill the checks include the correct SDK import (from google import genai), avoidance of deprecated models (such as gemini-2.0-flash), and use of interactions.create() instead of generateContent.

3. Evaluation Framework – Practical Steps

3.1 Create a Prompt Set

Start with 10‑20 prompts per skill, each targeting a specific scenario and declaring its own expected_checks. Example JSON prompt objects are shown below.

[
  {
    "id": "py_basic_generation",
    "prompt": "Write a Python script that sends a text prompt to Gemini and prints the response.",
    "language": "python",
    "should_trigger": true,
    "expected_checks": ["correct_sdk", "no_old_sdk", "current_model", "interactions_api"]
  },
  {
    "id": "py_deprecated_model",
    "prompt": "Write a Python script using Gemini 2.0 Flash with the Interactions API.",
    "language": "python",
    "should_trigger": true,
    "expected_checks": ["correct_sdk", "interactions_api", "deprecated_model_rejected"]
  },
  {
    "id": "negative_unrelated",
    "prompt": "Write a Python script that reads a CSV and plots a bar chart using matplotlib.",
    "language": "python",
    "should_trigger": false,
    "expected_checks": []
  }
]

3.2 Run the Agent and Capture Output

Invoke the skill via the CLI, for example:

gemini -m gemini-3-flash-preview --output-format json -p "prompt"

Then parse the JSON response.

import json
import subprocess
from dataclasses import dataclass

@dataclass
class CLIOutput:
    """Parsed result of one CLI run."""
    response_text: str
    stats: dict
    exit_code: int

def run_gemini_cli(prompt):
    cmd = [
        "gemini",
        "-m",
        "gemini-3-flash-preview",
        "--output-format",
        "json",
        "--yolo",  # auto-approve tool actions so the run is non-interactive
        "-p",
        prompt,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=3600)
    data = json.loads(result.stdout.strip())
    return CLIOutput(
        response_text=data.get("response", ""),
        stats=data.get("stats", {}),
        exit_code=result.returncode,
    )

3.3 Write Deterministic Checks

Each check is a small function that uses regular expressions to validate the extracted code and returns a boolean.

import re

# Does the code import the correct SDK?
def check_correct_sdk(code, language):
    if language == "python":
        return bool(re.search(r"from\s+google\s+import\s+genai", code))
    # JavaScript/TypeScript: expect the @google/genai package
    return bool(re.search(r"['\"]@google/genai['\"]", code))

# Does the code avoid deprecated models?
DEPRECATED_MODELS = ["gemini-2.0-flash", "gemini-1.5-pro", "gemini-1.5-flash"]

def check_current_model(code, language):
    # `language` is unused here but kept so every check shares the same signature
    return not any(model in code for model in DEPRECATED_MODELS)
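The remaining checks in the registry below follow the same pattern. As a sketch, here are two more, assuming the conventions the skill enforces (interactions.create() for the new API) and the legacy package names (import google.generativeai in Python, @google/generative-ai in JavaScript):

# Does the code call the Interactions API instead of generateContent?
def check_interactions_api(code, language):
    return bool(re.search(r"interactions\.create\(", code))

# Does the code avoid importing the legacy SDK?
def check_no_old_sdk(code, language):
    if language == "python":
        return not re.search(r"import\s+google\.generativeai", code)
    return not re.search(r"['\"]@google/generative-ai['\"]", code)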

3.4 (Optional) Add LLM‑based Qualitative Checks

When structural or design quality cannot be captured by regex, a second‑stage LLM can grade the output using a typed schema.

from pydantic import BaseModel, Field

class CheckResult(BaseModel):
    passed: bool
    notes: str = Field(description="Brief explanation of the assessment.")

class DesignEvalResult(BaseModel):
    overall_pass: bool
    score: int = Field(ge=0, le=100)
    typography: CheckResult = Field(description="Uses distinctive fonts, avoids generic choices like Inter/Arial/Roboto.")
    color_cohesion: CheckResult = Field(description="Cohesive palette with CSS variables, no timid evenly‑distributed colors.")
    layout: CheckResult = Field(description="Intentional spatial composition — asymmetry, overlap, or bold grid choices.")
    generic_ai_avoidance: CheckResult = Field(description="No purple‑gradient‑on‑white, no cookie‑cutter patterns.")
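Wiring this schema into a grader can be as small as the sketch below; grade_with_llm is a hypothetical helper standing in for whatever structured-output client call you use, as long as it returns raw JSON text matching DesignEvalResult:

def run_design_eval(code: str) -> DesignEvalResult:
    # Ask a second-stage LLM to grade the generated output against the rubric.
    # grade_with_llm is a placeholder for any structured-output call.
    raw_json = grade_with_llm(
        rubric="Grade this output against the design rubric.",
        code=code,
        schema=DesignEvalResult,
    )
    # Pydantic validates structure and types, so a malformed grade fails loudly.
    return DesignEvalResult.model_validate_json(raw_json)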

The evaluation loop registers all checks, runs each test case, and aggregates results.

CHECK_REGISTRY = {
    "correct_sdk": check_correct_sdk,
    "current_model": check_current_model,
    "interactions_api": check_interactions_api,
    "no_old_sdk": check_no_old_sdk,
    # ... total 11 checks
}

def run_eval(test_case):
    output = run_gemini_cli(test_case["prompt"])
    # extract_code_blocks pulls the code out of the agent's Markdown response (helper not shown)
    code = extract_code_blocks(output.response_text)
    results = {}
    for check_id in test_case["expected_checks"]:
        results[check_id] = CHECK_REGISTRY[check_id](code, test_case["language"])
    return results
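Aggregating those per-check booleans into a suite-level pass rate (the figure reported next) takes only a few more lines; a minimal sketch, assuming test_cases is the prompt set from 3.1:

def run_suite(test_cases):
    total, passed = 0, 0
    for case in test_cases:
        for check_id, ok in run_eval(case).items():
            total += 1
            passed += int(ok)
            if not ok:
                print(f"FAIL {case['id']}: {check_id}")
    # Negative cases (should_trigger: false) declare no expected checks here;
    # verifying that the skill did not fire needs its own check, not shown.
    print(f"Pass rate: {passed}/{total} ({100 * passed / max(total, 1):.1f}%)")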

Applying this framework to the Gemini Interactions API skill raised the pass rate from 66.7% to 100%. The two most effective fixes were rewriting the skill description to better match user intent and replacing passive deprecation warnings with explicit commands; the description change alone fixed five out of seven failures.

4. Best‑Practice Checklist

Start from a precise Skill name and description; vague descriptions lead to missed or spurious triggers.

Use explicit commands (e.g., interactions.create()) rather than ambiguous instructions.

Include negative tests to ensure over‑broad Skills do not fire on unrelated prompts.

Begin with a small prompt set (10‑20) and expand from real failure reports.

Evaluate the result, not the execution path; reward correct outcomes even if the path differs.

Isolate each run in a clean environment to avoid context bleed.

Run each prompt multiple times (3‑5) because agent behavior is nondeterministic; a small repetition sketch follows this checklist.

Test the same Skill across different agent frameworks if applicable.

Upgrade tests from capability to regression once coverage approaches 100%.

Detect skill retirement by testing after removal; if it still passes, the model has internalized the capability.
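For the repeated-runs item above, a small wrapper around run_eval from the evaluation loop makes flaky checks visible; a minimal sketch:

from collections import Counter

def run_repeated(test_case, repeats=3):
    # Tally how often each expected check passes across repeated runs.
    tallies = Counter()
    for _ in range(repeats):
        for check_id, ok in run_eval(test_case).items():
            tallies[check_id] += int(ok)
    # A check that only sometimes passes points at a flaky or ambiguous instruction.
    return {check_id: passes / repeats for check_id, passes in tallies.items()}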

5. Further Reading

Demystifying Evals for AI Agents

Improving Skill‑Creator

Testing Agent Skills Systematically with Evals

Evaluating Deep Agents

SkillsBench
