A Practical Guide to Evaluating Agent Skills
This article explains why many Agent Skills are released without testing, defines measurable success criteria, and presents a lightweight evaluation framework—including prompt set creation, deterministic checks, optional LLM‑based qualitative checks, and best‑practice recommendations—demonstrated by improving a Gemini Interactions API skill from 66.7% to 100% pass rate.
1. What are Agent Skills
Agent Skills are folders containing instructions, scripts and resources that extend an agent's capabilities without retraining. They follow a progressive disclosure model and must contain at least a SKILL.md file. The folder consists of Frontmatter (YAML name/description), Body (Markdown guide), and optional Resources (scripts/, examples/, references/).
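A minimal SKILL.md might look like the following sketch (the name, description, and body text are hypothetical, written for the Gemini Interactions API example used throughout this article):

```markdown
---
name: gemini-interactions-api
description: Generate code that calls the Gemini Interactions API using the current SDK and model IDs.
---

# Gemini Interactions API

Use `from google import genai` and call `interactions.create()`.
Do not use the deprecated `generateContent` method or retired model IDs.
```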
2. Define Success Criteria before writing a Skill
Success is expressed in measurable terms: Result (does the skill produce usable output such as compiled code, a rendered image, or a valid API response), Style & Instructions (correct SDK import, current model ID, naming conventions, required format), and Efficiency (tokens, time, retries). For the Gemini Interactions API skill the checks include a correct SDK import (from google import genai), rejection of deprecated models (such as gemini-2.0-flash), and use of interactions.create() instead of generateContent.
3. Evaluation Framework – Practical Steps
3.1 Create a Prompt Set
Start with 10‑20 prompts per skill, each targeting a specific scenario and declaring its own expected_checks. Example JSON prompt objects are shown below.
[
  {
    "id": "py_basic_generation",
    "prompt": "Write a Python script that sends a text prompt to Gemini and prints the response.",
    "language": "python",
    "should_trigger": true,
    "expected_checks": ["correct_sdk", "no_old_sdk", "current_model", "interactions_api"]
  },
  {
    "id": "py_deprecated_model",
    "prompt": "Write a Python script using Gemini 2.0 Flash with the Interactions API.",
    "language": "python",
    "should_trigger": true,
    "expected_checks": ["correct_sdk", "interactions_api", "deprecated_model_rejected"]
  },
  {
    "id": "negative_unrelated",
    "prompt": "Write a Python script that reads a CSV and plots a bar chart using matplotlib.",
    "language": "python",
    "should_trigger": false,
    "expected_checks": []
  }
]
3.2 Run the Agent and Capture Output
Invoke the skill via CLI (e.g., gemini -m gemini-3-flash-preview --output-format json -p "prompt") and parse the JSON response.
import json
import subprocess
from dataclasses import dataclass

@dataclass
class CLIOutput:
    response_text: str
    stats: dict
    exit_code: int

def run_gemini_cli(prompt):
    cmd = [
        "gemini",
        "-m",
        "gemini-3-flash-preview",
        "--output-format",
        "json",
        "--yolo",
        "-p",
        prompt,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=3600)
    data = json.loads(result.stdout.strip())
    return CLIOutput(
        response_text=data.get("response", ""),
        stats=data.get("stats", {}),
        exit_code=result.returncode,
    )
3.3 Write Deterministic Checks
Each check is a small function that uses regular expressions to validate the extracted code and returns a boolean.
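The checks operate on code extracted from the agent's response. The article does not show its extract_code_blocks helper, but a minimal sketch (assuming the response uses fenced Markdown code blocks) could look like this:

```python
import re

def extract_code_blocks(response_text):
    """Concatenate the contents of all fenced Markdown code blocks."""
    # Match ```lang ... ``` fences non-greedily, spanning multiple lines.
    blocks = re.findall(r"```[\w+-]*\n(.*?)```", response_text, re.DOTALL)
    return "\n".join(blocks)
```

Returning one concatenated string keeps the check functions simple: each regex check can scan the whole extracted code at once.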
import re

# Does the code import the correct SDK?
def check_correct_sdk(code, language):
    if language == "python":
        return bool(re.search(r"from\s+google\s+import\s+genai", code))
    return bool(re.search(r"['\"]@google/genai['\"]", code))

# Does the code avoid deprecated models?
DEPRECATED_MODELS = ["gemini-2.0-flash", "gemini-1.5-pro", "gemini-1.5-flash"]

def check_current_model(code, language):
    return not any(model in code for model in DEPRECATED_MODELS)
3.4 (Optional) Add LLM‑based Qualitative Checks
When structural or design quality cannot be captured by regex, a second‑stage LLM can grade the output using a typed schema.
from pydantic import BaseModel, Field

class CheckResult(BaseModel):
    passed: bool
    notes: str = Field(description="Brief explanation of the assessment.")

class DesignEvalResult(BaseModel):
    overall_pass: bool
    score: int = Field(ge=0, le=100)
    typography: CheckResult = Field(description="Uses distinctive fonts, avoids generic choices like Inter/Arial/Roboto.")
    color_cohesion: CheckResult = Field(description="Cohesive palette with CSS variables, no timid evenly‑distributed colors.")
    layout: CheckResult = Field(description="Intentional spatial composition: asymmetry, overlap, or bold grid choices.")
    generic_ai_avoidance: CheckResult = Field(description="No purple‑gradient‑on‑white, no cookie‑cutter patterns.")

The evaluation loop registers all checks, runs each test case, and aggregates results.
CHECK_REGISTRY = {
    "correct_sdk": check_correct_sdk,
    "current_model": check_current_model,
    "interactions_api": check_interactions_api,
    "no_old_sdk": check_no_old_sdk,
    # ... 11 checks in total
}

def run_eval(test_case):
    output = run_gemini_cli(test_case["prompt"])
    code = extract_code_blocks(output.response_text)
    results = {}
    for check_id in test_case["expected_checks"]:
        results[check_id] = CHECK_REGISTRY[check_id](code, test_case["language"])
    return results

Applying this framework to the Gemini Interactions API skill raised the pass rate from 66.7% to 100%. The two most effective fixes were rewriting the skill description to better match user intent and replacing passive deprecation warnings with explicit commands; the description change alone fixed five out of seven failures.
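Aggregating the per-check booleans into a pass rate like the figures quoted above is straightforward; a sketch (the helper name and input shape are assumptions, matching the dicts run_eval returns):

```python
def pass_rate(all_results):
    """Fraction of individual checks that passed, across all test cases.

    all_results: list of dicts mapping check_id -> bool,
    one dict per test case (as returned by run_eval).
    """
    outcomes = [passed for results in all_results for passed in results.values()]
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)
```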
4. Best‑Practice Checklist
Start from a precise Skill name and description; vague descriptions lead to missed or spurious triggers.
Use explicit commands (e.g., interactions.create()) rather than ambiguous instructions.
Include negative tests to ensure over‑broad Skills do not fire on unrelated prompts.
Begin with a small prompt set (10‑20) and expand from real failure reports.
Evaluate the result, not the execution path; reward correct outcomes even if the path differs.
Isolate each run in a clean environment to avoid context bleed.
Run each prompt multiple times (3‑5) because agent behavior is nondeterministic.
Test the same Skill across different agent frameworks if applicable.
Upgrade tests from capability to regression once coverage approaches 100%.
Detect skill retirement by testing after removal; if it still passes, the model has internalized the capability.
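Because agent behavior is nondeterministic, the repeated-runs item above combines naturally with the per-check results into a small harness; a sketch assuming the run_eval function from section 3 (the repeat count and majority rule are illustrative choices, not from the article):

```python
from collections import Counter

def run_eval_repeated(test_case, run_eval, runs=3):
    """Run one test case several times and majority-vote each check."""
    votes = Counter()
    for _ in range(runs):
        for check_id, passed in run_eval(test_case).items():
            votes[check_id] += 1 if passed else 0
    # A check passes overall only if it passed in a strict majority of runs.
    return {check_id: count > runs / 2 for check_id, count in votes.items()}
```

Majority voting smooths over one-off flakes while still surfacing checks that fail more often than they pass.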
5. Further Reading
Demystifying Evals for AI Agents
Improving Skill‑Creator
Testing Agent Skills Systematically with Evals
Evaluating Deep Agents
SkillsBench