Can Your AI Agent Earn a College Degree? Exploring Clawvard’s Evaluation Platform
The author explores Clawvard, an AI‑agent assessment platform that tests agents across eight dimensions. Personal test results show an initial A‑ rating with a critical retrieval weakness; after applying customized improvement rules, a retest earned an A+. The post also discusses the platform's limits and practical use cases.
What is Clawvard?
Clawvard (虾佛大学) is an AI‑Agent assessment platform that evaluates agents on eight dimensions—Understanding, Execution, Retrieval, Reasoning, Reflection, Tooling, EQ, and Memory—via 16 questions (two per dimension).
First Exam (baseline)
The baseline exam was run in Claude Code with the GLM‑4.7 model, starting with the command:
```
Read https://clawvard.school/skill.md
# then follow the prompts to begin the test
```

Result: rating A‑, above 54 % of agents, total score 84.4/100. Detailed scores: Retrieval 30/100; Execution 80/100 (token‑bucket implementation truncated); Tooling 80/100 (missing idempotency checks).
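The report doesn't reproduce the Execution task itself, so purely for reference, here is a minimal generic token‑bucket sketch in Python; the class, parameters, and rates are illustrative assumptions, not the exam's specification.

```python
import time

class TokenBucket:
    """Generic token-bucket rate limiter (illustrative, not the exam's exact spec)."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start with a full bucket
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Credit tokens for elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the request."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: allow bursts of up to 10 requests, sustained 5 requests/second
bucket = TokenBucket(capacity=10, refill_rate=5)
print(bucket.allow())  # True while tokens remain
```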
The report also flagged the regex and search‑strategy tasks. Regular‑expression task: the phone and semver patterns were mostly correct with minor issues; the email pattern was reasonable but incomplete (missing examples and edge‑case analysis). Search‑strategy task: wrong; the search should start from the narrowest time window (deployment logs), not from error logs.
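The exact pattern requirements aren't shown in the report, but hypothetical Python versions of the three patterns, with the kind of explicit examples and edge‑case checks the report said were missing, might look like this:

```python
import re

# Illustrative patterns only; the exam's exact requirements aren't shown.

# US-style phone number, e.g. "555-123-4567" or "(555) 123-4567"
PHONE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

# Simplified semver: MAJOR.MINOR.PATCH with an optional pre-release tag
SEMVER = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$")

# Pragmatic, deliberately incomplete email check; full RFC 5322 is far stricter
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Explicit examples and edge cases, the kind the report said were missing
assert PHONE.match("(555) 123-4567")
assert SEMVER.match("1.2.3-beta.1")
assert EMAIL.match("user@example.com")
assert not EMAIL.match("user@@example.com")  # double @ must be rejected
```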
Personalized improvement rules
Retrieval
When searching for information:
1. Use specific keywords, not vague descriptions
2. Search with exact identifiers (function names, error codes)
3. Read file structure before diving into contents
4. Verify information from multiple sources
5. Cite your sources
Execution
When completing tasks:
1. Break into small, verifiable steps
2. After each step, verify the output before proceeding
3. Never leave tasks half‑done
4. Run tests or checks when applicable
5. Confirm completion explicitly
Tooling
When using tools:
1. Verify the tool exists before calling it
2. Check documentation for correct usage
3. Handle errors gracefully — don’t crash on tool failures
4. Validate tool output before using it
5. Follow security best practices
These rules were written into the agent's memory system so they are applied automatically in future interactions.
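To make the Tooling rules concrete, here is a hypothetical Python wrapper sketch; the registry argument, key scheme, and return shape are illustrative assumptions, not part of Clawvard or any specific agent framework.

```python
import hashlib
import json

# Hypothetical sketch of the Tooling rules above; not a real framework's API.
_seen_calls: set[str] = set()

def call_tool(registry: dict, name: str, **kwargs):
    # Rule 1: verify the tool exists before calling it
    tool = registry.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}

    # Idempotency check: suppress a repeat call with identical arguments
    key = hashlib.sha256(json.dumps([name, kwargs], sort_keys=True).encode()).hexdigest()
    if key in _seen_calls:
        return {"ok": False, "error": "duplicate call suppressed"}
    _seen_calls.add(key)

    # Rule 3: handle errors gracefully rather than crashing the agent
    try:
        result = tool(**kwargs)
    except Exception as exc:
        return {"ok": False, "error": str(exc)}

    # Rule 4: validate output before using it
    if result is None:
        return {"ok": False, "error": "tool returned no output"}
    return {"ok": True, "result": result}
```

A production agent would scope the idempotency key more carefully (for example, only for non‑read‑only tools), but the shape of the checks follows rules 1, 3, and 4 above plus the idempotency gap the first exam flagged.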
Second Exam (after applying rules)
The same model then retook the exam, this time authenticated with a token.
Result: rating A+, above 88 % of agents, total score 91.3/100 (up 6.9 points). Retrieval improved from 30 to 80, the largest single gain.
Model comparison
For comparison, CodeX using GPT‑5.4 scored 92.5 on its first exam, suggesting that different models start from distinct strengths.
Interpretation
The scores reveal specific blind spots (e.g., low retrieval ability).
Improvement suggestions can be encoded as actionable rules.
Re‑testing confirms measurable impact.
The exam is a generic scenario; high scores do not guarantee performance on domain‑specific tasks.
Practical recommendations
Use Clawvard as a diagnostic “health check” for AI agents.
Incorporate the generated improvement rules into the agent's prompt or memory (see the sketch after this list).
Validate changes by re‑running the exam or by testing on real tasks.
Base final model selection on actual task outcomes rather than the numeric rating alone.
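As one concrete option with Claude Code, which loads persistent project instructions from a CLAUDE.md file, a small Python sketch could append the rules from the section above (the file name is Claude Code's convention; adapt the mechanism to whatever memory system your agent uses):

```python
from pathlib import Path

# Rules copied verbatim from the "Personalized improvement rules" section.
RULES = """
## Retrieval rules
1. Use specific keywords, not vague descriptions
2. Search with exact identifiers (function names, error codes)
3. Read file structure before diving into contents
4. Verify information from multiple sources
5. Cite your sources
"""

# Append to the project's CLAUDE.md so the agent picks the rules up automatically.
with Path("CLAUDE.md").open("a", encoding="utf-8") as f:
    f.write(RULES)
```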
Applicable scenarios
Comparing capabilities of different AI models.
Evaluating the effectiveness of AI‑agent workflows.
Testing the impact of prompt‑engineering changes.
Conducting capability “health checks” for team members’ AI usage.
How to run the assessment yourself
```
Read https://clawvard.school/skill.md
# follow the prompts to take the exam and obtain the agent's report
```

Example report URLs:
First exam: https://clawvard.school/verify?exam=exam-227c1516
Second exam: https://clawvard.school/report?id=eval-a33c7445