Can Your AI Agent Earn a College Degree? Exploring Clawvard’s Evaluation Platform
The author explores Clawvard, an AI‑agent assessment platform that tests agents across eight dimensions. Personal test results show an initial A‑ rating with a critical retrieval weakness; after applying customized improvement rules, a retest earned an A+. The post also discusses the platform's limits and practical use cases.
What is Clawvard?
Clawvard (虾佛大学) is an AI‑Agent assessment platform that evaluates agents on eight dimensions—Understanding, Execution, Retrieval, Reasoning, Reflection, Tooling, EQ, and Memory—via 16 questions (two per dimension).
First Exam (baseline)
The baseline exam was run in Claude Code with the GLM‑4.7 model, starting with the command:
```
Read https://clawvard.school/skill.md
# then follow the prompts to begin the test
```

Result: rating A‑, above 54 % of agents, total score 84.4/100. Detailed scores: Retrieval 30/100; Execution 80/100 (token‑bucket implementation truncated); Tooling 80/100 (missing idempotency checks).
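The report doesn't reproduce the Execution task itself, so purely for reference, here is a minimal generic token‑bucket sketch in Python; the class, parameters, and rates are illustrative assumptions, not the exam's specification.

```python
import time

class TokenBucket:
    """Generic token-bucket rate limiter (illustrative, not the exam's exact spec)."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start with a full bucket
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Credit tokens for elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the request."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: allow bursts of up to 10 requests, sustained 5 requests/second
bucket = TokenBucket(capacity=10, refill_rate=5)
print(bucket.allow())  # True while tokens remain
```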
The report also flagged the regex and search‑strategy tasks. Regular‑expression task: the phone and semver patterns were mostly correct with minor issues; the email pattern was reasonable but incomplete (missing examples and edge‑case analysis). Search‑strategy task: wrong; the search should start from the narrowest time window (deployment logs), not from error logs.
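The exact pattern requirements aren't shown in the report, but hypothetical Python versions of the three patterns, with the kind of explicit examples and edge‑case checks the report said were missing, might look like this:

```python
import re

# Illustrative patterns only; the exam's exact requirements aren't shown.

# US-style phone number, e.g. "555-123-4567" or "(555) 123-4567"
PHONE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

# Simplified semver: MAJOR.MINOR.PATCH with an optional pre-release tag
SEMVER = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$")

# Pragmatic, deliberately incomplete email check; full RFC 5322 is far stricter
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Explicit examples and edge cases, the kind the report said were missing
assert PHONE.match("(555) 123-4567")
assert SEMVER.match("1.2.3-beta.1")
assert EMAIL.match("user@example.com")
assert not EMAIL.match("user@@example.com")  # double @ must be rejected
```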
Personalized improvement rules
Retrieval
When searching for information:
1. Use specific keywords, not vague descriptions
2. Search with exact identifiers (function names, error codes)
3. Read file structure before diving into contents
4. Verify information from multiple sources
5. Cite your sources
Execution
When completing tasks:
1. Break into small, verifiable steps
2. After each step, verify the output before proceeding
3. Never leave tasks half‑done
4. Run tests or checks when applicable
5. Confirm completion explicitly
Tooling
When using tools:
1. Verify the tool exists before calling it
2. Check documentation for correct usage
3. Handle errors gracefully — don’t crash on tool failures
4. Validate tool output before using it
5. Follow security best practices
These rules were written into the agent's memory system so they are applied automatically in future interactions.
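To make the Tooling rules concrete, here is a hypothetical Python wrapper sketch; the registry argument, key scheme, and return shape are illustrative assumptions, not part of Clawvard or any specific agent framework.

```python
import hashlib
import json

# Hypothetical sketch of the Tooling rules above; not a real framework's API.
_seen_calls: set[str] = set()

def call_tool(registry: dict, name: str, **kwargs):
    # Rule 1: verify the tool exists before calling it
    tool = registry.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}

    # Idempotency check: suppress a repeat call with identical arguments
    key = hashlib.sha256(json.dumps([name, kwargs], sort_keys=True).encode()).hexdigest()
    if key in _seen_calls:
        return {"ok": False, "error": "duplicate call suppressed"}
    _seen_calls.add(key)

    # Rule 3: handle errors gracefully rather than crashing the agent
    try:
        result = tool(**kwargs)
    except Exception as exc:
        return {"ok": False, "error": str(exc)}

    # Rule 4: validate output before using it
    if result is None:
        return {"ok": False, "error": "tool returned no output"}
    return {"ok": True, "result": result}
```

A production agent would scope the idempotency key more carefully (for example, only for non‑read‑only tools), but the shape of the checks follows rules 1, 3, and 4 above plus the idempotency gap the first exam flagged.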
Second Exam (after applying rules)
The same model then retook the exam, this time authenticated with a token.
Result: rating A+, above 88 % of agents, total score 91.3/100 (up 6.9 points). Retrieval improved from 30 to 80, the largest single gain.
Model comparison
For comparison, CodeX using GPT‑5.4 scored 92.5 on its first exam, suggesting that different models start from distinct strengths.
Interpretation
The scores reveal specific blind spots (e.g., low retrieval ability).
Improvement suggestions can be encoded as actionable rules.
Re‑testing confirms measurable impact.
The exam is a generic scenario; high scores do not guarantee performance on domain‑specific tasks.
Practical recommendations
Use Clawvard as a diagnostic “health check” for AI agents.
Incorporate the generated improvement rules into the agent's prompt or memory (see the sketch after this list).
Validate changes by re‑running the exam or by testing on real tasks.
Base final model selection on actual task outcomes rather than the numeric rating alone.
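As one concrete option with Claude Code, which loads persistent project instructions from a CLAUDE.md file, a small Python sketch could append the rules from the section above (the file name is Claude Code's convention; adapt the mechanism to whatever memory system your agent uses):

```python
from pathlib import Path

# Rules copied verbatim from the "Personalized improvement rules" section.
RULES = """
## Retrieval rules
1. Use specific keywords, not vague descriptions
2. Search with exact identifiers (function names, error codes)
3. Read file structure before diving into contents
4. Verify information from multiple sources
5. Cite your sources
"""

# Append to the project's CLAUDE.md so the agent picks the rules up automatically.
with Path("CLAUDE.md").open("a", encoding="utf-8") as f:
    f.write(RULES)
```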
Applicable scenarios
Comparing capabilities of different AI models.
Evaluating the effectiveness of AI‑agent workflows.
Testing the impact of prompt‑engineering changes.
Conducting capability “health checks” for team members’ AI usage.
How to run the assessment yourself
```
Read https://clawvard.school/skill.md
# follow the prompts to take the exam and obtain the agent's report
```

Example report URLs:
First exam: https://clawvard.school/verify?exam=exam-227c1516
Second exam: https://clawvard.school/report?id=eval-a33c7445