Boost LLM Agent Performance with the Evaluator‑Optimizer Reflection Loop
This article explains the Evaluator‑Optimizer reflection pattern for LLM agents, shows how it can improve output quality in single‑ or multi‑agent tasks, and provides a step‑by‑step PydanticAI implementation with code examples and practical usage tips.
What is the Evaluator‑Optimizer reflection pattern?
Reflection is a workflow where an "enhanced LLM" generates a response and a second LLM acts as an evaluator, providing feedback. The process iterates until the output meets predefined quality criteria, dramatically improving result reliability for tasks such as translation, code generation, or report writing.
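The control flow can be sketched in a few lines of plain Python. The two LLM calls are stubbed out as ordinary functions here (`call_generator` and `call_evaluator` are illustrative names, not part of any library):

```python
# Minimal sketch of the reflection loop; the stubs below stand in
# for real LLM calls and are purely illustrative.
def call_generator(task, feedback=None):
    # A real generator would prompt an LLM; here we just tag revisions.
    return task if feedback is None else task + " (revised)"

def call_evaluator(draft):
    # A real evaluator would grade the draft against quality criteria;
    # here we accept any revised draft.
    if draft.endswith("(revised)"):
        return "PASS", ""
    return "NEEDS_IMPROVEMENT", "please revise"

def reflect(task, max_iterations=3):
    draft = call_generator(task)
    for _ in range(max_iterations):
        verdict, feedback = call_evaluator(draft)
        if verdict == "PASS":
            return draft
        draft = call_generator(task, feedback)
    return draft
```

The loop terminates either on a PASS verdict or after a fixed iteration budget, which is the same shape the PydanticAI implementation below follows.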
Typical use cases
Long‑text translation: self‑review of style, terminology consistency, and coherence.
Code generation: check correctness, complexity, efficiency, style, and documentation.
Copy/report creation: verify format, completeness, structure, and tone.
Possible enhancements
Evaluate from multiple roles or perspectives to obtain richer feedback.
Incorporate external knowledge bases or data sources to augment the evaluation.
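Multi-role evaluation can be sketched by running the evaluator once per persona and merging the feedback. Everything below is a hedged illustration: `review_as` is a hypothetical stand-in for an LLM call whose system prompt adopts the given role, and the per-role checks are placeholders:

```python
# Hedged sketch: evaluate one draft from several reviewer roles and
# aggregate the verdicts. `review_as` stands in for a per-persona LLM call.
def review_as(role, draft):
    # Placeholder heuristics in place of real LLM judgments.
    checks = {
        "security reviewer": "eval(" not in draft,
        "style reviewer": len(draft) < 500,
    }
    if checks.get(role, True):
        return "PASS", ""
    return "NEEDS_IMPROVEMENT", f"{role}: issue found"

def multi_role_evaluate(draft, roles):
    feedback = []
    for role in roles:
        verdict, note = review_as(role, draft)
        if verdict != "PASS":
            feedback.append(note)
    # The draft passes only when every role signs off.
    return ("PASS" if not feedback else "NEEDS_IMPROVEMENT"), feedback
```

The merged feedback list gives the generator richer, role-tagged guidance than a single evaluator would.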
Implementation with PydanticAI
1. Define response models
from pydantic import BaseModel, Field

# Generator response model
class GeneratorResponse(BaseModel):
    thoughts: str = Field(..., description='Your understanding of the task and the feedback, or how you plan to improve.')
    response: str = Field(..., description='The generated solution.')

# Evaluator response model
class EvaluatorResponse(BaseModel):
    thoughts: str = Field(..., description='Your careful, detailed review and assessment of the submission.')
    evaluation: str = Field(..., description='PASS, NEEDS_IMPROVEMENT, or FAIL')
    feedback: str = Field(..., description='What needs improvement and why.')
2. Define prompts and step configuration
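The `evaluation` field in `EvaluatorResponse` is free text, so a model could emit an out-of-vocabulary verdict. One hedged refinement (not in the original article) is to constrain it with `typing.Literal`, assuming Pydantic's standard `Literal` validation:

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class StrictEvaluatorResponse(BaseModel):
    thoughts: str = Field(..., description='Detailed review of the submission.')
    # Only the three expected verdicts validate; anything else is rejected.
    evaluation: Literal["PASS", "NEEDS_IMPROVEMENT", "FAIL"]
    feedback: str = Field(..., description='What to improve and why.')

ok = StrictEvaluatorResponse(thoughts="looks good", evaluation="PASS", feedback="none")

try:
    StrictEvaluatorResponse(thoughts="", evaluation="MAYBE", feedback="")
except ValidationError:
    print("invalid verdict rejected")
```

Since the main loop branches on `evaluation == "PASS"`, failing fast on malformed verdicts keeps the loop from silently running to its iteration limit.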
# Define generator and evaluator steps
steps = {
    "generator": {
        "prompt": """Your goal is to complete the task based on the user input. If your previously generated content received feedback, improve your solution according to that feedback.""",
        "model": model,
        "result_type": GeneratorResponse
    },
    "evaluator": {
        "prompt": """Evaluate the following code implementation, focusing on these aspects:
1. **Correctness**: Does it implement the required functionality exactly as specified, without errors?
2. **Time complexity**: Does the implementation meet the stated time-complexity requirements?
3. **Efficiency**: Is the implementation the most effective, optimized approach for the requirements?
4. **Style and best practices**: Does the code follow standard Python style and best practices?
5. **Readability**: Is the code easy to read and understand?
6. **Documentation**: Is the code clearly documented, including docstrings for all functions and classes and inline comments where needed?
Note: you should only evaluate the code, not attempt to solve the task.
Evaluate the code carefully and strictly, making sure no opportunity for improvement is missed.
If every criterion is fully satisfied and you have no further suggestions, output "PASS". Otherwise output "NEEDS_IMPROVEMENT" or "FAIL" so that the coder can learn and improve.""",
        "model": model,
        "result_type": EvaluatorResponse
    }
}
3. Generator function
async def generate(task: str, context: str = "") -> tuple[str, str]:
    """Generate and improve a solution based on feedback."""
    config = steps["generator"]
    system_prompt = config["prompt"]
    if context:
        system_prompt += f"\n{context}"
    generator_agent = Agent(
        config["model"],
        system_prompt=system_prompt,
        result_type=config["result_type"]
    )
    response = await generator_agent.run(f'Task:\n{task}')
    thoughts = response.data.thoughts
    result = response.data.response
    return thoughts, result
4. Evaluator function
async def evaluate(content: str, task: str) -> tuple[str, str]:
    """Assess whether a solution meets the requirements."""
    config = steps["evaluator"]
    evaluator_agent = Agent(
        config["model"],
        system_prompt=f"{config['prompt']}\nTask:\n{task}",
        result_type=config["result_type"]
    )
    response = await evaluator_agent.run(content)
    evaluation = response.data.evaluation
    feedback = response.data.feedback
    return evaluation, feedback
5. Main loop
async def run(task: str, max_iterations: int = 5):
    memory = []
    iteration = 0
    thoughts, result = await generate(task)
    while iteration < max_iterations:
        evaluation, feedback = await evaluate(result, task)
        memory.append({
            "thoughts": thoughts,
            "result": result,
            "evaluation": evaluation,
            "feedback": feedback
        })
        print(f"\nIteration {iteration + 1}:\n"
              f"Thoughts: {thoughts}\n"
              f"Result: {result}\n"
              f"Evaluation: {evaluation}\n"
              f"Feedback: {feedback}")
        if evaluation == "PASS":
            return result, memory
        context = "\n".join(
            ["Previous attempts:"] +
            [f"- Result: {m['result']}\n  Feedback: {m['feedback']}" for m in memory]
        )
        thoughts, result = await generate(task, context)
        iteration += 1
    return result, memory
Running the loop with a concrete task shows iterative improvements; once the evaluator returns PASS, the final solution and the full reasoning trace are returned.
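The feedback context handed back to the generator can be sketched in isolation. This mirrors the join inside the main loop, with two illustrative memory entries in place of real run history:

```python
# Rebuild the feedback context the same way the run() loop does,
# using illustrative memory entries.
memory = [
    {"result": "draft v1", "feedback": "missing docstrings"},
    {"result": "draft v2", "feedback": "tighten complexity"},
]
context = "\n".join(
    ["Previous attempts:"] +
    [f"- Result: {m['result']}\n  Feedback: {m['feedback']}" for m in memory]
)
print(context)
```

Because every prior attempt is replayed into the generator's system prompt, the context grows with each iteration; for long runs it may be worth truncating memory to the most recent attempts.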
Conclusion
The Evaluator‑Optimizer pattern turns a single LLM call into a self‑refining workflow, works equally well in multi‑agent settings, and can be combined with higher‑level frameworks such as LangGraph or LlamaIndex to build controllable, enterprise‑grade AI systems.
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.