Mastering AI Agent Reflection: The Generate‑Reflect‑Refine Loop

This article explains the Reflection design pattern for AI agents: a three‑step generate‑reflect‑refine cycle that iteratively improves outputs. It provides both a simple two‑call implementation and a structured class‑based version, and closes with practical tips, background notes, and references to the original research.

Reflection is a design pattern that enables an AI agent to "self‑review" its output, identify shortcomings, and iteratively improve the result.

What is Reflection?

The core idea is analogous to a programmer writing code, running it, encountering errors, and then fixing the code. In the AI context the steps are:

Generate: the model produces an initial answer.

Reflect: the model examines its own answer, checking for problems with completeness, accuracy, clarity, and so on.

Refine: based on the reflection, the model generates a better answer.

This loop can repeat until a quality threshold is reached.

Simple two‑call implementation

def reflection_agent(query, max_iterations=3):
    # `llm` is assumed to be any chat client exposing .chat(prompt) -> str
    # First generation
    initial_response = llm.chat(f"Please answer: {query}")
    current_response = initial_response
    for i in range(max_iterations):
        # Reflection prompt
        reflection_prompt = f"""
        Your previous answer was: {current_response}
        Reflect on this answer along the following dimensions:
        1. Does it answer the question completely?
        2. Is any important information missing?
        3. Is the logic clear?
        4. Is there anything that could be improved?
        If there is room for improvement, point out the specific problems.
        If the answer is already good, reply "NO_IMPROVEMENT_NEEDED".
        """
        reflection = llm.chat(reflection_prompt)
        # Simple (if brittle) sentinel check for the stop signal
        if "NO_IMPROVEMENT_NEEDED" in reflection:
            break
        # Refine based on reflection
        refine_prompt = f"""
        Original answer: {current_response}
        Reflection feedback: {reflection}
        Based on the reflection feedback, generate a better answer.
        """
        current_response = llm.chat(refine_prompt)
    return current_response

The loop follows the generate → reflect → refine cycle until the model reports no further improvements.
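For completeness, here is one way to wire the function above to a real client. This is a minimal sketch assuming the official openai Python SDK; the LLM wrapper class and the model name are illustrative choices, not part of the original snippet.

from openai import OpenAI

class LLM:
    """Thin wrapper exposing the .chat(prompt) -> str interface the examples assume."""
    def __init__(self, model="gpt-4o-mini"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model
    def chat(self, prompt):
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

llm = LLM()
print(reflection_agent("Explain what eventual consistency means in distributed systems."))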

Structured reflection with a class

import json

class ReflectionAgent:
    def __init__(self, llm):
        self.llm = llm
    def generate(self, task):
        """First step: generate the initial result"""
        prompt = f"Please complete the following task:\n{task}"
        return self.llm.chat(prompt)
    def reflect(self, task, output):
        """Second step: structured reflection"""
        reflection_prompt = f"""
        Task: {task}
        Your output: {output}
        Reflect along the following dimensions and respond with raw JSON only:
        {{
            "completeness": <integer score 1-5>,
            "accuracy": <integer score 1-5>,
            "clarity": <integer score 1-5>,
            "issues": ["problem 1", "problem 2"],
            "suggestions": ["suggestion 1", "suggestion 2"],
            "needs_improvement": true/false
        }}
        """
        response = self.llm.chat(reflection_prompt)
        # Assumes the model returns raw JSON; production code should handle
        # parse failures and stray markdown fences.
        return json.loads(response)
    def refine(self, task, output, reflection):
        """Third step: improve based on the reflection"""
        refine_prompt = f"""
        Task: {task}
        Current output: {output}
        Reflection feedback:
        - Completeness score: {reflection['completeness']}
        - Accuracy score: {reflection['accuracy']}
        - Known issues: {', '.join(reflection['issues'])}
        - Suggestions: {', '.join(reflection['suggestions'])}
        Generate a better answer that addresses this feedback.
        """
        return self.llm.chat(refine_prompt)
    def run(self, task, max_iterations=3, quality_threshold=4):
        """Execute the full generate-reflect-refine pipeline"""
        output = self.generate(task)
        for i in range(max_iterations):
            reflection = self.reflect(task, output)
            scores = [reflection['completeness'], reflection['accuracy'], reflection['clarity']]
            avg_score = sum(scores) / len(scores)
            if not reflection['needs_improvement'] or avg_score >= quality_threshold:
                print(f"Quality threshold reached, stopping. Average score: {avg_score}")
                break
            print(f"Refinement round {i+1}, current average score: {avg_score}")
            output = self.refine(task, output, reflection)
        return output

This version adds a quantitative scoring system, allowing the loop to stop automatically when a predefined quality threshold is met, thus avoiding wasteful token consumption.
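A quick usage sketch, reusing the hypothetical LLM wrapper from the earlier example; the task string is only an illustration:

agent = ReflectionAgent(llm)
result = agent.run(
    task="Write a beginner-friendly explanation of database indexing.",
    max_iterations=3,
    quality_threshold=4,
)
print(result)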

Practical tips

Make reflection prompts specific: ask the model to evaluate completeness, accuracy, and clarity rather than issuing a vague "check it".

Set stopping conditions: define a quality threshold so the agent stops iterating once the score is sufficient.

Keep intermediate outputs: logging each round's result helps with debugging and makes the improvement visible (see the sketch after this list).
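One lightweight way to apply the last two tips together is a small wrapper that records every round. The run_with_history helper below is a hypothetical addition, not part of the original article's code; it reuses the ReflectionAgent defined above.

def run_with_history(agent, task, max_iterations=3, quality_threshold=4):
    """Like ReflectionAgent.run, but keeps every intermediate output for inspection."""
    history = []
    output = agent.generate(task)
    for i in range(max_iterations):
        reflection = agent.reflect(task, output)
        # Record each round so you can diff outputs and watch the improvement
        history.append({"round": i + 1, "output": output, "reflection": reflection})
        scores = [reflection['completeness'], reflection['accuracy'], reflection['clarity']]
        if not reflection['needs_improvement'] or sum(scores) / len(scores) >= quality_threshold:
            break  # stopping condition: good enough, spend no further tokens
        output = agent.refine(task, output, reflection)
    return output, history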

Background notes

Reflection is not new: self‑evaluation has long been a core mechanism in reinforcement learning; the pattern simply transfers it to large language models.

Origin paper: Shinn et al. (2023) introduced the concept in "Reflexion: Language Agents with Verbal Reinforcement Learning".

LangGraph support: the langgraph library can model the generate‑reflect‑refine loop as a directed graph, removing the need to write explicit loops (a sketch follows this list).

Different from Self‑Consistency: Self‑Consistency generates multiple answers and selects the best (breadth), whereas Reflection repeatedly polishes a single answer (depth).

Model strength matters: GPT‑4 yields far better reflection results than GPT‑3.5; weaker models may degrade performance when forced to reflect.
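To make the LangGraph point concrete, here is a minimal sketch of the loop as a graph. It assumes langgraph's StateGraph API and reuses the hypothetical llm wrapper from earlier; the node prompts are deliberately simplified placeholders.

from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReflectionState(TypedDict):
    task: str
    output: str
    needs_improvement: bool

def generate(state: ReflectionState) -> dict:
    return {"output": llm.chat(f"Please complete the following task:\n{state['task']}")}

def reflect(state: ReflectionState) -> dict:
    verdict = llm.chat(f"Does this answer need improvement? Reply YES or NO.\n{state['output']}")
    return {"needs_improvement": "YES" in verdict.upper()}

def refine(state: ReflectionState) -> dict:
    return {"output": llm.chat(f"Improve this answer:\n{state['output']}")}

builder = StateGraph(ReflectionState)
builder.add_node("generate", generate)
builder.add_node("reflect", reflect)
builder.add_node("refine", refine)
builder.set_entry_point("generate")
builder.add_edge("generate", "reflect")
# Route back into refinement or finish, based on the reflection verdict
builder.add_conditional_edges(
    "reflect",
    lambda state: "refine" if state["needs_improvement"] else END,
)
builder.add_edge("refine", "reflect")

graph = builder.compile()  # LangGraph's recursion limit caps the loop by default
result = graph.invoke({"task": "Briefly explain the CAP theorem."})
print(result["output"])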

References

Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (2023) – https://arxiv.org/abs/2303.11366

LangChain Reflection documentation – https://python.langchain.com/docs/use_cases/code_understanding

Andrew Ng, "AI Agentic Design Patterns" – https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/
