Mastering AI Agent Reflection: The Generate‑Reflect‑Refine Loop
This article explains the Reflection design pattern for AI agents: a three-step generate-reflect-refine cycle that iteratively improves a model's output. It walks through both a simple two-call implementation and a structured class-based version, and closes with practical tips and references to the original research.
What is Reflection?
Reflection is a design pattern that enables an AI agent to "self-review" its output, identify shortcomings, and iteratively improve the result.
The core idea is analogous to a programmer writing code, running it, encountering errors, and then fixing the code. In the AI context the steps are:
Generate: the model produces an initial answer.
Reflect: the model examines its own answer for issues with completeness, accuracy, clarity, and so on.
Refine: based on the reflection, the model generates a better answer.
This loop can repeat until a quality threshold is reached.
Simple two‑call implementation
```python
def reflection_agent(query, max_iterations=3):
    # `llm` is assumed to be a client whose .chat() takes a prompt string
    # and returns the model's reply as a string.
    # First generation
    initial_response = llm.chat(f"Please answer: {query}")
    current_response = initial_response
    for i in range(max_iterations):
        # Reflection prompt
        reflection_prompt = f"""
Your previous answer was: {current_response}
Please reflect on this answer along the following dimensions:
1. Does it answer the question completely?
2. Is any important information missing?
3. Is the logic clear?
4. Is there anything that could be improved?
If there is room for improvement, point out the specific problems.
If the answer is already good, reply "no improvement needed".
"""
        reflection = llm.chat(reflection_prompt)
        if "no improvement needed" in reflection.lower():
            break
        # Refine based on the reflection
        refine_prompt = f"""
Original answer: {current_response}
Reflection feedback: {reflection}
Please regenerate a better answer based on the reflection feedback.
"""
        current_response = llm.chat(refine_prompt)
    return current_response
```
The loop follows the generate → reflect → refine cycle until the model reports no further improvements or the iteration budget is exhausted.
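To exercise the loop end to end without a real model, here is a minimal sketch using a stubbed client; the `MockLLM` class and its canned replies are illustrative assumptions, not part of the original article.

```python
class MockLLM:
    """Illustrative stub standing in for a real LLM client (assumption)."""
    def __init__(self):
        self.calls = 0

    def chat(self, prompt):
        self.calls += 1
        if "Please reflect on this answer" in prompt:
            # The first reflection criticizes; the next one is satisfied.
            if self.calls >= 4:
                return "no improvement needed"
            return "The answer is incomplete: it never states the stopping condition."
        return f"Draft answer #{self.calls}"


llm = MockLLM()
print(reflection_agent("What is the Reflection pattern?"))
# One reflect-refine round runs, then the stub reports "no improvement needed".
```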
Structured reflection with a class
```python
import json


class ReflectionAgent:
    def __init__(self, llm):
        self.llm = llm

    def generate(self, task):
        """First step: generate the initial result."""
        prompt = f"Please complete the following task:\n{task}"
        return self.llm.chat(prompt)

    def reflect(self, task, output):
        """Second step: structured reflection."""
        reflection_prompt = f"""
Task: {task}
Your output: {output}
Please reflect along the following dimensions and respond in JSON format:
{{
    "completeness": <integer score 1-5>,
    "accuracy": <integer score 1-5>,
    "clarity": <integer score 1-5>,
    "issues": ["issue 1", "issue 2"],
    "suggestions": ["suggestion 1", "suggestion 2"],
    "needs_improvement": true/false
}}
"""
        response = self.llm.chat(reflection_prompt)
        # NOTE: production code should guard against non-JSON replies here.
        return json.loads(response)

    def refine(self, task, output, reflection):
        """Third step: improve based on the reflection."""
        refine_prompt = f"""
Task: {task}
Current output: {output}
Reflection feedback:
- Completeness score: {reflection['completeness']}
- Accuracy score: {reflection['accuracy']}
- Issues found: {', '.join(reflection['issues'])}
- Suggestions: {', '.join(reflection['suggestions'])}
Please regenerate a better answer.
"""
        return self.llm.chat(refine_prompt)

    def run(self, task, max_iterations=3, quality_threshold=4):
        """Execute the full generate-reflect-refine pipeline."""
        output = self.generate(task)
        for i in range(max_iterations):
            reflection = self.reflect(task, output)
            scores = [reflection['completeness'], reflection['accuracy'],
                      reflection['clarity']]
            avg_score = sum(scores) / len(scores)
            if not reflection['needs_improvement'] or avg_score >= quality_threshold:
                print(f"Quality threshold met, stopping. Average score: {avg_score}")
                break
            print(f"Refinement round {i+1}, current average score: {avg_score}")
            output = self.refine(task, output, reflection)
        return output
```
This version adds a quantitative scoring system, letting the loop stop automatically once a predefined quality threshold is met and avoiding wasted tokens.
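As a quick smoke test, the class can be driven by a similar stub; the scripted JSON replies below are illustrative assumptions, chosen so the loop refines once and then stops.

```python
class MockStructuredLLM:
    """Illustrative stub that returns scripted replies (assumption)."""
    def __init__(self):
        self.reflections = iter([
            '{"completeness": 3, "accuracy": 4, "clarity": 3,'
            ' "issues": ["too brief"], "suggestions": ["add an example"],'
            ' "needs_improvement": true}',
            '{"completeness": 5, "accuracy": 5, "clarity": 4,'
            ' "issues": [], "suggestions": [], "needs_improvement": false}',
        ])

    def chat(self, prompt):
        if "respond in JSON format" in prompt:
            return next(self.reflections)
        return "A draft answer about the Reflection pattern."


agent = ReflectionAgent(MockStructuredLLM())
final = agent.run("Explain the Reflection pattern")
# Round 1 averages 3.33 -> refine; round 2 averages 4.67 -> stop.
```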
Practical tips
Make reflection prompts specific: ask the model to evaluate completeness, accuracy, and clarity rather than issuing a vague "check it".
Set stopping conditions: define a quality threshold so the agent stops iterating once the score is sufficient.
Keep intermediate outputs: recording each round's result helps with debugging and makes the improvement visible (a sketch of these two tips follows this list).
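As a minimal sketch of the last two tips, the variant below keeps every round's output and score in a `history` list; the function name and record structure are illustrative assumptions layered on the `ReflectionAgent` class above.

```python
def run_with_history(agent, task, max_iterations=3, quality_threshold=4):
    """Like ReflectionAgent.run, but records each round for inspection."""
    history = []
    output = agent.generate(task)
    for i in range(max_iterations):
        reflection = agent.reflect(task, output)
        avg_score = (reflection['completeness'] + reflection['accuracy']
                     + reflection['clarity']) / 3
        history.append({"round": i + 1, "score": avg_score, "output": output})
        if not reflection['needs_improvement'] or avg_score >= quality_threshold:
            break  # stopping condition: good enough, stop spending tokens
        output = agent.refine(task, output, reflection)
    return output, history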
Trivia
Reflection is not new: self-evaluation has long been a core mechanism in reinforcement learning; the pattern simply transfers it to large language models.
Origin paper: Shinn et al. (2023) introduced the concept in "Reflexion: Language Agents with Verbal Reinforcement Learning".
LangGraph support: the langgraph library can model the generate-reflect-refine loop as a directed graph, removing the need to write explicit loops (see the sketch after this list).
Different from Self-Consistency: Self-Consistency generates multiple answers and selects the best (breadth), whereas Reflection repeatedly polishes a single answer (depth).
Model strength matters: GPT-4 reflects far more effectively than GPT-3.5, and forcing a weak model to reflect can actually degrade its output.
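As a sketch of the LangGraph point above, the same loop can be expressed as a small state graph. This is a hedged illustration: the node functions, state fields, and reuse of the earlier `llm` stub are assumptions, and the snippet targets langgraph's `StateGraph` interface.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END


class ReflectionState(TypedDict, total=False):
    task: str
    output: str
    needs_improvement: bool
    rounds: int


def generate(state: ReflectionState) -> dict:
    # (Re)generate an answer; `llm` is the stub client from earlier.
    return {"output": llm.chat(f"Please complete: {state['task']}"),
            "rounds": state.get("rounds", 0) + 1}


def reflect(state: ReflectionState) -> dict:
    critique = llm.chat(f"Critique this answer: {state['output']}")
    return {"needs_improvement": "no improvement needed" not in critique.lower()}


def route(state: ReflectionState) -> str:
    # Loop back to generate while improvement is needed, up to 3 rounds.
    return "generate" if state["needs_improvement"] and state["rounds"] < 3 else END


builder = StateGraph(ReflectionState)
builder.add_node("generate", generate)
builder.add_node("reflect", reflect)
builder.set_entry_point("generate")
builder.add_edge("generate", "reflect")
builder.add_conditional_edges("reflect", route)
graph = builder.compile()
# graph.invoke({"task": "Explain the Reflection pattern"})
```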
References
Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (2023) – https://arxiv.org/abs/2303.11366
LangChain Reflection documentation – https://python.langchain.com/docs/use_cases/code_understanding
Andrew Ng, "AI Agentic Design Patterns" – https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/