From Chat to Autonomous Agents: Architecture, ReAct, Prompt Engineering
This article chronicles the evolution from simple chat interactions to sophisticated autonomous agents, detailing stages of LLM development, ReAct reasoning, memory management, tool integration, and practical implementation using the browser-use project, while offering prompt design insights and future directions for AI agents.
Background
This post is a personal learning summary of agents, aiming to understand how an agent product runs from an engineering perspective.
LLM Understanding Stages
Stage 1: Chat only
Simple text input-output interaction. Prompt engineering (COT, ReAct) improves reasoning; RAG mitigates hallucinations by retrieving external knowledge. Applications include emotional companions, role‑play, copy generation, and Copilot‑style assistance.
Stage 2: Workflow orchestration
Function calls give LLMs stable outputs and tool‑using capabilities. Low‑code platforms like Coze and Dify enable users to compose agents via workflows, reducing development cost.
Stage 3: Agent
Agents perceive and act in environments autonomously. Users describe goals, and the LLM uses tools and context to plan, execute, and generate code in sandboxed environments, greatly boosting productivity.
ReAct Framework
ReAct mirrors human problem solving: Thought → Action → Observation. The typical flow is illustrated below.
Agent Architecture (browser‑use)
Core Components
Agent Core : Coordinates components, manages task flow, and ensures correct communication.
MessageManager : Handles all LLM communication (system prompts, user messages, tool outputs).
Memory : Provides short‑term and long‑term memory, using caching or vector databases.
LLM Interface : Sends/receives messages to the language model.
Controller : Executes browser actions and registers tools.
BrowserContext : Manages browser sessions, DOM operations, and page state.
Execution Flow
The process follows ReAct: generate Thought, call a tool (Action), observe result, and iterate until the goal is reached.
sytem_prompt = {"previousGoal": "...", "memory": "...", "next_goal": "...", "actions": "..."}
tools = Tools()
context = Context(tools)
agent = Agent(sytem_prompt, context)
while not context.finished:
status, actions = agent.run(context)
tools.run(actions)Memory Module
Memory consists of short‑term (recent conversation, tool info) and long‑term (vector store) components. The MessageManager records all messages, and the mem0 framework summarizes history to keep token usage low.
class MessageMetadata:
token_count: int
message_type: str
class ManagedMessage:
content: str
metadata: MessageMetadataPrompt Design & Structured Output
The system prompt defines a strict JSON output schema, includes examples, and uses Pydantic for validation. Example schema:
{
"current_state": {
"evaluation_previous_goal": "...",
"memory": "...",
"next_goal": "..."
},
"action": [{"click_element": {"index": 0}}]
}Three output handling modes are provided:
raw : Parse JSON from raw model output.
functionCall : Use OpenAI‑style function calls.
structured : LangChain’s structured output with Pydantic validation.
Error Handling
If validation fails, the agent captures the error, inserts a message with details, and retries, encouraging the model to correct its format.
if isinstance(error, ValidationError):
return f"Invalid model output format. Details: {str(error)}"Tool Registration & Invocation
Tools are registered in a registry with name, description, and parameter schema. During execution, the agent selects a tool based on the model’s output and calls it with assembled arguments.
@tool(name="click_element", description="Click a button", params=ClickParams)
def click_element(params):
# implementation
passMCP Integration
MCP (Model Context Protocol) standardizes tool exposure. Two integration approaches are discussed:
System‑prompt method: expose MCP tools via a local use_mcp tool.
function‑call method: wrap MCP tools as regular function‑call tools.
Integrating MCP into browser‑use would replace local tool registration with remote MCP services, allowing dynamic tool discovery.
Coze Space
Coze Space (launched April 19) offers three core capabilities: task automation, expert‑agent ecosystem, and MCP integration. It supports two agent modes:
Exploration mode : Interleaved planning and execution (similar to ReAct).
Planning mode : High‑level plan first, then execute sub‑tasks sequentially.
Both modes improve flexibility and user interaction.
Conclusion & Outlook
The article reflects on current agent designs, proposes future features such as self‑planning, hierarchical planning, rethink mechanisms, and better human‑AI interaction. It emphasizes the importance of stable output, distributed memory, multi‑model compatibility, and prompt security for sustainable agent deployment.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
