From Chat to Autonomous Agents: Architecture, ReAct, Prompt Engineering

This article chronicles the evolution from simple chat interactions to sophisticated autonomous agents, detailing stages of LLM development, ReAct reasoning, memory management, tool integration, and practical implementation using the browser-use project, while offering prompt design insights and future directions for AI agents.

Alibaba Cloud Developer

Background

This post summarizes my own study of agents, with the aim of understanding how an agent product runs from an engineering perspective.

LLM Understanding Stages

Stage 1: Chat only

Simple text input-output interaction. Prompt engineering techniques such as chain‑of‑thought (CoT) and ReAct improve reasoning; RAG mitigates hallucinations by retrieving external knowledge. Applications include emotional companions, role‑play, copy generation, and Copilot‑style assistance.

Stage 2: Workflow orchestration

Function calling gives LLMs more stable, structured output and the ability to use tools. Low‑code platforms like Coze and Dify let users compose agents from workflows, reducing development cost.
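
As a rough illustration of why function calls stabilize output: the model emits a tool name plus JSON arguments that match a declared schema, and the host program dispatches the call. The sketch below assumes an OpenAI‑style tool description; get_weather, HANDLERS, and the canned model reply are hypothetical stand‑ins, not part of any platform mentioned here.

```python
import json

# Hypothetical tool, described in the OpenAI-style "function" format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city: str) -> str:
    # Stub implementation; a real tool would call a weather service.
    return f"Sunny in {city}"


# Map tool names to Python callables.
HANDLERS = {"get_weather": get_weather}


def dispatch(tool_call: dict) -> str:
    """Run the function named in a model-produced tool call."""
    name = tool_call["name"]
    args = json.loads(tool_call["arguments"])  # arguments arrive as a JSON string
    return HANDLERS[name](**args)


# A canned reply standing in for what a model might return.
fake_call = {"name": "get_weather", "arguments": '{"city": "Hangzhou"}'}
result = dispatch(fake_call)
```

Because the arguments must parse as JSON and match the declared parameters, malformed replies fail loudly at the dispatch step instead of leaking into later workflow nodes.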

Stage 3: Agent

Agents perceive and act in environments autonomously. Users describe goals, and the LLM uses tools and context to plan, execute, and generate code in sandboxed environments, greatly boosting productivity.

ReAct Framework

ReAct mirrors human problem solving: Thought → Action → Observation. The typical flow is illustrated below.

ReAct flow diagram
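
The Thought → Action → Observation cycle can be sketched in a few lines. This is a minimal illustration, assuming a text format of "Action: name[args]" and "Final Answer: ..."; the scripted fake_llm and the multiply tool are hypothetical and stand in for a real model and real tools.

```python
import re

# Scripted replies standing in for a real LLM (hypothetical).
SCRIPT = [
    "Thought: I need the product of 6 and 7.\nAction: multiply[6, 7]",
    "Thought: I have the result.\nFinal Answer: 42",
]


def fake_llm(prompt: str, step: int) -> str:
    return SCRIPT[step]


def multiply(a: str, b: str) -> str:
    return str(int(a) * int(b))


TOOLS = {"multiply": multiply}
ACTION_RE = re.compile(r"Action: (\w+)\[(.*)\]")


def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for step in range(max_steps):
        reply = fake_llm(transcript, step)  # Thought (+ Action or answer)
        transcript += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            raise ValueError("model produced neither an action nor an answer")
        name, raw_args = match.groups()
        args = [a.strip() for a in raw_args.split(",")]
        observation = TOOLS[name](*args)  # Action
        transcript += f"\nObservation: {observation}"  # Observation
    raise RuntimeError("step limit reached without a final answer")


answer = react("What is 6 times 7?")
```

The transcript grows on every turn, so the model always sees its earlier thoughts and the observations they produced.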

Agent Architecture (browser‑use)

Core Components

Agent Core: Coordinates components, manages task flow, and ensures correct communication.

MessageManager: Handles all LLM communication (system prompts, user messages, tool outputs).

Memory: Provides short‑term and long‑term memory, using caching or vector databases.

LLM Interface: Sends messages to the language model and receives its responses.

Controller: Executes browser actions and registers tools.

BrowserContext: Manages browser sessions, DOM operations, and page state.

Execution Flow

The process follows ReAct: generate Thought, call a tool (Action), observe result, and iterate until the goal is reached.

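# Illustrative pseudocode: Tools, Context, and Agent are placeholder classes.
system_prompt = {"previousGoal": "...", "memory": "...", "next_goal": "...", "actions": "..."}
tools = Tools()            # registry of callable tools
context = Context(tools)   # shared task state, including the finished flag
agent = Agent(system_prompt, context)
while not context.finished:
    # Thought + Action: the model decides what to do next
    status, actions = agent.run(context)
    # Execute the chosen actions; results feed the next iteration as observations
    tools.run(actions)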

Memory Module

Memory consists of short‑term (recent conversation, tool info) and long‑term (vector store) components. The MessageManager records all messages, and the mem0 framework summarizes history to keep token usage low.

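from dataclasses import dataclass

# Simplified sketch of the message records kept by the MessageManager.
@dataclass
class MessageMetadata:
    token_count: int   # tokens this message consumes
    message_type: str  # kind of message, e.g. system, user, or tool output

@dataclass
class ManagedMessage:
    content: str
    metadata: MessageMetadata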
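
To make the token‑saving idea concrete, here is a minimal sketch of a short‑term buffer that folds older messages into a summary once a budget is exceeded. It does not use mem0 or browser‑use code: ShortTermMemory, count_tokens (one token per word), and naive_summary are assumptions made for illustration only.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def naive_summary(texts: List[str]) -> str:
    # Placeholder; a real system would ask an LLM or a memory framework.
    return f"Summary of {len(texts)} earlier messages"


class ShortTermMemory:
    def __init__(self, max_tokens: int,
                 summarize: Callable[[List[str]], str] = naive_summary) -> None:
        self.max_tokens = max_tokens
        self.summarize = summarize
        self.messages: List[str] = []

    def total_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def add(self, text: str) -> None:
        self.messages.append(text)
        if self.total_tokens() > self.max_tokens and len(self.messages) > 1:
            # Fold everything except the newest message into one summary line.
            older, newest = self.messages[:-1], self.messages[-1]
            self.messages = [self.summarize(older), newest]


memory = ShortTermMemory(max_tokens=10)
for note in ["open the login page", "type the user name", "click the submit button"]:
    memory.add(note)
```

After the third note the buffer holds one summary line plus the newest message, which is the trade‑off summarization makes: detail is lost, but the prompt stays within budget.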

Prompt Design & Structured Output

The system prompt defines a strict JSON output schema, includes examples, and uses Pydantic for validation. Example schema:

{
  "current_state": {
    "evaluation_previous_goal": "...",
    "memory": "...",
    "next_goal": "..."
  },
  "action": [{"click_element": {"index": 0}}]
}
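
A schema like this can be mirrored by Pydantic models so that malformed output is rejected before any action runs. The sketch below is an assumption about how such models might look, not the project's actual classes; CurrentState and AgentOutput are names chosen for illustration.

```python
from typing import Dict, List

from pydantic import BaseModel, ValidationError


class CurrentState(BaseModel):
    evaluation_previous_goal: str
    memory: str
    next_goal: str


class AgentOutput(BaseModel):
    current_state: CurrentState
    # Each action is a single-key mapping: {action_name: parameters}.
    action: List[Dict[str, dict]]


good = {
    "current_state": {
        "evaluation_previous_goal": "Success, page loaded",
        "memory": "On the login page",
        "next_goal": "Click the login button",
    },
    "action": [{"click_element": {"index": 0}}],
}
parsed = AgentOutput(**good)

try:
    AgentOutput(**{"action": []})  # current_state is missing
    error_text = ""
except ValidationError as exc:
    error_text = str(exc)
```

The error text names the offending field, which makes it suitable for sending back to the model as a correction hint.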

Three output handling modes are provided:

raw: Parse JSON from raw model output.

functionCall: Use OpenAI‑style function calls.

structured: LangChain’s structured output with Pydantic validation.
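
Of the three modes, raw is the one that needs the most care, since the model may wrap its JSON in extra prose. A minimal sketch of that extraction step, using only the standard library, might look like this; extract_json is a hypothetical helper and it assumes the first "{" in the text starts the JSON object.

```python
import json


def extract_json(raw: str) -> dict:
    """Pull the first JSON object out of free-form model text."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    decoder = json.JSONDecoder()
    # raw_decode parses one JSON value starting at `start` and ignores the rest.
    obj, _end = decoder.raw_decode(raw, start)
    return obj


reply = 'Here is my answer: {"action": [{"done": {}}]} Hope that helps.'
extracted = extract_json(reply)
```

The function‑call and structured modes avoid this step because the provider or framework returns arguments that are already separated from free text.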

Error Handling

If validation fails, the agent captures the error, inserts a message with details, and retries, encouraging the model to correct its format.

if isinstance(error, ValidationError):
    return f"Invalid model output format. Details: {str(error)}"
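
The capture, insert, and retry cycle can be sketched as a small loop. This is an illustration under stated assumptions: parse_output is a stand‑in validator that raises ValueError where the real agent would see a Pydantic ValidationError, and scripted_model replaces a real LLM call.

```python
import json
from typing import Callable, List


def parse_output(raw: str) -> dict:
    """Stand-in validator: requires valid JSON containing an 'action' key."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if "action" not in data:
        raise ValueError("missing required field: action")
    return data


def run_with_retries(ask_model: Callable[[List[str]], str],
                     max_retries: int = 3) -> dict:
    messages = ["Reply with JSON that contains an 'action' field."]
    for _ in range(max_retries):
        raw = ask_model(messages)
        try:
            return parse_output(raw)
        except ValueError as error:
            # Feed the failure back so the model can correct its format.
            messages.append(f"Invalid model output format. Details: {error}")
    raise RuntimeError("model never produced valid output")


# Scripted model: fails twice, then succeeds (hypothetical).
replies = iter(["not json", '{"thought": "hmm"}', '{"action": []}'])
seen = []


def scripted_model(messages: List[str]) -> str:
    seen.append(len(messages))  # record how much feedback has accumulated
    return next(replies)


result = run_with_retries(scripted_model)
```

Capping the number of retries matters in practice, since a model that keeps failing would otherwise loop forever while consuming tokens.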

Tool Registration & Invocation

Tools are registered in a registry with name, description, and parameter schema. During execution, the agent selects a tool based on the model’s output and calls it with assembled arguments.

@tool(name="click_element", description="Click a button", params=ClickParams)
def click_element(params):
    # implementation
    pass
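
A decorator of this shape can be backed by a plain dictionary. The following is a minimal sketch, not the browser‑use implementation: REGISTRY, RegisteredTool, invoke, and the ClickParams dataclass are assumptions introduced to show registration and lookup end to end.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ClickParams:
    index: int


@dataclass
class RegisteredTool:
    name: str
    description: str
    params: type
    func: Callable[[Any], Any]


REGISTRY: Dict[str, RegisteredTool] = {}


def tool(name: str, description: str, params: type):
    """Decorator that records a function in REGISTRY under `name`."""
    def decorator(func):
        REGISTRY[name] = RegisteredTool(name, description, params, func)
        return func
    return decorator


@tool(name="click_element", description="Click a button", params=ClickParams)
def click_element(params: ClickParams) -> str:
    return f"clicked element {params.index}"


def invoke(action: dict) -> Any:
    """Execute one model-produced action of the form {tool_name: {arg: value}}."""
    name, raw_args = next(iter(action.items()))
    entry = REGISTRY[name]
    # Build the typed parameter object, then call the registered function.
    return entry.func(entry.params(**raw_args))


outcome = invoke({"click_element": {"index": 0}})
```

The same registry can also be rendered into the system prompt or a function‑call tool list, since each entry carries its name, description, and parameter type.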

MCP Integration

MCP (Model Context Protocol) standardizes tool exposure. Two integration approaches are discussed:

System‑prompt method: expose MCP tools via a local use_mcp tool.

Function‑call method: wrap MCP tools as regular function‑call tools.

Integrating MCP into browser‑use would replace local tool registration with remote MCP services, allowing dynamic tool discovery.
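
For the function‑call method, the wrapping step is mostly a change of shape. The sketch below assumes tool descriptors with the name, description, and inputSchema fields that an MCP server returns when listing tools, and converts them to OpenAI‑style function specs; the search_docs tool and to_function_tools helper are made up for illustration, and no MCP client library is used.

```python
from typing import Any, Dict, List

# Tool descriptors shaped like the entries of an MCP tool listing
# (name, description, inputSchema). The values are made up.
mcp_tools = [
    {
        "name": "search_docs",
        "description": "Search the documentation.",
        "inputSchema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
]


def to_function_tools(tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Wrap MCP tool descriptors as OpenAI-style function-call tool specs."""
    return [
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t.get("description", ""),
                # MCP input schemas are JSON Schema, so they can be passed through.
                "parameters": t["inputSchema"],
            },
        }
        for t in tools
    ]


specs = to_function_tools(mcp_tools)
```

Because the list is built at runtime from whatever the server reports, newly published tools become available without changing local registration code.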

Coze Space

Coze Space (launched April 19) offers three core capabilities: task automation, expert‑agent ecosystem, and MCP integration. It supports two agent modes:

Exploration mode: Interleaved planning and execution (similar to ReAct).

Planning mode: High‑level plan first, then execute sub‑tasks sequentially.

Both modes improve flexibility and user interaction.
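
The difference between the two modes comes down to when planning happens. The sketch below is a generic illustration and says nothing about how Coze Space is implemented; planning_mode, exploration_mode, and the stub planner and executor are hypothetical.

```python
from typing import Callable, List, Optional


def planning_mode(goal: str,
                  make_plan: Callable[[str], List[str]],
                  execute: Callable[[str], str]) -> List[str]:
    """Plan once up front, then run every sub-task in order."""
    plan = make_plan(goal)
    return [execute(step) for step in plan]


def exploration_mode(goal: str,
                     next_step: Callable[[str, List[str]], Optional[str]],
                     execute: Callable[[str], str],
                     max_steps: int = 10) -> List[str]:
    """Decide one step at a time, looking at the results so far."""
    history: List[str] = []
    for _ in range(max_steps):
        step = next_step(goal, history)
        if step is None:  # the planner signals that the goal is reached
            break
        history.append(execute(step))
    return history


# Stub planner and executor used only for demonstration.
STEPS = ["search", "summarize"]


def stub_plan(goal: str) -> List[str]:
    return list(STEPS)


def stub_next(goal: str, history: List[str]) -> Optional[str]:
    return STEPS[len(history)] if len(history) < len(STEPS) else None


def stub_execute(step: str) -> str:
    return f"done:{step}"


planned = planning_mode("write a report", stub_plan, stub_execute)
explored = exploration_mode("write a report", stub_next, stub_execute)
```

With these stubs both modes produce the same result; they diverge only when an intermediate result changes what the next step should be, which exploration mode can react to and a fixed plan cannot.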

Conclusion & Outlook

The article reflects on current agent designs and proposes future features such as self‑planning, hierarchical planning, rethink mechanisms, and better human‑AI interaction. It emphasizes that stable output, distributed memory, multi‑model compatibility, and prompt security are all important for sustainable agent deployment.

