How Browser‑Use Leverages AI Prompts for Seamless Browser Automation

This article explains how the open‑source browser‑use framework combines carefully designed SystemMessage prompts, structured HumanMessage inputs, and LangChain‑driven tool calls to enable large language models to automate complex web tasks such as shopping, CRM updates, résumé processing, and document generation, while providing concrete code examples and best‑practice tips.

Alibaba Cloud Developer

1. Introduction to browser-use

Browser-use is an open‑source AI‑driven browser automation framework that excels at converting high‑level tasks into step‑by‑step actions using large language models (LLMs). It has gained strong community traction (over 63k GitHub stars) and can perform tasks like adding items to a cart, syncing LinkedIn contacts to Salesforce, parsing resumes for machine‑learning jobs, drafting thank‑you letters, and searching Hugging Face models.

Question: How does browser-use achieve this?

Answer: By mastering interaction techniques with LLMs.

2. Analysis of Interaction Mechanism

The power of browser-use lies in its sophisticated prompt design and the combination of various message types that guide the LLM.

2.1 Complete Input

An example task is to open a URL and log in with a username and password. Browser-use breaks the task into sub‑goals and interacts with the LLM repeatedly.

[
  SystemMessage(content='You are an AI agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules...'),
  HumanMessage(content='Your ultimate task is: "1. Open https://one.console.con.env136.shuguang.com 2. Enter username and password"...'),
  HumanMessage(content='Example output:'),
  AIMessage(...)
]

2.2 SystemMessage

SystemMessage defines the AI’s role, conversation rules, and output format. It is written in Markdown and converted to a string before being sent to the model.

SystemMessage example

2.2.1 Identity Specification

The prompt opens by telling the model who it is and what it must do: "You are an AI agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules."

2.2.2 Input Format

The framework defines a structured input consisting of Task, Previous steps, Current URL, Open Tabs, and Interactive Elements. Each interactive element is written in the form:

[index]<type>text</type>
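
As a rough illustration of that element format, the sketch below serializes a few page elements into `[index]<type>text</type>` lines. The `InteractiveElement` class, the `render_elements` helper, and the sample elements are hypothetical names chosen for this example, not part of browser-use.

```python
from dataclasses import dataclass


@dataclass
class InteractiveElement:
    """Hypothetical stand-in for one clickable or editable page element."""
    index: int
    tag: str
    text: str


def render_elements(elements):
    """Serialize elements, one per line, as [index]<type>text</type>."""
    return "\n".join(
        f"[{el.index}]<{el.tag}>{el.text}</{el.tag}>" for el in elements
    )


# Sample elements invented for this sketch.
elements = [
    InteractiveElement(5, "input", "Username"),
    InteractiveElement(6, "input", "Password"),
    InteractiveElement(10, "button", "Log in"),
]
print(render_elements(elements))
```

Giving every element a numeric index lets the model refer to it by number in its actions instead of describing it in prose.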

2.2.3 Response Rules

The LLM must always return a JSON object with current_state (including evaluation_previous_goal, memory, next_goal) and an action list describing concrete browser operations.
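
A minimal sketch of how a caller might check a reply against these rules, using only the field names listed above. The `parse_agent_response` helper and the sample reply are assumptions made for illustration; browser-use itself relies on LangChain's structured output rather than a hand-written check like this one.

```python
import json

# Field names taken from the response rules described above.
REQUIRED_STATE_KEYS = {"evaluation_previous_goal", "memory", "next_goal"}


def parse_agent_response(raw_text):
    """Parse a model reply and verify it has current_state and an action list."""
    data = json.loads(raw_text)
    state = data.get("current_state")
    if not isinstance(state, dict):
        raise ValueError("current_state must be an object")
    missing = REQUIRED_STATE_KEYS - state.keys()
    if missing:
        raise ValueError(f"current_state is missing: {sorted(missing)}")
    actions = data.get("action")
    if not isinstance(actions, list):
        raise ValueError("action must be a list")
    return state, actions


# Sample reply invented for this sketch.
sample_reply = json.dumps({
    "current_state": {
        "evaluation_previous_goal": "Success",
        "memory": "Navigated to login page",
        "next_goal": "Input username and password",
    },
    "action": [{"click_element": {"index": 10}}],
})
state, actions = parse_agent_response(sample_reply)
print(state["next_goal"], len(actions))
```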

2.3 HumanMessage

HumanMessage carries the user’s natural‑language instructions, task history markers, and any auxiliary data (images, timestamps). It can be split across multiple messages; the framework uses markers like [Your task history memory starts here] and [Task history memory ends] to delimit historical context.
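
The delimiting idea can be sketched with plain strings standing in for message contents. The two marker strings come from the text above; the `wrap_history` and `extract_history` helpers and the sample history are hypothetical and only show how such markers make the historical span easy to locate.

```python
HISTORY_START = "[Your task history memory starts here]"
HISTORY_END = "[Task history memory ends]"


def wrap_history(history_lines):
    """Surround past-step entries with the start and end markers."""
    return [HISTORY_START, *history_lines, HISTORY_END]


def extract_history(lines):
    """Return only the entries that sit between the two markers."""
    start = lines.index(HISTORY_START)
    end = lines.index(HISTORY_END, start + 1)
    return lines[start + 1:end]


# Sample conversation content invented for this sketch.
past_steps = ["Opened the login page", "Typed the username"]
conversation = ["Your ultimate task is: ...", *wrap_history(past_steps), "Current url: ..."]
print(extract_history(conversation))
```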

Task history illustration

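2.4 AIMessage

AIMessage is the model’s direct response, usually containing the JSON output defined by the SystemMessage. The framework can also place an AIMessage in the input as an example output, as in the Complete Input example, where an AIMessage follows the "Example output:" HumanMessage to show the model the expected format.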

{"current_state": {"evaluation_previous_goal": "Success", "memory": "Navigated to login page...", "next_goal": "Input username and password"}, "action": [{"input_text": {"index": 5, "text": "admin"}}, {"input_text": {"index": 6, "text": "****"}}, {"click_element": {"index": 10}}]}

2.5 ToolMessage

ToolMessage records the execution result of a tool call (e.g., navigating to a URL, clicking an element) and feeds it back to the LLM for further reasoning.

ToolMessage(content='', tool_call_id='2')

2.6 Full Output Example

The framework uses LangChain’s with_structured_output to enforce the JSON schema. The result contains the raw LLM output (an AIMessage, which also carries token usage metadata), the parsed data, and any parsing error.

{"raw": AIMessage(...), "parsed": AgentOutput(...), "parsing_error": None}

3. Conclusion

Browser-use demonstrates effective techniques for LLM‑driven browser automation, including markdown‑based system prompts, explicit input schemas, structured JSON responses, and comprehensive message logging. These practices can be adapted to other LLM interaction scenarios to improve reliability and transparency.
