Understanding AI Agents: Core Modules, Planning Strategies, and Evaluation
This article explains what an AI agent is and outlines its four core modules: perception, memory, planning, and action. It describes the role of large language models, compares three generations of software development, discusses memory implementations, covers planning methods such as ReAct and Plan‑and‑Solve, and closes with evaluation, cost analysis, and the differences between agents and workflows.
What is an Agent
An agent is a system that can perceive its environment, make decisions, and take actions to achieve a goal.
Imagine an agent as a person: when you see a car approaching, you instinctively avoid it. The eyes capture the scene, the brain recalls that not moving could cause a collision, and you step aside. This perception‑decision‑action loop illustrates the core of an agent.
Agent Core Modules
Although different teams draw the boundaries differently, an agent is commonly divided into four essential modules: perception, memory, planning, and action.
LLM (Large Language Model)
Large language models serve as the brain of an agent. They are built on neural networks, whose design loosely mimics how neurons in the human brain connect and transmit signals.
Software development has evolved through three generations:
Software 1.0 : Traditional hand‑written code by programmers.
Software 2.0 : Deep learning‑driven models where data trains neural network weights, reducing manual coding.
Software 3.0 : AI‑assisted development tools (e.g., Cursor, Trae, WinSelf) that let developers program with natural language, dramatically lowering the barrier to software creation.
Recommended reading: *The Deep Learning Revolution* and *A Gentle Introduction to Neural Networks and Deep Learning*.
Perception
Perception is how an agent senses its environment. Humans use eyes and ears; robots use radar and cameras. In software, perception comes from input data such as text, images, video, audio, or files.
Memory
Large models lack intrinsic memory; external components such as Retrieval‑Augmented Generation (RAG) pipelines or search engines like Elasticsearch simulate memory to improve task performance and interaction continuity.
Memory is divided into short‑term and long‑term types. Short‑term memory depends on the token limit of the model context, while long‑term memory relies on external storage (files, databases) and can include user profiles, situational memory, or factual knowledge.
Short‑term memory depends on token limits
Reference code for managing messages passed to the LLM:
```typescript
interface ChatMessage {
  role: 'user' | 'assistant' | 'system' | 'tool';
  content: string;
  tool_calls?: unknown;  // tool-call payload; the exact shape depends on the model API
  call_id?: string;
}

class Memory {
  private messages: ChatMessage[] = [];

  addMessage(message: ChatMessage) {
    this.messages.push(message);
  }

  // The returned history must fit within the model's context window; in
  // practice older messages are trimmed or summarized before each call.
  getMessages() {
    return this.messages;
  }
}
```

Long‑term memory relies on external components
Long‑term memory can be stored in files or databases, searchable and updatable. It can be further categorized into user‑profile memory, situational memory, and factual memory.
In a cake‑baking assistant case, real‑time dialogue is kept as working memory; once a threshold is reached, the LLM extracts key information to form short‑term memory. Long‑term memory stores user profiles and business insights, which are recalled and updated on each model call.
The system is organized into three layers: working memory, short‑term memory, and long‑term memory.
On each query, relevant short‑term memories are recalled via vector search and their recall counts are updated.
Short‑term memories are in turn distilled by the LLM into longer‑lived entries and assigned importance scores.
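The recall mechanics described above (vector search, recall counts, importance scores) can be sketched as follows. The entry shape and the importance‑weighted scoring are illustrative assumptions, and a toy cosine similarity stands in for a real vector store:

```typescript
// Hypothetical memory entry; field names are illustrative, not from any library.
interface MemoryEntry {
  text: string;
  embedding: number[]; // produced by an embedding model (assumed to exist upstream)
  importance: number;  // 0..1 score assigned by the LLM (assumed)
  recalls: number;     // how many times this entry has been recalled
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Recall the top-k entries, weighting similarity by importance, and bump
// each recalled entry's counter, as the layered-memory design describes.
function recall(query: number[], store: MemoryEntry[], k = 2): MemoryEntry[] {
  const top = [...store]
    .sort((x, y) =>
      cosine(query, y.embedding) * y.importance -
      cosine(query, x.embedding) * x.importance)
    .slice(0, k);
  top.forEach(e => e.recalls++);
  return top;
}
```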
Action
Action is the execution module: the LLM (the brain) needs tools (the hands) to act. An agent must describe each tool (name, description, parameters) so the LLM can decide when and how to call it.
Example: a product‑query tool that finds items in a merchant’s catalog based on a question or image.
Keyword search: extract key information from product images and titles using a multimodal model, then perform similarity search.
Image search: vectorize product images, extract the desired image link from the user query, and match it against the catalog.
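As a hedged illustration of the keyword path, the sketch below ranks products by simple keyword overlap; a production system would use embedding similarity and a multimodal keyword extractor, both assumed away here:

```typescript
// Illustrative product shape; the extracted keywords are assumed to already
// exist on each catalog item (produced offline by a multimodal model).
interface Product { title: string; keywords: string[]; imageUrl: string; }

// Jaccard overlap between query keywords and a product's keywords -- a
// stand-in for the embedding-based similarity search a real system would run.
function keywordScore(query: string[], product: Product): number {
  const q = new Set(query);
  const p = new Set(product.keywords);
  let inter = 0;
  for (const k of q) if (p.has(k)) inter++;
  const union = q.size + p.size - inter;
  return union === 0 ? 0 : inter / union;
}

// Return matching products, best score first; zero-score items are dropped.
function searchProducts(query: string[], catalog: Product[]): Product[] {
  return catalog
    .map(p => ({ p, s: keywordScore(query, p) }))
    .filter(x => x.s > 0)
    .sort((a, b) => b.s - a.s)
    .map(x => x.p);
}
```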
```typescript
const tools = [
  {
    "type": "function",
    "function": {
      "name": "query_goods",
      "description": "Query cake product information",
      "parameters": {
        "type": "object",
        "properties": {
          "keywords": {
            "type": "array",
            "items": { "type": "string", "description": "A key attribute of the cake product" },
            "description": "Key attributes of the cake product, such as occasion, target audience, style, flavor, theme, and color"
          },
          "image_url": { "type": "string", "description": "URL of the cake product image" }
        }
      }
    }
  },
  ...
];
```

Planning
Two common planning patterns are ReAct and Plan‑and‑Solve, with multi‑agent approaches also possible.
ReAct (Reason + Action + Feedback)
Agents iteratively reason, act, and receive feedback to achieve the user’s goal.
Cake‑baking assistant flow diagram (figure not reproduced here).
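A minimal sketch of this reason‑act‑feedback loop; `callLLM` and `runTool` are stubs standing in for a real model API and tool executor:

```typescript
// One step of the loop: the model either decides to call a tool or finishes.
type Step =
  | { kind: "act"; tool: string; input: string } // model wants a tool call
  | { kind: "finish"; answer: string };          // model has enough to answer

function reactLoop(
  goal: string,
  callLLM: (goal: string, observations: string[]) => Step,
  runTool: (tool: string, input: string) => string,
  maxSteps = 5,
): string {
  const observations: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = callLLM(goal, observations);       // Reason
    if (step.kind === "finish") return step.answer; // goal reached
    const result = runTool(step.tool, step.input);  // Act
    observations.push(result);                      // Feedback for next round
  }
  return "max steps reached";
}
```

The `maxSteps` cap is a common safeguard so a confused model cannot loop forever.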
Plan‑and‑Solve
Tasks are decomposed into sub‑goals first, then executed in parallel. Unlike ReAct, which proceeds step‑by‑step, Plan‑and‑Solve separates planning from execution.
Example: a travel‑planning agent first generates a plan (weather, accommodation, etc.) and then solves each sub‑task, optionally reflecting on the plan.
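The plan‑then‑execute split can be sketched as follows; the planner and solver are stubs standing in for LLM calls:

```typescript
// Plan-and-Solve: one planning call decomposes the goal into sub-tasks,
// then all sub-tasks are solved in parallel -- unlike ReAct's step-by-step loop.
async function planAndSolve(
  goal: string,
  plan: (goal: string) => string[],          // planning phase (one LLM call)
  solve: (task: string) => Promise<string>,  // per-sub-task execution
): Promise<string[]> {
  const subTasks = plan(goal);
  return Promise.all(subTasks.map(solve));   // execution phase, in parallel
}
```

An optional reflection step would inspect the results and revise the plan before answering.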
Agent Evaluation Report
Evaluation is essential and should record response time, completeness, tool usage, and optimization opportunities to provide data‑driven improvements.
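A per‑run record along these lines could feed such a report; the field names are illustrative, not a fixed schema:

```typescript
// Hypothetical evaluation record covering the metrics the report tracks.
interface EvalRecord {
  query: string;
  responseMs: number;   // response time
  toolCalls: string[];  // which tools were used
  complete: boolean;    // did the answer fully address the query
  notes: string;        // observed optimization opportunities
}

// Aggregate a batch of records into report-level numbers.
function summarize(records: EvalRecord[]) {
  const n = records.length || 1;
  return {
    avgResponseMs: records.reduce((s, r) => s + r.responseMs, 0) / n,
    completionRate: records.filter(r => r.complete).length / n,
    avgToolCalls: records.reduce((s, r) => s + r.toolCalls.length, 0) / n,
  };
}
```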
Agent Cost Analysis
AI‑driven applications incur costs per model call, unlike traditional software. Analyzing these costs helps assess value and guide optimization.
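Model calls are typically priced per token, so cost can be tallied call by call. A minimal sketch, with placeholder prices that are not any provider's real rates:

```typescript
// Hypothetical per-1K-token prices (placeholders, not real provider rates).
const PRICE_PER_1K_INPUT = 0.001;
const PRICE_PER_1K_OUTPUT = 0.002;

// Cost of a single model call from its input and output token counts.
function callCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * PRICE_PER_1K_INPUT +
         (outputTokens / 1000) * PRICE_PER_1K_OUTPUT;
}

// A conversation's cost is the sum over every model call it made --
// which is why multi-step agents cost more than single-shot completions.
function conversationCost(calls: Array<[number, number]>): number {
  return calls.reduce((s, [inp, out]) => s + callCost(inp, out), 0);
}
```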
Agent vs Workflow
Agents require only a goal description; workflows require a predefined execution path.
How to Choose
Workflow : suitable for data pipelines, document‑approval processes, and other fixed, repeatable tasks that need strict order and auditability.
Agent : ideal for intelligent assistants, personal AI, complex problem diagnosis, dynamic decision‑making, personalized context understanding, creative problem solving, and conversational experiences.
FM Agent vs RL Agent
FM (Foundation Model) agents are built on large pretrained models such as GPT‑4, Claude, or Gemini, leveraging LLM capabilities for perception, reasoning, planning, and generation.
RL (Reinforcement Learning) agents learn optimal policies through interaction with an environment and reward signals, exemplified by DeepMind’s AlphaGo Zero.
Conclusion
Agents, LLMs, and workflows are tools; their true value emerges when they are applied to real business scenarios, solve concrete problems, and create measurable impact.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.