What Makes an AI Agent Tick? From Expert Systems to Modern Architectures

This article traces the evolution of AI agents from early expert systems to today's multimodal, memory-rich agents. It explains their perception, reasoning, memory, and action modules; discusses model selection, prompt engineering, and RAG techniques; and highlights current limitations such as hallucinations, reliability, cost, and security.

In recent days, news about Manus has kept appearing in my feed: the company wiping all of its content from Weibo and Xiaohongshu, layoff controversies, and reports that core engineers were relocated to Singapore. A company that is Chinese through and through, founded in Beijing and Wuhan, ultimately chose to leave.

Recall this past March, when Manus's launch set social networks alight. Invitation codes were scarce and even resold at a premium on second-hand platforms. Founder Xiao Hong (nicknamed "Red") led a young team that ignited industry-wide enthusiasm for AI agents with a single product.

2025 is the year of AI agents, and their development speed is astonishing: not only general-purpose agents like Manus, but also vertical agents such as Lovart for design and Claude Code and Cursor for programming.

So what is an AI agent and what does it consist of? Let’s discuss.

1. Starting from Expert Systems

To talk about the history of AI agents, we must go back to the 1960s. At that time, computer scientists wondered whether a machine could perceive its environment, make decisions, and act like a human.

The earliest attempts were expert systems. For example, in the 1970s Stanford developed MYCIN, a system that diagnosed blood infections. It worked by asking a doctor a series of questions and then giving a diagnosis based on predefined rules. Though primitive by today’s standards, it was considered "intelligent" back then.

In the 1980s, more complex systems like R1/XCON emerged to help DEC configure computer systems. These rule‑based systems required engineers to anticipate every possible situation and encode it as if‑then rules, which proved impractical for the real world.

In the 1990s, researchers shifted to machine‑learning agents. Instead of hand‑crafting all rules, machines learned from data, giving rise to reinforcement‑learning agents that improved through trial and error.

The real turning point arrived in the 2010s with the rise of deep learning. The 2017 introduction of the Transformer architecture changed the game: pretrained language models such as BERT and the GPT series made AI agents far more capable. They no longer needed hand-written rules and could instead understand natural language and make context-aware judgments.

2. The Shape of Modern AI Agents

To understand modern AI agents, we first need to define what they are.

An AI agent is a system that can autonomously perceive its environment, devise a plan, and take actions to achieve a specific goal.

For example, if you ask an AI to book a flight from Beijing to Shanghai, a simple chatbot might reply, "Please log in to the airline website yourself." A true AI agent would:

Understand your requirements (date, budget, preferences).

Search flight information from multiple airlines.

Compare prices and times.

Filter results according to your preferences.

Potentially complete the booking if an appropriate API is available.

The difference between an AI agent and a regular AI application is that the former actively solves problems rather than passively answering questions.
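To make this concrete, here is a minimal sketch of the perceive-plan-act loop that drives an agent. Every name in it (run_agent, perceive, plan, act) is a hypothetical placeholder, not any particular framework's API:

```python
from typing import Any, Callable, Optional

# Minimal sketch of an agent's perceive-plan-act loop.
# All names here are hypothetical placeholders, not a real framework API.

def run_agent(
    goal: str,
    perceive: Callable[[], dict],
    plan: Callable[[str, dict], Optional[str]],
    act: Callable[[str], Any],
    max_steps: int = 10,
) -> str:
    """Loop: observe the environment, let the model pick an action, execute it."""
    for _ in range(max_steps):
        observation = perceive()           # state context + intent context
        action = plan(goal, observation)   # an LLM call in a real agent
        if action is None:                 # planner decided the goal is met
            return "done"
        act(action)                        # e.g. call a flight-search API
    return "step budget exhausted"

# Toy usage: a "flight booking" that succeeds after one search.
state = {"results": None}
print(run_agent(
    goal="book a Beijing-to-Shanghai flight",
    perceive=lambda: state,
    plan=lambda g, obs: None if obs["results"] else "search_flights",
    act=lambda a: state.update(results=["CA1501 08:00", "MU5101 09:00"]),
))
```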

3. Core Technologies and Architecture

Now let’s examine how AI agents are built. The architecture can be divided into four parts.

3.1 Perception Module

This is the "eyes" and "ears" of the agent. It must understand user input and perceive the state of the environment. For instance, when an AI agent helps you write code, it needs to grasp:

What functionality you want to implement.

Which programming language you are using.

Any special requirements.

The current code structure.

The perception module must distinguish two kinds of context: state context and intent context.

State context is objective information about the environment, e.g.:

Project uses Python 3.9.

Codebase already includes an authentication module.

Database is MySQL.

Framework is FastAPI.

These are factual, verifiable details that the agent must retrieve accurately.

Intent context reflects the user’s goal, which is often vague and subjective. For example, when a user says "optimize this code," the agent must infer whether the user wants performance or readability improvements, the expected performance target, and any constraints.

Confusing state and intent is a common source of failure. For instance, if a user says "this function is too slow," the agent should recognize the state (function runs in 500 ms) and the intent (reduce to under 100 ms).
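One way to keep the two apart is to model them as separate structures. Below is a minimal sketch; the dataclass fields are illustrative assumptions drawn from the examples above, not a standard schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StateContext:
    """Objective, verifiable facts about the environment."""
    python_version: str = "3.9"
    framework: str = "FastAPI"
    database: str = "MySQL"
    observed_latency_ms: Optional[float] = None   # "function runs in 500 ms"

@dataclass
class IntentContext:
    """The user's (often vague) goal, made explicit."""
    raw_request: str = ""
    optimization_target: str = "performance"      # vs. "readability"
    latency_budget_ms: Optional[float] = None     # "get it under 100 ms"
    constraints: List[str] = field(default_factory=list)

# "This function is too slow" splits into one state fact and one inferred intent:
state = StateContext(observed_latency_ms=500)
intent = IntentContext(raw_request="this function is too slow",
                       latency_budget_ms=100)
```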

Modern agents enhance perception through:

Multimodal perception : handling text, images, audio, and video.

Proactive questioning : asking clarifying questions when information is insufficient.

History analysis : leveraging past user behavior to infer current intent.

Environment probing : inspecting configuration files, dependencies, test suites, etc., before acting.

Accurate perception is crucial; a driver who cannot see the road cannot drive well, just as an agent that misperceives context will fail downstream.

3.2 Reasoning Module

This is the "brain"—typically a large language model (LLM). Popular choices include GPT‑4, Claude, and Gemini. However, merely picking a big model is insufficient; understanding each model’s characteristics is essential.

Model personality differences : Like humans, models have distinct "personalities" based on training objectives.

Examples of "thinking" models:

Claude 3 Opus – prefers global understanding and proactively infers intent.

Gemini 2.0 Flash – confident, often makes bold changes.

o1 – designed for deep reasoning, takes time to analyze complex problems.

These models act like expert consultants, suitable for exploratory or large‑scale tasks.

Examples of "execution" models:

Claude 3.5 Sonnet – waits for explicit instructions, avoids over‑inference.

GPT‑4 Turbo – predictable behavior, ideal for precise control.

Wenxin Yiyan 4.0 – stable performance on Chinese tasks.

Choosing a model is akin to selecting the right tool: use a hammer for nails, a screwdriver for screws.

Task‑type selection :

Code generation – Claude 3.5 Sonnet or GPT‑4.

Code understanding/restructuring – Gemini 2.0 Flash (long context).

Complex bug debugging – o1 (deep reasoning).

Chinese document processing – Tongyi Qianwen, Doubao.

Interaction‑style selection :

Prefer detailed instructions? Choose an execution model.

Prefer giving high‑level direction? Choose a thinking model.

Need creative solutions? Choose a more "active" model.

Need stable output? Choose a more "conservative" model.

Prompt engineering evolution : Early agents required long, detailed prompts like the one below; nowadays a simple "optimize this code" often suffices.

You are a professional Python developer. Strictly follow the PEP 8 style guide.
When writing code, consider the following points:
1. Code readability
2. Performance optimization
3. Error handling
... (20 more rules)

Modern agents dynamically adjust prompts based on context, using detailed background for the first turn and incremental information thereafter.
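As a sketch of that turn-aware behavior, the hypothetical build_prompt helper below sends the full project background only on the first turn and just the delta afterwards (the message format mimics the common chat-completion style):

```python
from typing import List, Optional

# Sketch of turn-aware prompt construction: full background on the first
# turn, only the delta afterwards. Hypothetical helper, not a real SDK.

SYSTEM_BACKGROUND = (
    "You are a professional Python developer. Follow PEP 8. "
    "Project: Python 3.9, FastAPI, MySQL."
)

def build_prompt(turn: int, user_message: str,
                 changed_context: Optional[str] = None) -> List[dict]:
    messages = []
    if turn == 0:
        # First turn: spend tokens on the full project background.
        messages.append({"role": "system", "content": SYSTEM_BACKGROUND})
    elif changed_context:
        # Later turns: only pass what changed since the last turn.
        messages.append({"role": "system",
                         "content": f"Context update: {changed_context}"})
    messages.append({"role": "user", "content": user_message})
    return messages

print(build_prompt(0, "optimize this code"))
print(build_prompt(3, "now add error handling",
                   changed_context="auth module was refactored"))
```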

Hybrid model strategies : Many agents combine multiple models in a pipeline—e.g., Claude 3 Opus for understanding, o1 for planning, GPT‑4 Turbo for execution, followed by a specialized code model for fine‑tuning.
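A rough sketch of such a routing pipeline follows. The stage-to-model mapping mirrors the example above, and call_model is a placeholder for whatever client SDK you actually use:

```python
# Sketch of a multi-model pipeline: route each stage to the model that
# suits it. call_model is a placeholder for a real API client.

PIPELINE = [
    ("understand", "claude-3-opus"),   # global understanding
    ("plan",       "o1"),              # deep reasoning over the plan
    ("execute",    "gpt-4-turbo"),     # predictable, instruction-following
]

def call_model(model: str, prompt: str) -> str:
    # Placeholder: swap in a real SDK call here.
    return f"[{model}] output for: {prompt[:40]}"

def run_pipeline(task: str) -> str:
    artifact = task
    for stage, model in PIPELINE:
        # Each stage's output becomes the next stage's input.
        artifact = call_model(model, f"{stage}: {artifact}")
    return artifact

print(run_pipeline("refactor the authentication module"))
```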

3.3 Memory Module

Just as humans need to remember previous events, agents need memory. Memory is divided into:

Short‑term memory – current conversation context.

Long‑term memory – historical dialogues and learned knowledge.

Working memory – intermediate state during a task.

Implementing a robust memory system is complex.

3.3.1 Hierarchical Memory

Agents require layered memory similar to the human brain:

Sensory Memory : lasts seconds to minutes; stores the most recent utterances (immediate user input and system output); used to resolve pronouns and other short-term references.

Working Memory : lasts for the duration of a task; holds the current task state, intermediate results, and to-do items; supports step-by-step execution.

Episodic Memory : lasts days to months; stores complete dialogue histories and task execution logs; used to understand user preferences and avoid repeating mistakes.

Semantic Memory : permanent; contains domain knowledge, best practices, and learned patterns; accumulates experience and improves capability over time.
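Here is a minimal sketch of such a layered store. The class and its retention policies (a ten-utterance sensory buffer, a per-task working dict) are illustrative assumptions, not a prescribed design:

```python
import time
from collections import deque
from typing import Optional

# Minimal sketch of a layered memory store. Retention policies are
# illustrative assumptions, not a prescription.

class LayeredMemory:
    def __init__(self) -> None:
        self.sensory = deque(maxlen=10)   # last few utterances only
        self.working: dict = {}           # task state, cleared per task
        self.episodic: list = []          # (timestamp, record) log
        self.semantic: dict = {}          # permanent learned patterns

    def observe(self, utterance: str) -> None:
        self.sensory.append(utterance)                  # sensory layer
        self.episodic.append((time.time(), utterance))  # episodic log

    def end_task(self, lesson_key: Optional[str] = None,
                 lesson: Optional[str] = None) -> None:
        if lesson_key:                    # promote what is worth keeping
            self.semantic[lesson_key] = lesson
        self.working.clear()              # working memory dies with the task

mem = LayeredMemory()
mem.observe("this function is too slow")
mem.working["target_ms"] = 100
mem.end_task("user_pref", "prefers performance over readability")
```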

3.3.2 Retrieval‑Augmented Generation (RAG)

The most mature solution today is RAG: when an agent needs an answer, it first retrieves relevant information from a knowledge base and feeds that context to the LLM.

Example: asking "What is our company's annual leave policy?" triggers a retrieval of the policy document, then the agent generates a response based on that content.
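Here is a minimal retrieve-then-generate sketch. The toy character-histogram embedding only illustrates the shape of the loop; a real system would use a proper embedding model and an LLM call where generate is indicated:

```python
from math import sqrt
from typing import List

# Minimal retrieve-then-generate sketch. embed() is a toy stand-in for
# a real embedding model; the final LLM call is left as a comment.

def embed(text: str) -> List[float]:
    # Toy embedding: character histogram. Use a real model in practice.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb or 1.0)

def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def answer(query: str, docs: List[str]) -> str:
    context = "\n".join(retrieve(query, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQ: {query}"
    return prompt  # in a real agent: return generate(prompt)

docs = ["Annual leave: 15 days per year, accrued monthly.",
        "Expense policy: submit receipts within 30 days."]
print(answer("What is our annual leave policy?", docs))
```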

RAG evolution :

First generation (2020‑2022): simple vector retrieval using BERT or Sentence‑BERT, top‑K recall, limited relevance.

Second generation (2023‑2024): hybrid retrieval (vector + keyword), stronger encoders (BGE, E5), reranking, document‑structure awareness.

Third generation (2024‑present): multi‑level indexes (summary → section → paragraph), query rewriting, dynamic context windows, knowledge‑graph enhancement.

Practical RAG optimization includes intelligent chunking (by function, chapter, or dialogue turn), multi‑path retrieval (vector, BM25, entity linking, graph), context engineering, incremental indexing, and hot‑updates.
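To illustrate the multi-path idea, the sketch below fuses two toy retrievers (keyword overlap and character-trigram overlap, standing in for real BM25 and vector search) with reciprocal rank fusion, a common rule for merging rankings:

```python
from typing import Callable, Dict, List

# Sketch of multi-path retrieval: run several retrievers, then merge
# their rankings with reciprocal rank fusion (RRF). Both scorers are
# toys standing in for real BM25 and embedding-based search.

def keyword_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def trigram_score(query: str, doc: str) -> float:
    grams = lambda s: {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    q, d = grams(query.lower()), grams(doc.lower())
    return len(q & d) / (len(q) or 1)

def rank(docs: List[str], scorer: Callable, query: str) -> List[str]:
    return sorted(docs, key=lambda doc: scorer(query, doc), reverse=True)

def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    # Each document earns 1/(k + position) from every ranking it appears in.
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for pos, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)

docs = ["Annual leave: 15 days per year.",
        "Leave requests go through the HR portal.",
        "Expense receipts are due within 30 days."]
query = "annual leave policy"
fused = rrf([rank(docs, keyword_score, query),
             rank(docs, trigram_score, query)])
print(fused[0])  # best overall match across both retrieval paths
```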

3.3.3 Long‑Context Handling

Recent models support extremely long contexts—Claude 3 up to 200 K tokens, Gemini 1.5 Pro up to 2 M tokens.

Challenges of long context :

"Lost in the middle" – models remember beginnings and ends better than middle sections.

Attention dilution – longer context spreads attention thinly.

Reasoning degradation – inference quality drops with very long inputs.

Hybrid approaches combine long context with selective attention to mitigate these issues.
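One widely used mitigation for "lost in the middle" is to reorder retrieved chunks so the strongest ones sit at the start and end of the prompt, pushing the weakest into the middle where attention is poorest. A minimal sketch:

```python
from typing import List

# Sketch of a lost-in-the-middle mitigation: interleave chunks so the
# highest-relevance ones land at the start and end of the context.

def reorder_for_long_context(ranked_chunks: List[str]) -> List[str]:
    """ranked_chunks is best-first; returns a start/end-weighted order."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]   # best at the start, second-best at the end

chunks = ["most relevant", "2nd", "3rd", "4th", "least relevant"]
print(reorder_for_long_context(chunks))
# ['most relevant', '3rd', 'least relevant', '4th', '2nd']
```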

3.3.4 Active Forgetting

Effective agents learn to forget rather than retain everything.

Why forget?

Noise reduction – not all information is valuable.

Privacy – sensitive data must be removed.

Efficiency – keep the memory system performant.

Relevance – outdated info can cause harm.

Forgetting strategies :

Time‑based: delete temporary info after 24 h, archive low‑frequency memory after 30 days, purge erroneous records after 90 days.

Importance‑based: LRU eviction, dynamic scoring by access frequency, retain high‑value "key moments".

Relevance‑based: replace conflicting new info, merge similar memories, periodically compress the memory store.
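Here is a sketch combining the time-based and importance-based rules above; the 24-hour and 30-day thresholds are just the illustrative numbers from the list:

```python
import time
from typing import List, Optional

# Sketch of active forgetting, combining the time- and importance-based
# rules above. Thresholds mirror the illustrative numbers in the list.

DAY = 86400  # seconds

class MemoryItem:
    def __init__(self, content: str, temporary: bool = False):
        self.content = content
        self.temporary = temporary
        self.created = time.time()
        self.last_access = self.created
        self.hits = 0

    def touch(self) -> None:
        self.hits += 1
        self.last_access = time.time()

    def score(self, now: float) -> float:
        # Importance grows with access frequency, decays with staleness.
        return self.hits / (1 + (now - self.last_access) / DAY)

def forget(items: List[MemoryItem], now: Optional[float] = None,
           keep: int = 100) -> List[MemoryItem]:
    now = now or time.time()
    kept = [it for it in items
            # time-based: temporary info expires after 24 h
            if not (it.temporary and now - it.created > DAY)
            # time-based: low-frequency memory is dropped after 30 days
            and not (now - it.last_access > 30 * DAY and it.hits < 2)]
    # importance-based: keep only the highest-scoring items, LRU-style
    kept.sort(key=lambda it: it.score(now), reverse=True)
    return kept[:keep]
```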

3.4 Action Module

This is the "hands" of the agent, enabling it to actually do things.

The core of the action module is Function Calling, the dominant method today. Pre‑defined functions (search web, query DB, send email, etc.) are described to the LLM with their signatures. When a user request arrives, the model decides which function to invoke, extracts arguments, executes it, and returns the result.

Function calling has evolved from single calls to multi‑step, parallel, and retry‑capable workflows.
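The dispatch side of function calling can be sketched as follows. The schema mimics the common JSON-Schema tool format, and choose_tool stands in for the model's decision (a real model would receive the schemas and return the function name plus extracted arguments):

```python
import json
from typing import Callable, Dict

# Sketch of the dispatch side of function calling. The schema follows
# the common JSON-Schema tool style; choose_tool() stands in for the
# LLM's decision about which function to invoke and with what arguments.

def search_flights(origin: str, destination: str) -> str:
    return json.dumps([{"flight": "CA1501", "price": 850}])

TOOLS: Dict[str, Callable] = {"search_flights": search_flights}

TOOL_SCHEMAS = [{
    "name": "search_flights",
    "description": "Search flights between two cities.",
    "parameters": {
        "type": "object",
        "properties": {"origin": {"type": "string"},
                       "destination": {"type": "string"}},
        "required": ["origin", "destination"],
    },
}]

def choose_tool(user_request: str) -> dict:
    # Placeholder for the model call that sees TOOL_SCHEMAS and decides.
    return {"name": "search_flights",
            "arguments": {"origin": "Beijing", "destination": "Shanghai"}}

call = choose_tool("book me a flight from Beijing to Shanghai")
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # the result goes back to the model for the next step
```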

Anthropic’s Model Context Protocol (MCP) proposes a unified standard: a server provides tools, clients (AI applications) consume them via a standardized protocol, enabling decoupling, reuse, and unified security/permission management.

Safety is paramount. Early Manus deployments suffered a breach where a prompt caused the agent to package the entire execution environment’s code. Modern agents employ sandboxed containers, resource limits, role‑based permissions, and audit logs to prevent such abuse.

Complex tasks often require orchestrating multiple actions, which is handled by workflow engines that manage step order, data passing, error handling, branching, loops, and parallelism.
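A minimal sketch of such an engine, with ordered steps, data passing, and per-step retries (branching, loops, and parallelism are omitted for brevity):

```python
from typing import Any, Callable, List, Tuple

# Minimal workflow-engine sketch: ordered steps, each step's output
# feeding the next, with per-step retries on failure.

Step = Tuple[str, Callable[[Any], Any]]

def run_workflow(steps: List[Step], data: Any, retries: int = 2) -> Any:
    for name, fn in steps:
        for attempt in range(retries + 1):
            try:
                data = fn(data)   # pass the result along the chain
                break
            except Exception as exc:
                if attempt == retries:
                    raise RuntimeError(f"step '{name}' failed") from exc
    return data

steps = [
    ("search",  lambda req: {"flights": ["CA1501", "MU5101"], **req}),
    ("compare", lambda d: {**d, "best": min(d["flights"])}),
    ("book",    lambda d: f"booked {d['best']} for {d['user']}"),
]
print(run_workflow(steps, {"user": "alice"}))
```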

Future directions include autonomous discovery of new APIs, learning new actions, and physical world interaction via robots or IoT devices, moving agents from "talking" to "doing".

4. Current Limitations

Understanding the limits of AI agents is essential for responsible use and future improvement.

4.1 Hallucination Issues

All LLM-based agents suffer from hallucinations: fabricating nonexistent APIs, inventing results, or overestimating their own capabilities. In multi-step tasks, a small error can cascade and collapse the entire workflow. Detecting hallucinations is hard because they often appear plausible.

4.2 Reliability Deficits

Agent performance can vary across runs due to model randomness, context misinterpretation, or environmental changes. High‑stakes domains such as finance, healthcare, or industrial control still cannot rely on agents alone; they remain assistive tools.

4.3 Cost and Efficiency

Running a full‑featured AI agent is expensive. Model invocation costs (especially GPT‑4, Claude) accumulate quickly with multiple calls per task. Latency can reach tens of seconds or minutes, making real‑time use challenging. Techniques like model distillation and caching help but trade‑offs remain.

4.4 Security and Privacy Challenges

Agents need extensive data access, raising risks of data leakage (sending sensitive info to LLMs), prompt injection attacks, and permission abuse. Industry mitigations include differential privacy, sandboxing, fine‑grained access control, and continuous auditing, yet the arms race continues.

4.5 Understanding and Reasoning Limits

Despite impressive abilities, agents still lack deep commonsense reasoning and struggle with long‑chain inference or creative problem solving. They excel at straightforward tasks like booking a flight but falter on complex itinerary planning that requires budgeting, timing, and personal preferences. Even advanced models like o1 fall short of human‑level reasoning.

5. Closing Thoughts

The development of AI agents is just beginning. Two years ago we marveled at ChatGPT’s conversational abilities; today agents can write code, analyze data, and devise plans. For technologists, this is an unprecedented opportunity to shape the future.

However, we must stay grounded: AI agents are tools, not magic. They boost efficiency but cannot replace human creativity and judgment.

In the future, everyone may have a personal AI‑agent team, just as smartphones are now ubiquitous. The present moment marks the start of that future.

Written by Architecture and Beyond

Focused on AIGC SaaS technical architecture and tech team management, sharing insights on architecture, development efficiency, team leadership, startup technology choices, large-scale website design, and high-performance, highly available, scalable solutions.