Building a Fully Autonomous AI Data Analyst: Agent Architecture & Planning
This article explains how to build a self-directed AI data analyst, covering agent fundamentals; core modules such as planning, memory, and tool scheduling; practical development steps; multi-agent collaboration; evaluation benchmarks; and a real-world example of stock backtesting.
Agent Overview
An AI Agent (or autonomous agent) is a system that can perceive its environment, make decisions, and execute tasks to achieve specific goals. In data analysis, an agent acts as a virtual analyst that can retrieve data, run queries, perform calculations, and generate reports without the user writing any code.
Core Architecture and Modules
1.1 What Is an Agent?
An agent consists of four core capabilities:
Environment perception – acquiring multimodal data through sensors or APIs.
Intelligent decision‑making – using deep learning or reinforcement learning to choose actions.
Task execution – invoking tools or APIs to perform concrete operations.
Continuous learning – online or transfer learning to improve over time.
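The four capabilities above can be sketched as a minimal perceive-decide-act loop. This is an illustrative toy, not a real agent: `decide()` stands in for an LLM or RL policy, and the "learning" is just an experience log.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleAgent:
    """Toy agent: perceive -> decide -> act, with an experience log."""
    history: list = field(default_factory=list)

    def perceive(self, environment: dict) -> dict:
        # Environment perception: read whatever signals are available.
        return {"metric": environment.get("metric", 0)}

    def decide(self, observation: dict) -> str:
        # Intelligent decision-making: stand-in for an LLM/RL policy.
        return "alert" if observation["metric"] > 100 else "log"

    def act(self, action: str, observation: dict) -> str:
        # Task execution plus a crude form of continuous learning:
        # every (action, observation) pair is kept for later reuse.
        self.history.append((action, observation))
        return f"{action}:{observation['metric']}"

    def step(self, environment: dict) -> str:
        obs = self.perceive(environment)
        return self.act(self.decide(obs), obs)

agent = SimpleAgent()
print(agent.step({"metric": 150}))  # alert:150
print(agent.step({"metric": 42}))   # log:42
```

In a real system each method would be backed by the modules described below: perception by data connectors, decision-making by the LLM, and execution by the tool scheduler.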
1.2 Agent Framework
The agent’s brain is a large language model (LLM) that provides planning, reasoning, and language generation. Supporting modules include:
Planning: decomposes complex tasks, creates execution plans, and coordinates subtasks.
Memory System: short-term, mid-term, and long-term stores that overcome the limited context window of LLMs.
Tool/Function Scheduling: uses function calling or the Model Context Protocol (MCP) to invoke external services.
1.3 Agent Classification
Agents can be grouped into four paradigms:
Reflection (e.g., ReAct, Self‑Refine) – the model iteratively reflects on its actions.
Tool Use – the model calls external functions or APIs.
Planning – the model creates a hierarchical plan before execution.
Multi‑Agent Collaboration – multiple agents coordinate via protocols such as A2A.
Planning Module
Planning improves efficiency and accuracy for complex tasks. Key techniques include Hierarchical Task Networks (HTN) and Monte-Carlo Tree Search (MCTS). A typical planning loop follows the ReAct pattern: thought → action → observation → answer, similar to the PDCA (plan-do-check-act) cycle.
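The ReAct loop can be sketched with a mock model in place of a real LLM. Everything here is illustrative: `fake_llm` is a hard-coded stand-in policy, and the single `calculator` tool exists only for the demo.

```python
def calculator(expression: str) -> str:
    # Demo-only arithmetic tool; eval is unsafe for untrusted input.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def fake_llm(transcript: str) -> str:
    # Stand-in policy: call the tool once, then answer from the observation.
    if "Observation:" not in transcript:
        return "Thought: I need to compute.\nAction: calculator[17 * 3]"
    return "Answer: " + transcript.rsplit("Observation: ", 1)[1].strip()

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_llm(transcript)
        if step.startswith("Answer:") or "\nAnswer:" in step:
            return step.split("Answer:", 1)[1].strip()
        # Parse "Action: tool[input]" and execute the tool (the action phase).
        action_line = [s for s in step.splitlines() if s.startswith("Action:")][0]
        name, arg = action_line[len("Action: "):].rstrip("]").split("[", 1)
        observation = TOOLS[name](arg)  # the observation phase
        transcript += f"\n{step}\nObservation: {observation}"
    return "No answer within step budget."

print(react("What is 17 * 3?"))  # 51
```

The loop's termination condition (an `Answer:` line) and the step budget are the two controls that keep the thought-action-observation cycle from running forever.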
Memory System
Memory addresses the LLM’s limited context window by storing and retrieving information at three levels:
Short-Term Memory (STM): recent dialogue embedded directly in the prompt.
Mid-Term Memory (MTM): summarized topic chunks, often built with RAG techniques.
Long-Term Memory (LTM): persistent knowledge bases or vector databases accessed via retrieval-augmented generation.
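A hedged sketch of the three tiers follows, with naive keyword overlap standing in for the embedding-based retrieval a real RAG pipeline would use; the class and its policies are invented for illustration.

```python
class TieredMemory:
    def __init__(self, stm_size: int = 4):
        self.stm = []              # short-term: recent turns, kept verbatim
        self.mtm = []              # mid-term: summarized topic chunks
        self.ltm = []              # long-term: persistent facts (vector DB in practice)
        self.stm_size = stm_size

    def add_turn(self, turn: str):
        self.stm.append(turn)
        if len(self.stm) > self.stm_size:
            # Overflowing turns are compressed into a mid-term summary.
            old = self.stm.pop(0)
            self.mtm.append("summary: " + old[:40])

    def remember(self, fact: str):
        self.ltm.append(fact)

    def build_context(self, query: str) -> str:
        # Retrieval-augmented prompt: recent turns plus any stored item
        # sharing at least one word with the query (stand-in for vector search).
        words = set(query.lower().split())
        hits = [m for m in self.mtm + self.ltm if words & set(m.lower().split())]
        return "\n".join(hits + self.stm)

mem = TieredMemory(stm_size=2)
mem.remember("user prefers csv exports")
for turn in ["hi", "show revenue", "by region"]:
    mem.add_turn(turn)
print(mem.build_context("export revenue as csv"))
```

The design choice worth noting is the demotion path: content never disappears, it only moves from verbatim STM to summarized MTM, so the prompt stays small while older context remains retrievable.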
Tool / Function Scheduling
Function calling transforms natural-language intent into executable API calls. It addresses two major limitations of a bare LLM: outdated knowledge (e.g., questions about 2025 events) and the inability to act on the real world. Typical workflow:
User request → LLM decides whether a tool is needed.
If needed, LLM outputs a function_call with name and parameters.
The host system validates permissions and executes the function.
Result is fed back to the LLM, which generates a final natural‑language answer.
Common pitfalls include missing logs, connection interruptions, performance bottlenecks, and inconsistent output schemas.
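The four-step workflow can be mocked end to end as follows. This is a hedged sketch: the `get_price` tool, its registry, and the permission set are all made up for illustration, though the tool schema follows the OpenAI-style function-calling format.

```python
import json

# Step 1 context: the OpenAI-style schema the model sees (get_price is hypothetical).
TOOL_SCHEMA = [{
    "type": "function",
    "function": {
        "name": "get_price",
        "description": "Latest price for a stock ticker",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

REGISTRY = {"get_price": lambda ticker: {"AAPL": 190.0}.get(ticker, 0.0)}
ALLOWED = {"get_price"}  # host-side permission list

def dispatch(function_call: dict) -> str:
    name = function_call["name"]
    if name not in ALLOWED:                    # step 3: validate permissions
        raise PermissionError(name)
    args = json.loads(function_call["arguments"])
    result = REGISTRY[name](**args)            # step 3: execute the function
    return json.dumps({"result": result})      # step 4: fed back to the LLM

# Pretend the model emitted this function_call (step 2):
call = {"name": "get_price", "arguments": json.dumps({"ticker": "AAPL"})}
print(dispatch(call))  # {"result": 190.0}
```

Returning a consistent JSON envelope from `dispatch` is one way to avoid the inconsistent-output-schema pitfall mentioned above.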
Example Code (Python)
from openai import OpenAI
import os

# Reads credentials from the environment; replace "BASE_URL" with your
# provider's API endpoint.
client = OpenAI(
    api_key=os.getenv("API_KEY"),
    base_url="BASE_URL",
)

def function_calling(messages, tools):
    completion = client.chat.completions.create(
        model="qwen-plus",  # any model with function-calling support
        messages=messages,
        tools=tools,
    )
    print("Response:")
    print(completion.choices[0].message.model_dump_json())
    return completion

MCP (Model Context Protocol)
MCP standardises tool registration, invocation, and result handling. It separates the host (application) from the tool server, providing security checks and unified JSON schemas. The workflow is:
User → Host sends request + tool list to MCP client.
MCP client asks LLM to generate a tool‑call plan.
Host approves the plan (permission, risk checks).
Host forwards the call to the MCP server, which executes the tool.
Result travels back through MCP server → Host → MCP client → LLM → User.
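The five-hop flow above can be sketched with in-process stand-ins. Real MCP uses JSON-RPC messages between separate processes; the function names and the `row_count` tool here are invented for the demo.

```python
def mcp_server_execute(tool: str, args: dict) -> dict:
    # MCP server: the only component that actually runs tools.
    tools = {"row_count": lambda table: {"rows": {"sales": 1200}.get(table, 0)}}
    return tools[tool](**args)

def mcp_client_plan(request: str) -> dict:
    # MCP client asks the LLM for a tool-call plan (hard-coded stand-in).
    return {"tool": "row_count", "args": {"table": "sales"}}

def host_approve(plan: dict) -> bool:
    # Host-side permission/risk check before any call leaves the application.
    return plan["tool"] in {"row_count"}

def handle(request: str) -> dict:
    plan = mcp_client_plan(request)          # User -> Host -> MCP client -> LLM
    if not host_approve(plan):               # Host approves the plan
        return {"error": "denied"}
    return mcp_server_execute(plan["tool"], plan["args"])  # Host -> MCP server

print(handle("How many rows in sales?"))  # {'rows': 1200}
```

The separation matters: the client plans, the host approves, and only the server executes, so a compromised or hallucinating planner still cannot bypass the host's checks.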
Evaluation Benchmarks
Current evaluation of agents relies on a variety of benchmarks covering data analysis, multi‑turn dialogue, and tool use. Representative datasets include:
AgentBench (arXiv:2308.03688)
InfoQuest (arXiv:2502.12257)
MINT (2023)
ToolBench, GTA, ToolDial, WorkBench, DataSciBench, etc.
Practical Example: Stock Backtesting
The article demonstrates the agent’s capability with a stock‑backtesting scenario, where the agent automatically retrieves market data, runs a simulation, and produces a concise report—all without user‑written code.
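The kind of script such an agent might generate can be sketched as a toy moving-average-crossover backtest. The strategy, prices, and parameters are invented for illustration; a real run would pull market data through a tool call.

```python
def moving_average(prices, n, i):
    # Average of the n prices ending at index i (inclusive).
    return sum(prices[i - n + 1:i + 1]) / n

def backtest(prices, fast=3, slow=5, cash=1000.0):
    shares = 0.0
    for i in range(slow - 1, len(prices)):
        f = moving_average(prices, fast, i)
        s = moving_average(prices, slow, i)
        if f > s and shares == 0:           # golden cross: buy all-in
            shares, cash = cash / prices[i], 0.0
        elif f < s and shares > 0:          # death cross: sell everything
            cash, shares = shares * prices[i], 0.0
    return cash + shares * prices[-1]       # mark open position to market

prices = [10, 10, 10, 10, 10, 11, 12, 13, 12, 11, 10, 9]
print(f"final equity: {backtest(prices):.2f}")  # final equity: 909.09
```

On this synthetic series the strategy buys into the rally and sells after the peak, ending below its starting cash, which is exactly the kind of result the agent's report should surface rather than hide.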
Key Takeaways
Agent effectiveness depends on the synergy of planning, memory, and tool scheduling.
Context engineering (prompt design, KV‑cache optimisation) is crucial for latency and cost.
Dynamic constraints (logits masking, state‑machine management) keep tool selection tractable.
File‑system extensions and layered feedback improve scalability for long inputs.
Human‑in‑the‑loop should be triggered at well‑defined risk thresholds.
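The state-machine constraint from the takeaways can be illustrated in miniature. In a real decoder the disallowed tools' token logits would be masked to negative infinity; here a dictionary filter plays that role, and all tool names and states are hypothetical.

```python
# Which tools are selectable in each dialogue state (illustrative).
ALLOWED_BY_STATE = {
    "gathering": {"run_query", "fetch_docs"},
    "reporting": {"render_chart", "export_csv"},
}

def mask(candidates: dict, state: str) -> dict:
    # Drop (mask) any candidate tool not permitted in the current state,
    # mimicking logits masking during constrained decoding.
    allowed = ALLOWED_BY_STATE[state]
    return {tool: score for tool, score in candidates.items() if tool in allowed}

scores = {"run_query": 0.7, "export_csv": 0.9, "fetch_docs": 0.4}
masked = mask(scores, "gathering")
print(max(masked, key=masked.get))  # run_query
```

Note that `export_csv` had the highest raw score but is unreachable while gathering data, which is the point: the state machine keeps tool selection tractable regardless of what the model prefers.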
References
[1] AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688.
[2] InfoQuest: Evaluating Multi‑Turn Dialogue Agents for Open‑Ended Information Seeking. arXiv:2502.12257.
[3] MINT: A New Benchmark Tailored for LLMs' Multi‑Turn Interactions. Blog post.
[4] ToolBench: An Instruction‑tuning Dataset for Tool Use. Papers With Code.
[5] GTA: A Benchmark for General Tool Agents. arXiv:2407.08713.
[6] ToolDial: Multi‑turn Dialogue Generation Method for Tool‑Augmented LMs. arXiv:2503.00564.
[7] AgentBoard: An Analytical Evaluation Board of Multi‑turn LLM Agents. NeurIPS 2024.
[8] WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting. arXiv:2405.00823.
[9] DataSciBench: An LLM Agent Benchmark for Data Science. arXiv:2502.13897.
[10] A Survey of Large Language Models.
[11] Understanding the Planning of LLM Agents: A Survey. arXiv:2402.02716.
[12] MemoryOS.
[13] AutoGen: Enabling Next‑Gen LLM Applications via Multi‑Agent Conversation.
[14] LangChain documentation.
[15] Context Engineering Survey for LLMs. arXiv:2507.13334.
[16] MetaGPT GitHub repository.
[17] CrewAI GitHub repository.
[18] BabyAGI GitHub repository.
[19] LlamaIndex documentation.
[20] Semantic Kernel GitHub repository.
[21] Auto‑GPT GitHub repository.
[22] AgentGPT website.
[23] Langroid GitHub repository.
[24] Haystack documentation.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
