How We Built an AI‑Powered Data Agent to Automate Data Retrieval at Scale

This article details the design and implementation of Matra, an AI‑driven data assistant for a large e‑commerce platform, covering the challenges of legacy data assets, knowledge‑base construction, GraphRAG integration, multi‑stage agent frameworks, practical results, and future plans for continuous improvement.

Alibaba Cloud Developer

Background

Rapid advances in large language models (LLMs) have expanded AI capabilities beyond image recognition to complex decision‑making and autonomous execution. In this context, the "Data Agent" concept emerged to address inefficiencies in data‑driven workflows at a ten‑year‑old e‑commerce data platform that now holds tens of thousands of tables and scheduling nodes.

Governance challenges: inconsistent naming, duplicate semantics, and orphaned "zombie" tables.

Knowledge gaps: critical business logic (metrics, dimensions, dependencies) scattered across scripts, reports, and documents.

Team overload: data engineers juggle site maintenance, ad‑hoc queries, and new feature development, limiting focus on core model building.

AI‑Driven Demand

Front‑line business users (e.g., product assistants) spend extensive manual effort aggregating data from multiple systems for daily reports. The main pain points identified:

High labor cost: repetitive data extraction and cleaning.

Low efficiency: cumbersome cross‑system workflows.

To transform "human‑fetch‑data" into "AI‑fetch‑analyze‑use", the Matra project was launched, aiming to enable natural‑language interaction for data retrieval and analysis.

Knowledge‑Base Design

The knowledge base had to meet three requirements:

No data‑model reconstruction: leverage existing tables without redesign.

Scalable: support incremental additions via simple documents.

High quality: focus on core assets to avoid noisy inputs.

Key metadata captured for each data request includes metric definition, data granularity, data range, and table/column details. This information directly maps to the components required for SQL generation (Metric Logic, Entity, Attribute, Table, Columns).
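The mapping from captured metadata to SQL components can be sketched as follows. This is a minimal illustration, not Matra's actual code; the record fields and table names are assumptions chosen to mirror the components listed above.

```python
from dataclasses import dataclass, field

@dataclass
class MetricRequest:
    """Hypothetical record of the metadata captured for one data request."""
    metric_logic: str              # Metric Logic, e.g. an aggregate expression
    entity: str                    # Entity: the data granularity column
    attributes: list               # Attributes: extra dimension columns
    table: str                     # Table holding the data
    date_column: str               # Column defining the data range

def to_sql(req: MetricRequest, start: str, end: str) -> str:
    """Assemble a SQL query directly from the captured metadata fields."""
    dims = ", ".join([req.entity, *req.attributes])
    return (
        f"SELECT {dims}, {req.metric_logic} AS metric\n"
        f"FROM {req.table}\n"
        f"WHERE {req.date_column} BETWEEN '{start}' AND '{end}'\n"
        f"GROUP BY {dims}"
    )

# Illustrative request: monthly pay amount per shop, split by category.
req = MetricRequest("SUM(pay_amount)", "shop_id", ["category"],
                    "dwd_trade_order", "ds")
sql = to_sql(req, "2024-08-01", "2024-08-31")
print(sql)
```

Because each metadata field maps one‑to‑one onto a SQL clause, a complete record is sufficient to generate a well‑formed query without inspecting the warehouse itself.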

Maintenance Approach

Initially, knowledge was maintained in DingTalk documents (quick updates for core assets). As usage grew, a semi‑automated platform was built to store metadata in a database, provide a UI for editing, and synchronize changes back to the data warehouse.

Platform features:

Table registration with automatic parsing of ODPS metadata.

Field management with granularity and primary‑key flags.

Entity, attribute, and metric entry forms with validation and duplicate checks.

These controls dramatically reduced manual effort and ensured consistent, searchable knowledge.
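The validation and duplicate checks can be sketched with a minimal in‑memory registry. The class and field names here are illustrative assumptions, not the platform's real schema; the second check guards against the duplicate‑semantics problem noted in the governance challenges.

```python
class KnowledgeRegistry:
    """Minimal sketch of the platform's metric validation and duplicate checks."""

    def __init__(self):
        self.metrics = {}  # name -> definition

    def register_metric(self, name: str, definition: str) -> None:
        # Validation: both fields are mandatory.
        if not name or not definition:
            raise ValueError("metric name and definition are required")
        # Duplicate check on the name.
        if name in self.metrics:
            raise ValueError(f"duplicate metric name: {name}")
        # Duplicate check on the semantics: reject a second name
        # for a formula that is already registered.
        for existing, d in self.metrics.items():
            if d == definition:
                raise ValueError(f"definition already registered as '{existing}'")
        self.metrics[name] = definition

registry = KnowledgeRegistry()
registry.register_metric("gmv", "SUM(pay_amount)")
```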

GraphRAG Integration

Traditional Retrieval‑Augmented Generation (RAG) struggled with multi‑table relationships and hallucinated entities. By constructing a structured knowledge graph that captures table‑field relationships, entities, and metrics, GraphRAG improves recall accuracy and provides explainable reasoning paths.

Query processing steps:

Tokenize and map user query to entities using a fuzzy‑word and formula library.

Identify anchor metrics, locate candidate tables, and select top‑K tables covering all target entities.

Extract the minimal sub‑graph (shortest paths) that connects the required entities.
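The sub‑graph extraction in the last step can be sketched with plain breadth‑first search over a table‑field adjacency list. The graph contents below are invented for illustration; a production graph would be far larger and the pairwise‑shortest‑path union is only one simple approximation of a minimal connecting sub‑graph.

```python
from collections import deque

# Hypothetical table-field graph: tables link to the entities/fields they
# contain, and fields link back to the tables that contain them.
GRAPH = {
    "dwd_order":  ["shop_id", "user_id", "pay_amount"],
    "dim_shop":   ["shop_id", "category"],
    "shop_id":    ["dwd_order", "dim_shop"],
    "user_id":    ["dwd_order"],
    "pay_amount": ["dwd_order"],
    "category":   ["dim_shop"],
}

def shortest_path(src: str, dst: str) -> list:
    """BFS returning one shortest path between two nodes (empty if none)."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in GRAPH.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []

def minimal_subgraph(entities: list) -> set:
    """Union of pairwise shortest paths connecting the target entities."""
    nodes = set()
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            nodes.update(shortest_path(a, b))
    return nodes

# Connecting a metric field to a dimension pulls in the join path:
# the fact table, the shared key, and the dimension table.
nodes = minimal_subgraph(["pay_amount", "category"])
print(nodes)
```

The recovered path itself is what makes the retrieval explainable: it spells out which tables must be joined, and on which key, to answer the query.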

Agent Framework

Matra’s core algorithm does not fine‑tune the underlying LLM; instead, it orchestrates prompt engineering, knowledge‑graph retrieval, and execution modules. Three main challenges were addressed:

Accurate intent recognition and table recall.

Robust NL‑to‑SQL translation.

Reliable execution of multi‑table, complex tasks.

Solution components:

Intent + Graph module: parses user intent and retrieves relevant tables.

ReAct framework: guides the LLM through step‑wise reasoning and validates generated SQL syntax and semantics.

Plan & Execute framework: decomposes tasks into atomic sub‑queries, schedules execution, and handles re‑planning on errors.
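The Plan & Execute component can be sketched as a loop over planner‑generated steps with a bounded replan path. Everything here is illustrative: the planner and sub‑agents stand in for LLM calls, and the step names are assumptions.

```python
def plan(task: str) -> list:
    """Stand-in for the LLM planning node: decompose a task into atomic steps."""
    return ["filter_date", "join_dimension", "aggregate"]

# Stand-ins for sub-agents; each consumes and returns a context dict.
SUB_AGENTS = {
    "filter_date":    lambda ctx: {**ctx, "rows": "filtered"},
    "join_dimension": lambda ctx: {**ctx, "rows": "joined"},
    "aggregate":      lambda ctx: {**ctx, "result": "summary table"},
}

def execute(task: str, max_replans: int = 2) -> dict:
    """Run the plan step by step; on failure, rebuild the plan (Replan node)."""
    steps, ctx, attempts = plan(task), {}, 0
    while steps:
        step = steps.pop(0)
        try:
            ctx = SUB_AGENTS[step](ctx)
        except Exception:
            attempts += 1
            if attempts > max_replans:
                raise
            steps, ctx = plan(task), {}  # Replan: start over with a fresh plan
    return ctx

result = execute("monthly GMV per shop by category")
```

Keeping each step atomic is what makes re‑planning cheap: a failed sub‑query invalidates only the remaining plan, not work that other sub‑agents have already validated.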

Execution flow:

User input → intent parsing.

Core agent triggers Plan & Execute planner.

Planning node creates a step‑wise plan (e.g., filter date, join dimension).

Execute node runs sub‑agents (data_collector, sql_executor) to fetch data and generate SQL.

If errors occur, a Replan node revises the plan.

Summarize node validates results and produces output (Markdown table or Excel file).
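The Markdown output path of the Summarize node can be sketched as a small renderer. This is an assumed implementation for illustration; the real node also performs validation and the Excel export, which are omitted here.

```python
def summarize(rows: list) -> str:
    """Render result rows (list of dicts) as a Markdown table."""
    if not rows:
        return "_no data_"
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)

md = summarize([{"shop_id": 101, "gmv": 1234},
                {"shop_id": 102, "gmv": 987}])
print(md)
```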

Practical Results

Since August, the system has handled thousands of queries from data engineers and analysts with >85% accuracy for asset‑lookup queries and >75% accuracy for complex analytical requests. Example successes include correct table identification, accurate metric calculations, and automated Excel exports.

Future Roadmap

Key focus areas:

Improve recall precision: expand a case library of high‑frequency SQL templates and enhance semantic matching.

Knowledge freshness: implement event‑driven updates from metadata changes, enforce pre‑/mid‑/post‑process knowledge maintenance, and explore automated knowledge extraction via code analysis.

Quality scoring: introduce scoring for tables, fields, metrics, and dimensions to prioritize high‑quality assets and avoid "data poisoning".
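One simple shape the quality scoring could take is a weighted combination of ownership, documentation, freshness, and usage signals. The fields and weights below are entirely assumed, offered only to make the idea concrete; any production formula would be tuned against real assets.

```python
def quality_score(asset: dict) -> float:
    """Score an asset in [0, 1]; higher scores rank earlier during recall."""
    has_owner  = 1.0 if asset.get("owner") else 0.0
    documented = 1.0 if asset.get("description") else 0.0
    freshness  = 1.0 if asset.get("updated_within_30d") else 0.3
    usage      = min(asset.get("monthly_queries", 0) / 100, 1.0)
    return 0.2 * has_owner + 0.2 * documented + 0.3 * freshness + 0.3 * usage

# A well-owned, documented, fresh, heavily used table scores at the top.
score = quality_score({"owner": "team-a", "description": "order facts",
                       "updated_within_30d": True, "monthly_queries": 250})
```

Down‑weighting undocumented or stale assets at recall time is what keeps low‑quality tables from "poisoning" the agent's answers.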

By continuously refining the knowledge base, expanding the graph, and tightening the agent orchestration, the goal is to evolve Matra from a usable prototype to a trustworthy, self‑evolving AI data platform.

Tags: data engineering, AI, LLM, Knowledge Graph, Data Retrieval
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.
