How Datus AI Is Redefining Data Engineering with an Open‑Source Data Agent

This article examines Datus AI’s open‑source Data Engineering Agent, detailing its architecture, interactive context engineering, evaluation results, and future roadmap, and explains how it tackles the challenges of scaling AI‑driven data workflows.

DataFunSummit

Apr 5, 2026

Datus AI is an open‑source Data Engineering Agent project that has been active for about two months and is already being tested by overseas companies such as LinkedIn, Expedia, and Coinbase, as well as several large domestic clients.

Why Data Engineering Agents Are Needed

Putting data‑engineering agents into production at scale runs into three hurdles. First, traditional RAG (retrieval‑augmented generation) and fixed‑workflow approaches generalise poorly. Second, the implicit knowledge in real data warehouses is hard to extract and formalise, which makes both problem definition and result evaluation imprecise. Third, stable delivery requires building a continuous feedback and online‑learning loop so that models can evolve from human corrections and multi‑turn interactions.

Core Innovation – Iterative Context Engineering

Instead of merely creating another generic conversational Data Agent or Chat‑BI tool, Datus focuses on an iterative context‑engineering pipeline. The product acts as a "Copilot for data engineers", constructing a Data Context system that supports reliable agent operation and continuous optimisation in real workflows.

Expanded Role of Data Engineering in the AI Era

Data engineering now goes beyond building warehouses and delivering dashboards. It must provide natural‑language query interfaces for human users and API services for other agents, and it must align existing metrics and semantic layers with various BI tools. The industry is pushing for standardised semantic layers that translate domain‑specific jargon into SQL, which shifts engineers from writing SQL by hand to building high‑quality, evolvable data contexts.

System Architecture

Adapter Layer: Connects multiple large language models, data warehouses (Snowflake, StarRocks, Trino, Redshift, etc.), metadata catalog and metrics services (Polaris, MetricFlow), and BI tools.

Core Context Engine: Aggregates metadata, metric definitions, historical SQL, and documentation into a unified data context, and exposes its capabilities through standardised tool interfaces.

Design Philosophy – Agentic Loop: Chooses an agent‑native interaction model over static workflows, so that the system benefits from ongoing LLM improvements and stays reusable and adaptable.

Tools & Integration: The primary interface is the Datus CLI, which offers Copilot‑style coding assistance and can create sub‑agents that are embedded in BI sidebars (e.g., Superset).

Feedback Loop and Evaluation

The system builds a feedback loop where the LLM generates multiple answers to the same question, compares them, and extracts new knowledge (e.g., JOIN logic, null‑handling rules). This knowledge is fed back into the context store, improving future responses.

Benchmarking the built‑in NL2SQL framework shows that without context the LLM achieves ~50% accuracy, while injecting historical SQL and metric context raises accuracy to >80%. Parallel testing and consistency checks can further boost performance at the cost of extra compute.
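The comparison step and the consistency check described above can be sketched as a majority vote over execution results. This is a minimal illustration and not Datus's implementation: the `execute` and `vote` helpers, the `orders` table, and the candidate queries are all hypothetical, and SQLite stands in for a real warehouse.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run one candidate; return an order-insensitive, hashable result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def vote(conn, candidates):
    """Return (winning_sql, agreement_ratio) by majority vote over execution results."""
    results = {sql: execute(conn, sql) for sql in candidates}
    valid = {sql: res for sql, res in results.items() if res is not None}
    if not valid:
        return None, 0.0
    winner_result, count = Counter(valid.values()).most_common(1)[0]
    winner_sql = next(sql for sql, res in valid.items() if res == winner_result)
    return winner_sql, count / len(candidates)


# Hypothetical sample data standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, status TEXT);
    INSERT INTO orders VALUES (1, 10.0, 'paid'), (2, 5.0, 'refunded'), (3, 7.5, 'paid');
""")

# Hypothetical answers an LLM might produce for "total paid revenue".
candidates = [
    "SELECT SUM(amount) FROM orders WHERE status = 'paid'",
    "SELECT SUM(amount) FROM orders WHERE status IN ('paid')",
    "SELECT SUM(amount) FROM orders",   # ignores the status filter
    "SELECT SUM(amout) FROM orders",    # misspelt column, fails to execute
]
best_sql, agreement = vote(conn, candidates)
```

The candidates that lose the vote are the interesting part for a feedback loop: the difference between a winner and a loser (here, the missing status filter) is the kind of rule that could be written back into the context store. Each extra candidate costs one more LLM call and one more query execution, which is the compute trade‑off mentioned above.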

Interactive Context Construction

During knowledge‑base initialisation, agents automatically read historical SQL to extract table schemas, data samples, and comments, and store them in a vector database with hybrid search. The system then generates two structured context trees: a catalog tree that maps database objects, and a subject tree that organises business domains together with their metrics, SQL templates, and documentation.
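As a rough sketch of how a catalog tree might be derived from historical SQL, the snippet below pulls qualified table names out of `FROM` and `JOIN` clauses and nests them by catalog, schema, and table. The `Node` class, the regular expression, and the sample queries are assumptions for illustration; a production system would use a real SQL parser and would also attach schemas, samples, and comments to each node.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: dict = field(default_factory=dict)

    def child(self, name):
        """Return the named child, creating it on first use."""
        return self.children.setdefault(name, Node(name))


# Simplified pattern: an identifier (optionally dotted) right after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def build_catalog_tree(sql_history):
    """Nest every referenced table under its dotted path components."""
    root = Node("catalog")
    for sql in sql_history:
        for qualified in TABLE_REF.findall(sql):
            node = root
            for part in qualified.split("."):
                node = node.child(part)
    return root


def paths(node, prefix=""):
    """Yield every path in the tree in sorted, depth-first order."""
    for name in sorted(node.children):
        path = f"{prefix}/{name}"
        yield path
        yield from paths(node.children[name], path)


# Hypothetical query history.
history = [
    "SELECT * FROM sales.public.orders o JOIN sales.public.customers c ON o.cid = c.id",
    "select count(*) from marketing.web.sessions",
]
tree = build_catalog_tree(history)
```

A subject tree could reuse the same `Node` structure, with business domains as the upper levels and metrics, SQL templates, and documents as leaves.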

Metrics are defined in YAML and exposed via APIs for direct querying or reusable SQL generation. The system can also parse BI dashboards (e.g., Superset) to extract underlying SQL, automatically deriving common metrics and dimensions for the context engine.
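To make the idea of reusable SQL generation from a metric definition concrete, here is a small sketch. The `Metric` fields mirror what a YAML metric file might contain, but the field names and the `render_sql` helper are hypothetical and do not reflect the actual Datus or MetricFlow schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    # Hypothetical fields; a YAML file would carry the same keys.
    name: str
    table: str
    expression: str
    filters: tuple = ()


def render_sql(metric, dimensions=()):
    """Turn one metric definition into a query, optionally grouped by dimensions."""
    select = [*dimensions, f"{metric.expression} AS {metric.name}"]
    sql = f"SELECT {', '.join(select)} FROM {metric.table}"
    if metric.filters:
        sql += " WHERE " + " AND ".join(metric.filters)
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql


paid_revenue = Metric(
    name="paid_revenue",
    table="orders",
    expression="SUM(amount)",
    filters=("status = 'paid'",),
)
total_sql = render_sql(paid_revenue)
by_region_sql = render_sql(paid_revenue, dimensions=("region",))
```

Because the filter lives in the definition, every query built from `paid_revenue` applies the same business rule, which is the main reason to define metrics once rather than re‑deriving them in each prompt.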

Standardised Tool Spec

Datus defines two core tool types for LLMs: list (enumeration) and search (lookup), which let models explore data assets much as they would a file system. Tools are wrapped with a Skills layer and exposed over the Model Context Protocol (MCP), so both native and protocol‑based implementations are possible.
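The file‑system analogy for the two tool types can be sketched as follows. The `ASSETS` index, the function names, and their signatures are assumptions made for this example and are not the Datus tool spec; plain substring matching stands in for the hybrid search a real deployment would use.

```python
# Hypothetical asset index: path -> short description.
ASSETS = {
    "sales/public/orders": "Order facts: id, amount, status, created_at",
    "sales/public/customers": "Customer dimension: id, region, segment",
    "marketing/web/sessions": "Web sessions with campaign attribution",
}


def list_assets(path=""):
    """Enumerate the immediate children of a path, like `ls` on a directory."""
    prefix = [p for p in path.split("/") if p]
    children = set()
    for key in ASSETS:
        parts = key.split("/")
        if parts[:len(prefix)] == prefix and len(parts) > len(prefix):
            children.add(parts[len(prefix)])
    return sorted(children)


def search_assets(keyword):
    """Look up assets whose path or description mentions the keyword."""
    needle = keyword.lower()
    return sorted(
        path for path, desc in ASSETS.items()
        if needle in path.lower() or needle in desc.lower()
    )
```

A model would typically call `list_assets()` to orient itself, descend with `list_assets("sales/public")`, and fall back to `search_assets("campaign")` when it does not know where an asset lives.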

Sub‑Agent Architecture

The system is composed of multiple sub‑agents, each representing a specific tool set. They are created and managed via the CLI, and each one works with a dynamically scoped context that selects the relevant sub‑trees from the global catalog and subject trees for a given scenario. This design supports flexible, efficient task execution and provides a clear boundary for future reinforcement‑learning optimisation.
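One way to picture a scoped context is a sub‑agent that carries a tool set plus a list of path patterns, and only sees the parts of the global trees that match. The `SubAgent` class, its fields, and the glob‑style scope patterns below are illustrative assumptions, not the Datus CLI's configuration format.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class SubAgent:
    name: str
    tools: tuple   # names of the tools this sub-agent may call
    scope: tuple   # glob patterns over catalog/subject paths

    def visible(self, paths):
        """Keep only the paths that fall inside this sub-agent's scope."""
        return [p for p in paths if any(fnmatchcase(p, pat) for pat in self.scope)]


# Hypothetical global context paths.
GLOBAL_PATHS = [
    "sales/public/orders",
    "sales/public/customers",
    "marketing/web/sessions",
    "finance/ledger/entries",
]

sales_agent = SubAgent(
    name="sales_analyst",
    tools=("list", "search"),
    scope=("sales/*",),
)
scoped = sales_agent.visible(GLOBAL_PATHS)
```

Narrowing the scope keeps prompts small and reduces the chance of the model joining against unrelated tables; it also gives each sub‑agent a well‑defined task boundary.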

Future Roadmap

Version 0.3 aims to integrate lake‑house components, unifying metadata, historical information, scheduling, and metrics into a comprehensive data context. The team plans to keep the project open‑source, focusing on community‑driven tool ecosystems rather than commercialisation.

Reinforcement learning is being explored for tasks with clear reward signals, such as NL2SQL, where correct execution provides a binary reward. Experiments with small 4B‑parameter models are underway, while larger models still require validation.
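A binary execution reward for NL2SQL can be sketched as below. The article only says that correct execution yields a binary reward; comparing the predicted query's result set against a reference query's result set is an assumption made here, as are the function names and the SQLite sample data.

```python
import sqlite3


def run(conn, sql):
    """Execute a query; return rows in a canonical order, or None if it fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def binary_reward(conn, predicted_sql, reference_sql):
    """1.0 if the prediction executes and matches the reference result, else 0.0."""
    expected = run(conn, reference_sql)
    actual = run(conn, predicted_sql)
    return 1.0 if actual is not None and actual == expected else 0.0


# Hypothetical evaluation database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, status TEXT);
    INSERT INTO orders VALUES (1, 10.0, 'paid'), (2, 5.0, 'refunded'), (3, 7.5, 'paid');
""")

reference = "SELECT SUM(amount) FROM orders WHERE status = 'paid'"
r_equivalent = binary_reward(conn, "SELECT SUM(amount) FROM orders WHERE status IN ('paid')", reference)
r_wrong = binary_reward(conn, "SELECT SUM(amount) FROM orders", reference)
r_invalid = binary_reward(conn, "SELECT SUM(amout) FROM orders", reference)
```

Rewarding on results rather than on query text means syntactically different but equivalent SQL is not penalised. The reward is only as good as the test data, though: a wrong query can still match on a database that is too small to expose the difference.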

Design Principles

Context over Control: Prioritise high‑quality, iteratively improved data context rather than fixed, high‑accuracy workflows.

Simple and Reliable: Validate generated SQL, metrics, and configuration files with built‑in tools to ensure correctness.

Embrace Change: Provide standardised reinforcement‑learning environments and reward functions to accommodate future model upgrades.
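As an illustration of the "Simple and Reliable" principle, generated SQL can be checked against the schema before anything runs. The sketch below uses SQLite's `EXPLAIN`, which compiles a statement without executing the underlying query; the `validate_sql` helper is hypothetical, and a real warehouse would need its own dialect‑aware check.

```python
import sqlite3


def validate_sql(conn, sql):
    """Compile a statement against the schema without running it; return (ok, error)."""
    try:
        conn.execute(f"EXPLAIN {sql}")
    except sqlite3.Error as exc:
        return False, str(exc)
    return True, None


# Hypothetical schema used only for validation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")

ok, _ = validate_sql(conn, "SELECT SUM(amount) FROM orders WHERE status = 'paid'")
bad, error = validate_sql(conn, "SELECT SUM(amout) FROM orders")
```

Returning the error message instead of raising lets an agent feed it straight back to the model as a correction hint on the next turn.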

Overall, Datus AI seeks to provide a standardised, open‑source tool ecosystem that enables data engineers to harness LLM capabilities efficiently, fostering community collaboration and continuous improvement.
