How AI Agents Are Redefining Data Engineering: Expert Insights and Real‑World Practices

In a deep‑dive roundtable, three data‑engineering veterans discuss the rise of AI agents, the importance of data context, memory mechanisms, workflow versus agent trade‑offs, and the future of database intelligence, offering practical strategies and architectural philosophies for building smarter data pipelines.


Guest Introduction

The roundtable featured three experts in data engineering and AI agents: Tang Qing, an OceanBase community specialist; Cui Jing, founder of AskTable and a former Alibaba DBA; and Zhao Heng, a former StarRocks core member and founder of Datu AI.

Why Data‑Engineering Agents?

All speakers agreed that the rapid rise of large language models creates a new opportunity to automate repetitive data‑engineering tasks such as SQL authoring, CSV cleaning, and dashboard generation. By moving the workload from junior analysts to AI agents, organizations can improve efficiency and reduce manual errors.

Core Value of AI Agents in Data Engineering

Agents excel at extracting structured insights from raw data, a niche where LLMs have strong performance. The discussion highlighted a shift from post‑processing tools to agents that can autonomously manage end‑to‑end data‑pipeline steps, though many use‑cases remain unexplored.

Market Practices: Overseas vs. Domestic

In mature cloud‑native ecosystems (e.g., Snowflake, Databricks), native data‑agent services and catalog APIs are common. In China, the focus is on “ChatBI” solutions and custom pipelines, with limited adoption of tools like dbt.

Data Context as the Foundation

Effective agents require rich contextual metadata. Datu AI organizes metadata into two trees:

Catalog Tree – a physical hierarchy (database → schema → table) used for structural lookup.

Subject Tree – a business‑oriented hierarchy (domain → topic → metric) that stores metrics, reference SQL snippets, and knowledge entries.

Agents can query these trees via native list, search, and database tools to retrieve the most relevant context.
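The two‑tree layout above can be sketched as a small data structure with the kind of list and search tools an agent would call. This is an illustrative sketch, not Datu AI's actual schema; all names, node kinds, and the sample tables are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node in either metadata tree (names here are illustrative)."""
    name: str
    kind: str  # e.g. "database"/"schema"/"table" or "domain"/"topic"/"metric"
    payload: dict = field(default_factory=dict)   # reference SQL, definitions, etc.
    children: list["Node"] = field(default_factory=list)

def list_children(node: Node) -> list[str]:
    """Structural lookup: enumerate the next level, as a 'list' tool might."""
    return [c.name for c in node.children]

def search(node: Node, term: str) -> list[Node]:
    """Recursive keyword search across a tree, as a 'search' tool might."""
    hits = [node] if term.lower() in node.name.lower() else []
    for c in node.children:
        hits.extend(search(c, term))
    return hits

# Catalog Tree: physical hierarchy, database -> schema -> table
catalog = Node("sales_db", "database", children=[
    Node("public", "schema", children=[Node("orders", "table")]),
])

# Subject Tree: business hierarchy, domain -> topic -> metric
subject = Node("e-commerce", "domain", children=[
    Node("revenue", "topic", children=[
        Node("GMV", "metric",
             payload={"sql": "SELECT SUM(amount) FROM orders"}),
    ]),
])
```

Keeping the two trees separate lets structural lookups stay cheap while business search runs only over the curated Subject Tree.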

Immersive Analysis and Interaction Paradigm

AskTable’s “AI Canvas” provides a two‑dimensional workspace where users drag Excel files, CSVs, or warehouse queries. The AI then generates Python or SQL code and creates visual dashboards, allowing flexible layout adjustments. Zhao Heng introduced “Vibe Programming” (Plan Mode): the agent first produces an execution plan, the user reviews and approves critical steps, and the agent carries out the plan. This balances controllability with automation.
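The plan‑then‑approve loop can be sketched as follows. This is a minimal illustration of the interaction pattern, not AskTable's or Datu AI's actual API; the plan steps are hard‑coded where a real system would draft them with an LLM.

```python
def make_plan(question: str) -> list[str]:
    # In practice an LLM drafts the plan; hard-coded here for illustration.
    return [
        f"profile source tables for: {question}",
        "generate SQL",
        "render dashboard",
    ]

def run_with_approval(question: str, approve) -> list[str]:
    """Execute each planned step only after the user approves it."""
    executed = []
    for step in make_plan(question):
        if approve(step):          # user reviews critical steps
            executed.append(step)  # stand-in for actually running the step
    return executed

# An auto-approving reviewer; a UI would prompt the user per step instead.
done = run_with_approval("monthly GMV trend", approve=lambda step: True)
```

The `approve` callback is the control point: a cautious deployment can require human sign‑off only on steps that mutate data, auto‑approving read‑only ones.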

Architecture Philosophy: Data‑Layer Intelligence

OceanBase’s integrated database stores vectors, structured rows, and full‑text indexes together, enabling the database to serve as a native context provider for agents. Datu AI follows three design principles:

Context > Control – provide rich context rather than tightly scripted commands.

Simple & Reliable – use deterministic workflows for routine tasks and agents for flexible, exploratory tasks.

Embrace Change – adopt post‑training techniques to quickly adapt to new SQL dialects and domains.

Accuracy as the Lifeline of Data Agents

Accuracy remains the biggest challenge. Strategies discussed include:

Targeting high‑tolerance scenarios such as data development, where erroneous outputs can simply be re‑run at low cost.

Implementing feedback loops that capture user corrections and turn them into parameter‑filled SQL templates.

Applying semantic‑layer transformations to convert generated SQL into safe, parameterized statements.

Ensuring strong data‑governance to prevent downstream contamination.
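The second and third strategies above can be combined in one step: capture a user‑corrected query and lift its literals into placeholders so it becomes a safe, parameterized template. A minimal sketch, assuming simple single‑quoted strings and bare numeric literals; a production semantic layer would use a real SQL parser.

```python
import re

def to_template(sql: str) -> tuple[str, list[str]]:
    """Replace string and numeric literals with %s placeholders."""
    params: list[str] = []

    def grab(match: re.Match) -> str:
        params.append(match.group(0).strip("'"))
        return "%s"

    # Strings first, then bare numbers (order avoids touching digits
    # that sat inside string literals).
    sql = re.sub(r"'[^']*'", grab, sql)
    sql = re.sub(r"\b\d+(\.\d+)?\b", grab, sql)
    return sql, params

corrected = "SELECT SUM(amount) FROM orders WHERE region = 'EU' AND year = 2024"
template, params = to_template(corrected)
```

Stored alongside its parameters, the template can later be re‑filled for new values instead of asking the model to regenerate the whole query, which is where the accuracy gain comes from.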

Memory and Learning Mechanisms

The panel described a three‑layer memory architecture, dubbed “Power Memory”:

Short‑term dialogue memory – retains the current conversation context.

Long‑term knowledge base – stores user‑edited SQL patterns, reference queries, and business definitions (e.g., GMV calculations).

Scenario‑specific case library – keeps high‑value patterns for fast retrieval.

Both AskTable and Datu AI use feedback loops to continuously enrich these stores, allowing the agents to become more accurate over time.

Balancing Technical Ideals and Engineering Realities

The panel identified several trade‑offs:

Capability vs. response time – users want “anything” answered instantly, but complex tasks require longer processing. The solution is to expose fast‑response workflows for simple queries and reserve agents for deeper analysis.

Open‑ended vs. controllable – plan‑mode execution lets users approve a generated plan before the agent runs it, preserving flexibility while ensuring safety.

Generality vs. specialization – a generic coding agent covers many scenarios but lacks depth for data‑specific logic; a data‑focused agent provides richer domain knowledge at the cost of a narrower user base.
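The capability‑versus‑latency trade‑off above reduces to a router in front of the two execution paths. A hedged sketch: the keyword heuristic stands in for a real intent classifier, and the route labels are illustrative.

```python
# Simple, templated questions go to a fast deterministic workflow;
# open-ended questions go to the slower agent.
SIMPLE_KEYWORDS = {"count", "sum", "total", "list"}

def route(question: str) -> str:
    """Return which execution path should handle the question."""
    words = set(question.lower().split())
    return "workflow" if words & SIMPLE_KEYWORDS else "agent"
```

In production the classifier would be a small model or a match against the template library, but the shape is the same: answer what you can instantly, escalate the rest.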

AI‑Native Transformation of the Database Ecosystem

OceanBase’s vision includes built‑in intelligent interfaces:

Context APIs that answer “what does this table do?” directly from the database.

Semantic query understanding that maps natural language to optimized execution plans.

Zhao Heng outlined a three‑layer integration model for databases:

Standard Catalog API for metadata discovery.

AI‑aware SQL optimization layer that rewrites LLM‑generated queries.

Future AI‑native layer that understands semantics and recommends indexes.
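What a standard Catalog API for metadata discovery might look like from the agent's side can be sketched as below. The interface and field names are hypothetical, not an actual OceanBase or StarRocks API.

```python
from dataclasses import dataclass

@dataclass
class TableInfo:
    name: str
    columns: dict[str, str]   # column -> type
    description: str          # answers "what does this table do?"

class Catalog:
    """Hypothetical catalog endpoint an agent could query for context."""

    def __init__(self):
        self._tables: dict[str, TableInfo] = {}

    def register(self, info: TableInfo) -> None:
        self._tables[info.name] = info

    def describe(self, table: str) -> str:
        """Answer 'what does this table do?' straight from catalog metadata."""
        info = self._tables[table]
        cols = ", ".join(f"{c} {t}" for c, t in info.columns.items())
        return f"{info.name}({cols}): {info.description}"

catalog = Catalog()
catalog.register(TableInfo(
    "orders",
    {"id": "BIGINT", "amount": "DECIMAL"},
    "one row per customer order",
))
```

The point of standardizing this layer is that any agent, not just the vendor's own, can fetch table semantics without scraping DDL or documentation.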

Future Outlook

Predictions for the next two years:

AI agents will become as ubiquitous as smartphones, moving from hype to essential efficiency tools.

Human roles will shift from executing tasks to supervising agents and providing strategic direction.

By late 2026, mature governance and memory technologies are expected to trigger a V‑shaped market rebound with large‑scale commercial deployments.

Conclusion

The discussion provided a concrete roadmap for integrating AI agents into data‑engineering workflows, covering context engineering, immersive analytics, memory design, workflow‑agent trade‑offs, and database intelligence. The insights combine deep technical expertise with practical deployment experience, offering actionable guidance for organizations seeking to accelerate AI‑driven data pipelines.

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
