DataWorks Data Agent Powers Taobao Live’s Shift from ETL to Fully Managed Data Development
The article analyzes how Taobao Live’s data team replaced traditional ETL bottlenecks with an AI‑native, three‑layer architecture built on DataWorks Data Agent, adopting NL2DSL2SQL, multi‑agent orchestration, and standardized pipelines to achieve near‑full automation and higher accuracy.
Traditional ETL ran into two bottlenecks, a "human efficiency ceiling" and "experience islands", both of which limited scalability. Introducing large models did not remove these limits but shifted them in four ways: accuracy became the new red line, the knowledge base became required infrastructure, data consumption extended from static reports to direct calls by AI, and collaboration moved from serial handoffs to parallel multi‑agent workflows.
Long‑term principle : constrain AI uncertainty with engineering determinism. The team built a three‑layer architecture:
Infrastructure layer : ensures AI is "well fed" with CDM, knowledge base, and engineering platform.
AI capability layer : ensures AI is "controllable" via Skill, SDD, and AI Coding.
Application layer : ensures AI is "well used" through multi‑agent collaboration and AI‑native central/local execution.
Core technical decision : instead of the common NL2SQL approach, the team chose NL2DSL2SQL. By inserting a domain‑specific language (DSL) expressed as JSON between natural language and SQL, they gained three benefits: segment‑level debugging, pre‑audit, and reduced verification cost. When the generated DSL violates platform rules, the system returns an error, locates the issue, and triggers AI self‑correction (Error → Locate → AI Self‑correct). The DSL thus serves as an explicit, auditable intermediate layer.
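The validate-then-compile loop can be sketched as follows. This is a minimal illustration, not the team's implementation: the JSON shape (`table`, `metrics`, `dimensions`), the `ALLOWED` rule set, the table name `dwd_live_room_di`, and the `llm` callable are all assumptions, since the source does not describe the actual DSL schema or platform rules.

```python
import json

# Hypothetical platform rules: registered tables with their metrics and dimensions.
ALLOWED = {
    "dwd_live_room_di": {
        "metrics": {"gmv": "SUM(pay_amt)", "viewers": "COUNT(DISTINCT user_id)"},
        "dimensions": {"host_id", "ds"},
    }
}


def validate(dsl: dict) -> list:
    """Pre-audit the DSL; return located errors (empty list means it passed)."""
    table = ALLOWED.get(dsl.get("table"))
    if table is None:
        return [f"table: unknown table {dsl.get('table')!r}"]
    errors = []
    for i, metric in enumerate(dsl.get("metrics", [])):
        if metric not in table["metrics"]:
            errors.append(f"metrics[{i}]: unknown metric {metric!r}")
    for i, dim in enumerate(dsl.get("dimensions", [])):
        if dim not in table["dimensions"]:
            errors.append(f"dimensions[{i}]: unknown dimension {dim!r}")
    return errors


def to_sql(dsl: dict) -> str:
    """Deterministically compile a validated DSL into SQL."""
    table = ALLOWED[dsl["table"]]
    dims = list(dsl.get("dimensions", []))
    cols = dims + [f"{table['metrics'][m]} AS {m}" for m in dsl["metrics"]]
    sql = f"SELECT {', '.join(cols)} FROM {dsl['table']}"
    if dims:
        sql += f" GROUP BY {', '.join(dims)}"
    return sql


def generate(request: str, llm, max_rounds: int = 3) -> str:
    """Error -> Locate -> AI self-correct: feed located errors back to the model."""
    feedback = []
    for _ in range(max_rounds):
        dsl = json.loads(llm(request, feedback))
        feedback = validate(dsl)
        if not feedback:
            return to_sql(dsl)
    raise ValueError(f"DSL still invalid after {max_rounds} rounds: {feedback}")


def fake_llm(request, feedback):
    """Stand-in for a model: wrong metric name first, corrected once it sees feedback."""
    metric = "gmv" if feedback else "gmv_total"
    return json.dumps(
        {"table": "dwd_live_room_di", "metrics": [metric], "dimensions": ["host_id"]}
    )


print(generate("GMV per host", fake_llm))
```

Because only the DSL is model-generated and the SQL is compiled by fixed code, each error message points at one DSL segment (for example `metrics[0]`), which is what makes segment-level debugging and pre-audit possible.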
Architecture base : DataWorks Data Agent serves as the core, anchored by two runtime components:
Ontology : a symbolic layer that defines entities, relationships, and attributes (e.g., a global semantic for "host ID") to give AI an unambiguous knowledge reference.
Harness : the execution layer that schedules AI capabilities and manages context, ensuring session information is preserved across long‑chain tasks.
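To make the division of labor concrete, here is a small sketch under stated assumptions: the `Entity`, `Ontology`, and `Harness` classes, the aliases, and the step functions are hypothetical names for illustration and do not reflect the DataWorks Data Agent API. Relationships between entities are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str                 # canonical name, e.g. "host_id"
    aliases: tuple = ()       # other spellings that mean the same thing
    attributes: tuple = ()    # e.g. data type


@dataclass
class Ontology:
    entities: dict = field(default_factory=dict)

    def add(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def resolve(self, term: str) -> str:
        """Map a free-form term to its single canonical entity name."""
        key = term.strip().lower().replace(" ", "_")
        for entity in self.entities.values():
            if key == entity.name or key in entity.aliases:
                return entity.name
        raise KeyError(f"no entity for term {term!r}")


@dataclass
class Harness:
    ontology: Ontology
    context: dict = field(default_factory=dict)

    def run(self, steps, term: str) -> dict:
        """Run steps in order; every step sees the results of all earlier steps."""
        self.context["entity"] = self.ontology.resolve(term)
        for step in steps:
            self.context[step.__name__] = step(self.context)
        return self.context


def pick_table(ctx):
    return f"dim_{ctx['entity']}"


def draft_query(ctx):
    return f"SELECT {ctx['entity']} FROM {ctx['pick_table']}"


onto = Ontology()
onto.add(Entity("host_id", aliases=("anchor_id", "streamer_id"), attributes=("bigint",)))
ctx = Harness(onto).run([pick_table, draft_query], "Anchor ID")
print(ctx["draft_query"])
```

The ontology removes ambiguity before any step runs, and the shared `context` dictionary stands in for the session state the Harness preserves across a long chain of tasks.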
R&D paradigm reconstruction : the delivery process was split into two phases. The clarification phase builds AI‑friendly technical solutions with manual checks; the execution phase lets AI drive the entire implementation while humans perform key acceptance. Six standardized steps were defined: abstracting the technical solution, model construction (solidified metric‑SQL templates), fully automated code generation (prohibiting manual edits), online key‑information extraction (auto‑recognition of primary/foreign/distribution keys), automatic DDL changes, and monitoring (R&D SOP + DQC). This forms an end‑to‑end automated pipeline.
Multi‑agent collaboration creates a digital factory with clear roles: a planning agent decomposes requirements, functional agents execute tasks, and a reporting agent aggregates output. Human confirmation is retained at critical nodes, allowing humans to focus on high‑value decisions (data modeling, metric definitions, complex business translation) while AI handles deterministic execution.
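The role split described above could look roughly like this. It is a toy sketch: the three agent functions are placeholders for model-backed agents, and splitting a requirement on semicolons is an assumption made only to keep the example runnable.

```python
from concurrent.futures import ThreadPoolExecutor


def planning_agent(requirement: str) -> list:
    """Decompose a requirement into tasks (here: naive split on ';')."""
    return [part.strip() for part in requirement.split(";") if part.strip()]


def functional_agent(task: str) -> dict:
    """Execute one task; a real agent would generate code or run checks."""
    return {"task": task, "status": "done"}


def reporting_agent(results: list) -> str:
    """Aggregate the outputs of all functional agents into one report."""
    return "\n".join(f"{r['task']}: {r['status']}" for r in results)


def run_factory(requirement: str, confirm) -> str:
    tasks = planning_agent(requirement)
    # Human confirmation at a critical node: the plan is reviewed before execution.
    if not confirm(tasks):
        return "plan rejected by reviewer"
    # Functional agents run in parallel; pool.map keeps results in task order.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(functional_agent, tasks))
    return reporting_agent(results)


print(run_factory("build model; generate SQL", confirm=lambda tasks: True))
```

The `confirm` callback is where a human reviewer would approve or reject the plan, so the reviewer's attention goes to the decomposition rather than to each generated artifact.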
Dual‑track evolution : a centralized AI‑native stack provides standardization and cross‑team collaboration, while a localized AI‑native stack introduces persistent memory per agent, enabling retrieval of historical specs for personalized, high‑frequency iteration. Both tracks coexist to balance standardization and customization.
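As an illustration of per-agent persistent memory, the sketch below stores past specs in a local JSON file and retrieves them by word overlap. The `AgentMemory` class, the file layout, and the scoring rule are assumptions; the source does not say how the localized stack stores or ranks historical specs.

```python
import json
from pathlib import Path


class AgentMemory:
    """Per-agent spec store that survives across sessions via a JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        self.specs = json.loads(self.path.read_text()) if self.path.exists() else []

    def remember(self, title: str, body: str) -> None:
        self.specs.append({"title": title, "body": body})
        self.path.write_text(json.dumps(self.specs))

    def recall(self, query: str, top_k: int = 1) -> list:
        """Return up to top_k specs sharing the most words with the query."""
        words = set(query.lower().split())
        scored = []
        for spec in self.specs:
            text = (spec["title"] + " " + spec["body"]).lower().split()
            overlap = len(words & set(text))
            if overlap > 0:
                scored.append((overlap, spec))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [spec for _, spec in scored[:top_k]]
```

A production system would more likely use embedding-based retrieval; word overlap is used here only to keep the example dependency-free.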
Capability layer packaging :
Skill : a zero‑cost reusable asset derived from semantic layer assets and Agent transformations.
SDD (Spec‑Driven Development) : each stage produces a standard Spec file documenting requirements, resource gaps, model design, and I/O, ensuring context preservation, full traceability, pre‑review, and template execution.
AI Coding : AI manages context, traceability, and template execution, while humans retain responsibility for data modeling decisions and complex translations.
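A Spec file of the kind SDD describes might be modeled as below. The field names follow the four items listed above (requirements, resource gaps, model design, I/O); the `Spec` class itself, the `review` rules, and the sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Spec:
    stage: str
    requirements: list
    resource_gaps: list = field(default_factory=list)
    model_design: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    def review(self) -> list:
        """Pre-review: list the problems that should block execution."""
        problems = []
        if not self.requirements:
            problems.append("requirements is empty")
        if not self.outputs:
            problems.append("outputs is empty")
        if self.resource_gaps:
            problems.append(f"unresolved resource gaps: {self.resource_gaps}")
        return problems

    def to_json(self) -> str:
        """Serialize for traceability; each stage would persist its own file."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


spec = Spec(
    stage="model_design",
    requirements=["daily GMV per host"],
    model_design={"grain": ["host_id", "ds"]},
    inputs=["dwd_live_order"],
    outputs=["ads_host_gmv_1d"],
)
print(spec.review())
print(spec.to_json())
```

Keeping the spec as structured data rather than free text is what allows it to be reviewed before execution and replayed as a template afterwards.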
Code generation follows two parallel paths:
DSL path (NL → DSL → SQL): relies on a self‑built semantic layer, supports unit testing and engineering checks, covers >70% of code volume, and suits standardized long‑term requirements.
DataWorks Data Agent path : weakly depends on the semantic layer, only needs data sources, pseudo‑code, and CTE references, suitable for ad‑hoc queries and historical task modifications.
With standardized input, accuracy rose from 50% to 80%, achieving near‑100% code‑generation penetration and high‑quality delivery within 24 hours.
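One way to picture the choice between the two paths is a small router. The routing rule below is an assumption inferred from the description (standardized requirements covered by the semantic layer go to the DSL path, ad‑hoc queries and modifications of existing tasks go to the Data Agent path); the `Request` class and the metric names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical set of metrics already defined in the self-built semantic layer.
SEMANTIC_LAYER_METRICS = {"gmv", "viewers"}


@dataclass
class Request:
    metrics: list = field(default_factory=list)
    ad_hoc: bool = False
    modifies_existing_task: bool = False


def choose_path(req: Request) -> str:
    """Return 'dsl' or 'data_agent' for a given request."""
    if req.ad_hoc or req.modifies_existing_task:
        return "data_agent"
    # The DSL path needs every metric to be known to the semantic layer.
    if req.metrics and all(m in SEMANTIC_LAYER_METRICS for m in req.metrics):
        return "dsl"
    return "data_agent"


print(choose_path(Request(metrics=["gmv"])))
print(choose_path(Request(metrics=["gmv"], ad_hoc=True)))
```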
Infrastructure foundation :
OneData system : goes beyond traditional data‑warehouse layering, emphasizing AI‑native precise delivery; tables, fields, comments, enums, and code structures must be highly accurate and continuously refreshed.
Knowledge dual engine : LightRAG builds a graph‑based network of warehouse entity relationships so that AI can find the right tables, while a business wiki stores domain jargon, metric definitions, and enum dictionaries so that AI can understand the language business users actually speak.
Engineering platform : a self‑built live‑data service platform that packages ontology management, DSL validation, and Harness control as APIs, enabling dynamic orchestration via AI‑native methods.
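The two knowledge engines answer different questions, which the following sketch tries to show. It deliberately does not use LightRAG's API: a plain adjacency dictionary stands in for the entity graph, and a dictionary stands in for the business wiki. All table names and glossary entries are invented for illustration.

```python
from collections import deque

# Stand-in for the business wiki: jargon and metric definitions.
WIKI = {
    "gmv": "gross merchandise value: sum of paid order amounts",
    "host": "the streamer who runs a live room",
}

# Stand-in for the warehouse entity graph: which tables can be joined directly.
GRAPH = {
    "dim_host": ["dwd_live_order"],
    "dwd_live_order": ["dim_host", "dim_item"],
    "dim_item": ["dwd_live_order"],
}


def explain(term: str):
    """'AI understands human language': look up a business term."""
    return WIKI.get(term.lower())


def join_path(src: str, dst: str):
    """'AI can find tables': shortest join path via breadth-first search."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in GRAPH.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(explain("GMV"))
print(join_path("dim_host", "dim_item"))
```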
Future evolution : the current Requirement‑to‑Code (R2C) pipeline is highly automated, but the next battle shifts from faster coding to smarter data consumption. The team plans to launch ChatBI, allowing AI to converse with business users, perform attribution analysis, and assist decision‑making, thus extending the paradigm from data development to real‑time data consumption.
For further technical details, see the DataWorks Data Agent product page and official documentation:
https://www.aliyun.com/product/dataworks/dataagent
https://help.aliyun.com/zh/dataworks/user-guide/new-data-agent
https://dataworks.data.aliyun.com/product/agent
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
