
How DataWorks Is Transforming Big Data Development with AI Agents

The article outlines DataWorks' evolution from a decade‑long big‑data governance platform to an AI‑driven Copilot and autonomous Agent system, detailing its technical foundations, tool‑adaptation layer, context engineering, security safeguards, and future vision of a professional, open, and intelligent big‑data development ecosystem.

DataFunSummit

Platform foundation

DataWorks is a one‑stop intelligent big‑data development and governance platform built on Alibaba Cloud ODPS (MaxCompute). After ten years of continuous development, it serves as the native data‑development and governance core for the Alibaba Cloud ecosystem. The platform has earned several industry recognitions, including IDC 2024 leadership in the Chinese big‑data and data‑governance markets, placement in the IDC MarketScape Leaders quadrant for "Data Infrastructure for Generative AI", and an "Advanced" (Level 3) certification from the China Academy of Information and Communications Technology.

Copilot era – AI‑assisted efficiency

Following the 2023 release of GPT‑4 and the rise of AI coding assistants such as GitHub Copilot, DataWorks launched its Copilot project in March 2023. After a period of internal testing, a public beta opened in October 2024. Copilot is tightly integrated with the VS Code‑based SQL editor and offers on‑demand code suggestions that can be accepted with the Tab key.

Core capabilities include:

SQL generation from natural‑language prompts

SQL completion and line‑by‑line suggestion

SQL rewrite/optimization

Error correction and syntax fixing

Explanation of generated statements

Automatic annotation of code

The model achieved first place on the Spider 2.0 NL2SQL leaderboard, demonstrating strong NL‑to‑SQL performance. Operational metrics show more than 60,000 daily active users, over 5 million generated SQL lines adopted by users, and a reported 30% boost in data‑development efficiency.

Agent era – From assistance to autonomy

Traditional big‑data development requires users to master SQL, manually construct workflows, and switch among engines such as MaxCompute, Hologres, StarRocks, and Spark. DataWorks Agent reduces this to a single natural‑language request. The system then:

Parses the user intent.

Plans the data pipeline and task dependencies.

Generates the required SQL/DDL code and assembles a production‑grade workflow.

Configures scheduling and prepares the release package.

Users only need to confirm authorization at key steps, and the resulting workflow can be executed by non‑technical users. An end‑to‑end example, automatically building a product‑sales analysis workflow, illustrates the process:

Provide a natural‑language description of the analysis goal (e.g., "show daily sales trends for each product category").

The Agent extracts the required tables, creates a layered SQL model, builds a DAG of dependent tasks, and selects the appropriate compute engine.

The Agent generates the workflow definition, sets up daily scheduling, and presents a highlighted diff for user approval before deployment.

This approach makes complex multi‑engine pipelines transparent and lowers the skill barrier for business analysts.
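
The DAG step in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the Agent's actual planner: the task names and the layered naming (ods, dwd, dws, ads) are hypothetical, and the ordering relies only on the standard‑library graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical task names for a layered sales model. Each task maps to the
# set of upstream tasks that must finish before it can run.
dependencies = {
    "ods_orders": set(),
    "ods_products": set(),
    "dwd_order_detail": {"ods_orders", "ods_products"},
    "dws_daily_category_sales": {"dwd_order_detail"},
    "ads_sales_trend_report": {"dws_daily_category_sales"},
}


def execution_order(deps):
    """Return a run order in which every task follows its upstream tasks."""
    # static_order() raises graphlib.CycleError if the graph contains a cycle,
    # which is a useful sanity check on a generated workflow.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A scheduler would run the tasks in this order once per day; the two source‑layer tasks have no mutual dependency, so their relative position is not significant.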

Technical foundations – Building a professional, trustworthy, open Agent ecosystem

1. Platform base – VS Code as the core IDE

DataWorks adopts VS Code as the primary development environment, offering:

Notebook and Python support for data‑science and algorithm development.

A rich enterprise‑grade plugin ecosystem that works both online and offline.

Deep integration with Alibaba Cloud PAI and other AI platforms, enabling seamless model training and inference.

Compatibility with multiple data sources (MaxCompute, Hologres, Paimon, etc.).

2. Tool‑adaptation layer – Semantic mapping + engineering transformation

To let large models operate a platform with thousands of APIs, DataWorks decouples raw APIs from the Agent:

Schema layer: defines concise, semantically clear tool interfaces (name, parameters, description) that are easy for LLMs to understand.

Execution layer: implements authentication, compatibility handling, error management, and actual API calls.

MCP (Model Context Protocol) server: provides a unified protocol for tool registration and invocation, allowing multiple agents to share a common backend.
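
The separation between the schema layer and the execution layer can be illustrated with a small plain‑Python sketch. It is an analogy only: it does not use an MCP SDK or any real DataWorks API, and ToolSchema, ToolRegistry, and create_table_stub are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ToolSchema:
    """Schema layer: the concise description a model sees (hypothetical shape)."""
    name: str
    description: str
    parameters: Dict[str, str]  # parameter name -> short description


class ToolRegistry:
    """Execution layer: validation, an auth check, error handling, the call."""

    def __init__(self):
        self._tools = {}

    def register(self, schema, handler):
        self._tools[schema.name] = (schema, handler)

    def list_schemas(self):
        # Only schemas are exposed to the model, never the handlers.
        return [schema for schema, _ in self._tools.values()]

    def invoke(self, name, arguments, user_token):
        if name not in self._tools:
            return {"ok": False, "error": f"unknown tool: {name}"}
        schema, handler = self._tools[name]
        missing = sorted(set(schema.parameters) - set(arguments))
        if missing:
            return {"ok": False, "error": f"missing parameters: {missing}"}
        if not user_token:
            return {"ok": False, "error": "not authenticated"}
        try:
            return {"ok": True, "result": handler(**arguments)}
        except Exception as exc:  # surface failures as data, not stack traces
            return {"ok": False, "error": str(exc)}


def create_table_stub(table_name, columns):
    # Stand-in for a real platform API call; it only builds a DDL string.
    return f"CREATE TABLE {table_name} ({columns})"


registry = ToolRegistry()
registry.register(
    ToolSchema(
        name="create_table",
        description="Create a table from a column specification.",
        parameters={"table_name": "target table", "columns": "column DDL"},
    ),
    create_table_stub,
)
result = registry.invoke(
    "create_table",
    {"table_name": "dws_sales", "columns": "id BIGINT"},
    user_token="demo-token",
)
print(result)
```

Returning errors as structured results rather than raising lets the calling agent read the failure and retry with corrected arguments.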

3. Context engineering – Multi‑dimensional context awareness

DataWorks collects six core environment signals (VS Code for Web, user identity, workspace, engine language, timestamp, system rules) and, optionally, seven additional context items (current code context, data catalog, documentation, business rules, etc.). Grounding each request in this context supports accurate intent interpretation and precise code generation, and token‑level response times are reported to stay under 1.5 seconds even after 80 dialogue rounds.
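
One way to picture the split between core signals and optional context items is a budgeted assembly step. The sketch below is an assumption, not the DataWorks implementation: build_context is a hypothetical helper, the field names and values are made up, and the budget is counted in characters as a rough stand‑in for tokens.

```python
def build_context(core, optional, budget_chars):
    """Always include core signals; add optional items while the budget allows.

    `optional` is a list of (name, text) pairs ordered from highest to
    lowest priority. Items that would exceed the budget are skipped.
    """
    parts = [f"{key}: {value}" for key, value in core.items()]
    used = sum(len(part) for part in parts)
    for key, value in optional:
        entry = f"{key}: {value}"
        if used + len(entry) > budget_chars:
            continue  # too large for the remaining budget, try the next item
        parts.append(entry)
        used += len(entry)
    return "\n".join(parts)


# Hypothetical values, for illustration only.
core_signals = {
    "user": "alice",
    "workspace": "sales_dev",
    "engine": "MaxCompute SQL",
}
optional_items = [
    ("current_code", "SELECT 1"),
    ("documentation", "x" * 5000),  # oversized item that should be skipped
    ("business_rules", "amount is in CNY"),
]
context = build_context(core_signals, optional_items, budget_chars=200)
print(context)
```

Keeping the prompt bounded in this way is one plausible reason latency can stay flat as a conversation grows, since the context size no longer scales with the number of rounds.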

4. Safety red line – Human‑in‑the‑Loop governance

All Agent actions are bound to the current user’s login state, preventing unauthorized access. Production‑grade code changes generate a highlighted diff that must be manually approved before deployment. High‑risk operations such as publishing to production also require explicit user confirmation, guaranteeing a strict “Human‑in‑the‑Loop” safety barrier.
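
The approval gate described above can be sketched with the standard‑library difflib module. This is a minimal illustration under stated assumptions: deploy_with_approval, the approve callback, and the deploy callback are hypothetical stand‑ins for the platform's review UI and release step.

```python
import difflib


def render_diff(old_sql, new_sql, name="task.sql"):
    """Return a unified diff of the current and proposed code as one string."""
    lines = difflib.unified_diff(
        old_sql.splitlines(),
        new_sql.splitlines(),
        fromfile=f"{name} (current)",
        tofile=f"{name} (proposed)",
        lineterm="",
    )
    return "\n".join(lines)


def deploy_with_approval(old_sql, new_sql, approve, deploy):
    """Deploy only if a human-supplied `approve` callback accepts the diff."""
    diff = render_diff(old_sql, new_sql)
    if not diff:
        return "no changes"
    if not approve(diff):
        return "rejected"
    deploy(new_sql)
    return "deployed"


current = "SELECT category, SUM(amount)\nFROM orders\nGROUP BY category"
proposed = "SELECT category, ds, SUM(amount)\nFROM orders\nGROUP BY category, ds"

deployed = []  # records what the hypothetical deploy step received
status_rejected = deploy_with_approval(current, proposed, lambda d: False, deployed.append)
status_approved = deploy_with_approval(current, proposed, lambda d: True, deployed.append)
print(status_rejected, status_approved)
```

In a real system the approve callback would block on an explicit user action; the point of the sketch is that the deploy step is unreachable without it.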

Future outlook – A professional, open, intelligent big‑data development platform

DataWorks aims to embed expert big‑data knowledge into Agent decision‑making, expose the standardized MCP tool protocol for A2A (Agent‑to‑Agent) collaboration, and support customizable Sub‑Agents for domain‑specific extensions. The ultimate goal is a “requirement‑as‑code” experience in which a single natural‑language description drives the entire data‑development lifecycle, from data ingestion and processing to monitoring and operations, so that big‑data development becomes simpler and more efficient.
