
Build a Scalable AI Data Pipeline Using DataWorks, MaxCompute & MaxFrame

This guide walks you through setting up a secure, elastic, and high‑performance AI data processing platform on Alibaba Cloud by combining DataWorks, MaxCompute, and MaxFrame. It covers the four essential setup steps, code examples, best‑practice tips, and common troubleshooting advice.


Solution Overview

Combining Alibaba Cloud DataWorks, MaxCompute, and MaxFrame provides a managed, secure, and scalable environment for AI data preparation, processing, and model training.

Component Roles

DataWorks: visual development, task scheduling, data governance, Notebook support, and a unified entry point for collaborative projects.

MaxCompute: petabyte‑scale storage and compute with high reliability at low cost; serves as the data foundation.

MaxFrame: a pandas‑compatible distributed Python framework for processing massive datasets.

Key Advantages

Fully managed services; no need to build or maintain clusters.

Jupyter‑style Notebook with magic commands for quickly connecting to compute resources.

Built‑in security via RAM permissions, VPC isolation, and data encryption.

Pay‑as‑you‑go or subscription pricing.

Four‑Step Environment Setup

Step 1: Create a MaxCompute project

Select a region that matches your business (e.g., China East 2 – Shanghai).

Use the pay‑as‑you‑go billing mode (new users receive free quota).

Name the project meaningfully, e.g., ai_dedup_01.

Step 2: Create a DataWorks workspace

Choose the basic edition (free Notebook).

Create a general‑purpose resource group (pay‑as‑you‑go) and bind a VPC if you need to access OSS, PAI, or other internal services.

Assign a workspace administrator and add team members (RAM sub‑accounts are supported).

Enable Data Studio for an improved Notebook experience.

Step 3: Bind compute resources

In the workspace management page, bind the MaxCompute project created in step 1.

Select an appropriate resource group for task scheduling and Notebook execution.

Test connectivity to ensure permissions and network configuration are correct.
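One way to sanity‑check the binding is to run a trivial query from a Notebook cell. A minimal sketch using the PyODPS SDK — the endpoint URL and project name below are illustrative for the Shanghai region, and in a DataWorks Notebook the %maxframe magic command handles authentication for you, so explicit credentials are only needed outside that environment:

```python
import os
from odps import ODPS  # PyODPS SDK

# Illustrative values: replace the project name and endpoint with your own.
# Credentials are read from the environment, never hard-coded.
o = ODPS(
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
    project="ai_dedup_01",
    endpoint="https://service.cn-shanghai.maxcompute.aliyun.com/api",
)

# A trivial query: if this returns a row, permissions and network are correct.
with o.execute_sql("SELECT 1;").open_reader() as reader:
    for record in reader:
        print(record)
```

If this fails, check that the RAM user has been added to the MaxCompute project and that the resource group's VPC can reach the endpoint.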

Step 4: Launch a personal development instance

Create a personal development instance in Data Studio.

Choose CPU specifications (e.g., 4 vCPU / 16 GiB) and a pre‑installed image such as dataworks-maxcompute:py3.11-ubuntu20.04-202504-1.

After the instance starts, connect to MaxFrame directly from the Notebook for distributed computation.

Instances are billed by CU·hour; stop them when not in use.

Develop with MaxFrame

The following example shows Pandas‑style code that runs on billions of rows using MaxFrame.

import maxframe.dataframe as md
import pyarrow as pa
import pandas as pd
from maxframe.lib.dtypes_extension import dict_

# Initialize MaxFrame session (Magic Command auto‑connects MaxCompute)
mf_session = %maxframe

# Build a DataFrame (real data can come from MaxCompute tables)
col_a = pd.Series(
    data=[[('k1', 1), ('k2', 2)], [('k1', 3)], None],
    index=[1, 2, 3],
    dtype=dict_(pa.string(), pa.int64())
)
col_b = pd.Series(data=["A", "B", "C"], index=[1, 2, 3])

df = md.DataFrame({"A": col_a, "B": col_b})
df.execute()

# Custom function applied to each chunk
def custom_set_item(df):
    for name, value in df["A"].items():
        if value is not None:
            df["A"][name]["x"] = 100
    return df

# Distributed apply_chunk execution
result_df = df.mf.apply_chunk(
    custom_set_item,
    output_type="dataframe",
    dtypes=df.dtypes.copy(),
    batch_rows=2,
    skip_infer=True,
    index=df.index,
).execute().fetch()

print(result_df)
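Because MaxFrame mirrors the pandas API, the chunk function above can be prototyped locally on a small sample with plain pandas before scaling it out. A minimal local sketch using ordinary Python dicts (object dtype) in place of the Arrow‑backed dict_ type:

```python
import pandas as pd

# Same chunk function as above: add key "x" to every non-null dict in column A.
def custom_set_item(df):
    for name, value in df["A"].items():
        if value is not None:
            value["x"] = 100
    return df

# Small local sample mirroring the MaxFrame example's shape.
df = pd.DataFrame({
    "A": pd.Series([{"k1": 1, "k2": 2}, {"k1": 3}, None], index=[1, 2, 3]),
    "B": pd.Series(["A", "B", "C"], index=[1, 2, 3]),
})

result = custom_set_item(df)
print(result.loc[1, "A"])  # {'k1': 1, 'k2': 2, 'x': 100}
```

Once the logic is correct locally, the same function can be handed to mf.apply_chunk to run distributed over the full dataset.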

Highlighted Features

The %maxframe magic command connects to compute resources without exposing your AccessKey.

Logview links are included in the output, allowing one‑click inspection of job DAG, duration, and failure reasons.

Results can be written back to MaxCompute tables or exported to OSS, enabling seamless downstream model training.
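The write‑back path can be sketched with MaxFrame's table I/O helpers; the table names below are illustrative, and the transformation is a simple deduplication as in this article's example project:

```python
import maxframe.dataframe as md

# Read an existing MaxCompute table as a MaxFrame DataFrame
# (table name is illustrative).
df = md.read_odps_table("raw_events")

# Pandas-style transformation, executed distributed on MaxCompute.
cleaned = df.drop_duplicates()

# Write the result back to a MaxCompute table for downstream model training.
cleaned.to_odps_table("cleaned_events").execute()
```

This runs inside an active MaxFrame session (for example, one created via %maxframe), so no extra credentials are needed in the Notebook.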

Best‑Practice Recommendations

Leverage Logview 2.0

Each execution generates a visual job trace link for quick performance debugging.

Configure resource quotas wisely

Set options.session.quota_name to choose between post‑paid or pre‑paid quotas based on business needs.
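As a configuration fragment (the quota name is illustrative):

```python
from maxframe import options

# Route this session's jobs to a specific MaxCompute quota,
# e.g. a pre-paid (subscription) quota for predictable workloads.
options.session.quota_name = "my_prepaid_quota"
```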

Centralize sensitive information

Store AK/SK or database passwords in DataWorks workspace parameters and reference them in code via ${workspace.parameter_name} to avoid plaintext exposure.
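The same principle in plain Python: resolve secrets at runtime from workspace parameters or environment variables instead of embedding them in code. A minimal sketch — the variable names are illustrative:

```python
import os

def load_credentials():
    """Fetch credentials from the environment (variable names are illustrative)."""
    ak = os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_ID")
    sk = os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET")
    if not ak or not sk:
        raise RuntimeError(
            "Credentials not set; configure them as workspace parameters."
        )
    return ak, sk
```

Code that needs credentials calls load_credentials() instead of carrying plaintext keys, so rotating a key never requires a code change.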

Use DataMap for metadata governance

DataWorks automatically syncs MaxCompute table schemas, supporting lineage analysis, table preview, and lifecycle management.

Common Troubleshooting

Q: Notebook cannot find a MaxCompute table? A: Ensure the MaxCompute project is bound to the current DataWorks workspace, verify read permissions, and refresh metadata in DataMap.

Q: Unable to read/write OSS data? A: Confirm the RAM user has the appropriate bucket permissions and that the development instance and resource group are in the same VPC (or have public‑network access configured).

Further Reading

DataWorks Notebook development guide: https://help.aliyun.com/zh/dataworks/user-guide/notebook?spm=a2c4g.11186623.0.i4

MaxFrame official documentation: https://help.aliyun.com/zh/maxcompute/user-guide/maxframe-overview-1/?spm=a2c4g.11186623.help-menu-27797.d_2_5_2.35dc79994Wo1OE

MaxCompute free‑quota for new users: https://help.aliyun.com/zh/maxcompute/product-overview/free-quota-for-new-users?spm=a2c4g.11186623.0.0.130712f8Tf4uBB

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
