Build a Scalable AI Data Pipeline Using DataWorks, MaxCompute & MaxFrame
This guide walks you through setting up a secure, elastic, and high-performance AI data processing platform on Alibaba Cloud by combining DataWorks, MaxCompute, and MaxFrame. It covers the four essential setup steps, code examples, best-practice tips, and common troubleshooting advice.
Solution Overview
Combining Alibaba Cloud DataWorks, MaxCompute, and MaxFrame provides a managed, secure, and scalable environment for AI data preparation, processing, and model training.
Component Roles
DataWorks: visual development, task scheduling, data governance, Notebook support, and a unified entry point for collaborative projects.
MaxCompute: petabyte-scale storage and compute with high reliability and low cost; serves as the data foundation.
MaxFrame: a Pandas-compatible distributed Python framework for processing massive datasets.
Key Advantages
Fully managed services; no need to build or maintain clusters.
Jupyter‑style Notebook with Magic Command for quick connection to compute resources.
Built‑in security via RAM permissions, VPC isolation, and data encryption.
Pay‑as‑you‑go or subscription pricing.
Four‑Step Environment Setup
Create a MaxCompute project
Select a region that matches your business (e.g., China East 2 – Shanghai).
Use the pay‑as‑you‑go billing mode (new users receive free quota).
Name the project meaningfully, e.g., ai_dedup_01.
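Once the project exists, you can verify it is reachable from Python with the PyODPS SDK. The sketch below is illustrative, not part of the console workflow: it assumes PyODPS is installed (pip install pyodps), credentials are supplied via environment variables, and uses the public MaxCompute endpoint for China East 2 (Shanghai) — check the console for your region's endpoint. The table name is a placeholder.

```python
import os

from odps import ODPS  # PyODPS SDK

# Credentials come from environment variables rather than plaintext in code
o = ODPS(
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
    project="ai_dedup_01",
    endpoint="https://service.cn-shanghai.maxcompute.aliyun.com/api",
)
# A lightweight call that confirms the project is reachable and readable
print(o.exist_table("some_table"))
```

This requires live cloud credentials, so treat it as a connectivity check rather than part of the pipeline itself.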
Create a DataWorks workspace
Choose the basic edition (free Notebook).
Create a general‑purpose resource group (pay‑as‑you‑go) and bind a VPC if you need to access OSS, PAI, or other internal services.
Assign a workspace administrator and add team members (RAM sub‑accounts are supported).
Enable Data Studio for an improved Notebook experience.
Bind compute resources
In the workspace management page, bind the MaxCompute project created in step 1.
Select an appropriate resource group for task scheduling and Notebook execution.
Test connectivity to ensure permissions and network configuration are correct.
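A quick way to confirm the binding works end to end is to run a trivial job from a Notebook cell. A minimal sketch, assuming the %maxframe magic described later is available in your development instance:

```python
import maxframe.dataframe as md

# In a DataWorks Notebook cell, the session is created by the magic command:
#   mf_session = %maxframe
# A trivial job fails fast if the binding, permissions, or network routing
# are misconfigured:
md.DataFrame({"ok": [1]}).execute()
```

If this cell completes, the workspace, resource group, and MaxCompute project are wired together correctly.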
Launch a personal development instance
Create a personal development instance in Data Studio.
Choose CPU specifications (e.g., 4 vCPU / 16 GiB) and a pre‑installed image such as dataworks-maxcompute:py3.11-ubuntu20.04-202504-1.
After the instance starts, connect to MaxFrame directly from the Notebook for distributed computation.
Instances are billed by CU·hour; stop them when not in use.
Develop with MaxFrame
The following example shows Pandas-style code that MaxFrame can scale to billions of rows.
import maxframe.dataframe as md
import pyarrow as pa
import pandas as pd
from maxframe.lib.dtypes_extension import dict_

# Initialize MaxFrame session (the Magic Command auto-connects to MaxCompute)
mf_session = %maxframe

# Build a DataFrame (real data can come from MaxCompute tables)
col_a = pd.Series(
    data=[[('k1', 1), ('k2', 2)], [('k1', 3)], None],
    index=[1, 2, 3],
    dtype=dict_(pa.string(), pa.int64()),
)
col_b = pd.Series(data=["A", "B", "C"], index=[1, 2, 3])
df = md.DataFrame({"A": col_a, "B": col_b})
df.execute()

# Custom function applied to each chunk
def custom_set_item(df):
    for name, value in df["A"].items():
        if value is not None:
            df["A"][name]["x"] = 100
    return df

# Distributed apply_chunk execution
result_df = df.mf.apply_chunk(
    custom_set_item,
    output_type="dataframe",
    dtypes=df.dtypes.copy(),
    batch_rows=2,
    skip_infer=True,
    index=df.index,
).execute().fetch()

print(result_df)

Highlighted Features
Magic Command %maxframe connects to compute resources without exposing AccessKey.
Logview links are included in the output, allowing one‑click inspection of job DAG, duration, and failure reasons.
Results can be written back to MaxCompute tables or exported to OSS, enabling seamless downstream model training.
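Because custom_set_item receives an ordinary pandas DataFrame for each chunk, its logic can be sanity-checked locally before scaling out. The following dependency-free sketch mimics one chunk using plain dicts (the shape is illustrative, not the real pandas API):

```python
def custom_set_item(chunk):
    # chunk["A"] maps row index -> dict (or None), mirroring the map-typed column
    for name, value in chunk["A"].items():
        if value is not None:
            value["x"] = 100  # add key "x" to every non-null map value
    return chunk

chunk = {
    "A": {1: {"k1": 1, "k2": 2}, 2: {"k1": 3}, 3: None},
    "B": {1: "A", 2: "B", 3: "C"},
}
result = custom_set_item(chunk)
print(result["A"][1])  # {'k1': 1, 'k2': 2, 'x': 100}
print(result["A"][3])  # None
```

None values pass through untouched, just as the null row does in the distributed version.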
Best‑Practice Recommendations
Leverage Logview 2.0
Each execution generates a visual job trace link for quick performance debugging.
Configure resource quotas wisely
Set options.session.quota_name to choose between post‑paid or pre‑paid quotas based on business needs.
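A minimal configuration sketch (the quota name is a placeholder, and the options import path follows the MaxFrame convention — verify it against your installed version):

```python
from maxframe import options

# Route MaxFrame jobs to a pre-paid (subscription) quota instead of the
# default post-paid quota; "my_subscription_quota" is illustrative.
options.session.quota_name = "my_subscription_quota"
```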
Centralize sensitive information
Store AK/SK or database passwords in DataWorks workspace parameters and reference them in code via ${workspace.parameter_name} to avoid plaintext exposure.
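For example, with a hypothetical workspace parameter named oss_access_key (DataWorks substitutes the placeholder before execution, so the secret never appears in source code):

```python
# Resolved by DataWorks at run time; oss_access_key is an illustrative
# workspace parameter name, not a real one from this guide.
oss_ak = "${workspace.oss_access_key}"
```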
Use DataMap for metadata governance
DataWorks automatically syncs MaxCompute table schemas, supporting lineage analysis, table preview, and lifecycle management.
Common Troubleshooting
Q: Notebook cannot find a MaxCompute table? A: Ensure the MaxCompute project is bound to the current DataWorks workspace, verify read permissions, and refresh metadata in DataMap.
Q: Unable to read/write OSS data? A: Confirm the RAM user has the appropriate bucket permissions and that the development instance and resource group are in the same VPC (or have public‑network access configured).
Further Reading
DataWorks Notebook development guide: https://help.aliyun.com/zh/dataworks/user-guide/notebook?spm=a2c4g.11186623.0.i4
MaxFrame official documentation: https://help.aliyun.com/zh/maxcompute/user-guide/maxframe-overview-1/?spm=a2c4g.11186623.help-menu-27797.d_2_5_2.35dc79994Wo1OE
MaxCompute free‑quota for new users: https://help.aliyun.com/zh/maxcompute/product-overview/free-quota-for-new-users?spm=a2c4g.11186623.0.0.130712f8Tf4uBB
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
