Unlock Massive Data with AI: MaxFrame’s AI Function Makes LLM-Powered Analytics Easy
This article introduces MaxFrame’s AI Function on Alibaba Cloud’s MaxCompute platform, detailing how built‑in large language models like Qwen 2.5 and DeepSeek‑R1 enable seamless text classification, information extraction, summarization, and more through simple Python APIs and distributed processing.
Background
Large language models (LLMs) are rapidly evolving and reshaping how we analyze, process, and use data, creating new opportunities across industries. However, selecting, deploying, and using these models often requires significant technical expertise and development cost, limiting users' ability to batch‑process massive datasets with AI.
MaxFrame is a distributed computing solution built on Alibaba Cloud’s MaxCompute platform for the Data + AI domain. Leveraging MaxCompute’s industry‑leading query engine, elastic compute, and massive storage, MaxFrame offers a Pandas‑compatible DataFrame layer that enables agile, efficient data cleaning, machine‑learning training, and offline model inference using familiar Python tools, delivering strong performance and cost‑effectiveness in typical user scenarios.
To make LLM capabilities ubiquitous on the big‑data platform and lower the barrier for AI‑driven data processing, MaxFrame has launched the AI Function feature. It ships ready‑to‑use models such as Qwen 2.5 and DeepSeek‑R1‑Distill‑Qwen, so users can call a simple programming interface without handling complex model deployment, and apply these models to massive MaxCompute tables for offline processing. Typical use cases include extracting structured information from text, summarization, translation, text quality assessment, and sentiment classification, greatly simplifying data‑processing pipelines and improving result quality.
AI Function Overview
The AI Function provides a straightforward generate interface that takes a MaxCompute table and a prompt template as parameters, along with the model to use. MaxFrame first partitions the table data, sets an appropriate concurrency level based on data size, and launches a worker group. Each worker renders the user‑provided prompt template against every row, builds the model input, invokes the locally hosted LLM for inference, and writes the inference result and status back to the MaxCompute table.
(Figure: overall architecture and workflow of the AI Function.)
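Conceptually, each worker runs a loop like the following over its partition (a simplified sketch of the flow described above, not MaxFrame's actual internals):

```python
# Simplified illustration of one worker's loop; not MaxFrame's actual internals.
# `local_llm` stands in for the worker's locally hosted model.
def process_partition(rows, prompt_template, local_llm):
    for row in rows:
        # Render the user-provided template with the row's column values,
        # e.g. "{query}" is replaced by row["query"].
        messages = [
            {"role": m["role"], "content": m["content"].format(**row)}
            for m in prompt_template
        ]
        try:
            response = local_llm.chat(messages)  # local model inference
            yield {"response": response, "status": "OK"}
        except Exception as exc:
            # Failures are recorded per row instead of failing the whole job.
            yield {"response": None, "status": f"ERROR: {exc}"}
```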
MaxFrame currently supports the following built‑in models:
- Qwen 2.5 series: Qwen2.5-7B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-0.5B-Instruct
- DeepSeek-R1-Distill-Qwen series: DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-1.5B
These models are hosted directly on the MaxCompute platform for offline inference, so users do not need to download or distribute model weights or manage API concurrency limits. Consequently, AI Function jobs can fully exploit MaxCompute's massive compute resources, achieving high token throughput and parallelism for large-scale text processing.
In addition to the built-in models, AI Function can invoke LLM APIs from Alibaba Cloud's DashScope platform (e.g., Qwen-Max or the full-size DeepSeek-R1). Users obtain a DashScope API key, configure a rate-limiting policy that respects the service's quotas, and configure the API in AI Function.
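As a rough sketch of what that configuration could look like (the class name and parameters below are illustrative assumptions, not a confirmed MaxFrame API; consult the MaxFrame documentation for the actual interface), the flow mirrors the managed-model examples later in this article:

```python
import os

# Hypothetical sketch: the class name and parameters are assumptions, not the
# confirmed MaxFrame API; see the MaxFrame documentation for the real names.
from maxframe.learn.contrib.llm.models import DashScopeTextLLM  # hypothetical

llm = DashScopeTextLLM(
    name="qwen-max",                         # DashScope model ID
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # key from the DashScope console
)
# From here, usage mirrors the built-in models:
# result_df = llm.generate(df, prompt_template=messages)
```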
Usage
Below is a simple Q&A example demonstrating how to use the AI Function interface.
Prerequisites:

- Enable MaxCompute.
- Install the latest MaxFrame client: `pip install maxframe -U`

Start a MaxFrame session:
```python
import os
import logging

from maxframe import new_session
from odps import ODPS

logging.basicConfig(level=logging.INFO)

# Connect to MaxCompute using credentials from environment variables;
# replace the project and endpoint with your own values.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point'
)
session = new_session(odps_entry=o)
```

Create a DataFrame with sample queries and generate answers using a Qwen 2.5 model:
```python
import maxframe.dataframe as md
from maxframe.learn.contrib.llm.models.managed import ManagedTextLLM

query_list = [
    "What is the average distance from Earth to the Sun?",
    "When did the American Revolutionary War start?",
    "What is the boiling point of water?",
    "How can I quickly relieve a headache?",
    "Who is the main character in the Harry Potter series?",
]
df = md.DataFrame({"query": query_list})

# Use a built-in Qwen 2.5 model hosted on MaxCompute
llm = ManagedTextLLM(name="qwen2.5-1.5b-instruct")

# "{query}" is rendered from the DataFrame column of the same name for each row
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please answer the following question: {query}"},
]

result_df = llm.generate(df, prompt_template=messages)
print(result_df.execute())
```

The generate method sends the DataFrame and prompts to the LLM; calling execute triggers computation on the MaxCompute cluster, automatically handling data sharding and parallel execution.
Application Cases
Three real‑world scenarios illustrate the power of MaxFrame AI Function.
Text Risk‑Control Classification – DeepSeek‑R1‑Distill‑Qwen
The DeepSeek‑R1‑Distill‑Qwen model provides deep reasoning capabilities, enabling text classification, sentiment analysis, and quality assessment with detailed chain‑of‑thought outputs, which is crucial for risk‑control tasks involving complex semantics.
Applying the model to the civil_comments dataset (Google Civil Comments) demonstrates batch risk classification:
```python
df = md.read_odps_table("civil_comments", index_col="index").head(10)
print(df.execute())
```

After generating risk scores with a prompt along the lines of the sketch below, the model's reasoning chain is stored in the reasoning_content column and can be extracted via flatjson for further analysis.
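A minimal sketch of that risk-scoring call (the model name, prompt wording, and label set here are illustrative assumptions):

```python
from maxframe.learn.contrib.llm.models.managed import ManagedTextLLM

# Illustrative sketch: the model name, prompt, and label set are assumptions.
llm = ManagedTextLLM(name="deepseek-r1-distill-qwen-7b")

messages = [
    {"role": "system", "content": "You are a content risk-control assistant."},
    # "{text}" is rendered from the comment column of the DataFrame
    {"role": "user", "content": (
        "Classify the following comment as SAFE, TOXIC, or SEVERE_TOXIC: {text}"
    )},
]

result_df = llm.generate(df, prompt_template=messages)
```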
```python
import numpy as np

# Merge the original text with the inference results
merged_df = md.merge(df, result_df, left_index=True, right_index=True)

# Pull the answer and the chain-of-thought out of the JSON response
merged_df["content"] = merged_df.response.mf.flatjson(
    ["$.choices[0].message.content"], dtype=np.dtype(np.str_)
)
merged_df["reasoning_content"] = merged_df.response.mf.flatjson(
    ["$.choices[0].message.reasoning_content"], dtype=np.dtype(np.str_)
)
print(merged_df[["text", "content", "reasoning_content"]].execute())
```

Structured Information Extraction – Qwen 2.5
This case shows how AI Function can extract structured resume data from unstructured text using a JSON schema.
```python
resumes = ["... (sample resume text) ..."]
df = md.DataFrame({"text": resumes})

from maxframe.learn.contrib.llm.models.managed import ManagedTextLLM

llm = ManagedTextLLM(name="qwen2.5-7b-instruct")

messages = [
    {"role": "system", "content": "You are an expert at extracting resume information into JSON with fields name, education, workExperience."},
    {"role": "user", "content": "Extract information from the following resume:\n```{text}```"},
]

# response_format constrains the model to emit JSON matching the schema
result = llm.generate(df, prompt_template=messages, params={
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "education": {"type": ["string", "array"]},
                "workExperience": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "position": {"type": "string"},
                            "company": {"type": "string"},
                            "duration": {"type": "string"},
                        },
                    },
                },
            },
            "required": ["education", "workExperience"],
        },
    },
    "repeat_penalty": 1.2,  # discourage repetitive output
})
print(result.execute())
```

Text Summarization – Qwen 2.5
The same model can also generate concise summaries and outlines for long documents:
```python
import numpy as np
import pandas as pd

from maxframe.learn.contrib.llm.models.managed import ManagedTextLLM

# df is assumed to hold the documents to summarize in its "text" column
llm = ManagedTextLLM(name="qwen2.5-7b-instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # "{text}" is rendered from the DataFrame's text column for each row
    {"role": "user", "content": "Summarize the following text in under 200 words and provide an outline:\n{text}"},
]

result = llm.generate(df, prompt_template=messages)
summary = result.response.mf.flatjson(
    ["$.output.choices[0].message.content"],
    dtypes=pd.Series([np.str_], index=["content"]),
)
print(summary.execute().fetch())
```

Conclusion and Outlook
Through these cases, MaxFrame’s AI Function demonstrates a powerful, flexible solution for text risk classification, structured information extraction, and summarization, all with minimal code and without worrying about model deployment, scaling, or maintenance. Its easy‑to‑use interface and distributed compute capabilities enable rapid AI‑driven data processing, boosting productivity while reducing costs.
Future enhancements will include multi‑modal built‑in models for image, audio, and video processing, support for custom fine‑tuned models, integration with Alibaba Cloud PAI for large‑parameter models, and richer debugging tools to help users craft effective prompts.
Stay tuned for upcoming features!
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.