Unlock AI-Powered Data Processing with MaxFrame’s AI Function
This article introduces MaxFrame’s AI Function, a new feature built on MaxCompute that integrates large language models such as Qwen 2.5 and DeepSeek‑R1‑Distill‑Qwen. By abstracting away model deployment, it enables scalable text classification, information extraction, summarization, translation, and other AI-driven data processing tasks on massive datasets.
Background
Large language models (LLMs) are rapidly evolving and reshaping how we analyze, process, and use data, creating new opportunities across industries. However, selecting, deploying, and using these models often requires significant technical expertise and development cost, limiting users’ ability to batch‑process massive datasets with AI.
MaxFrame and AI Function Overview
MaxFrame is Alibaba Cloud’s distributed computing solution for the Data + AI domain, built on the MaxCompute platform. It offers a Pandas‑compatible DataFrame layer that enables agile, efficient data cleaning, machine‑learning training, and offline model inference using familiar Python APIs.
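Because the DataFrame layer is pandas‑compatible, the operations MaxFrame supports can be previewed with plain pandas. The sketch below uses pandas purely to illustrate the shared API surface; in MaxFrame, the same calls build a lazy plan that executes distributed on MaxCompute.

```python
import pandas as pd

# Pandas-style API that MaxFrame mirrors with its own DataFrame type.
df = pd.DataFrame({"user": ["a", "b", "c"], "score": [10, 35, 20]})

# Familiar cleaning/transforming operations carry over unchanged:
# filter rows, then derive a new column from the filtered frame.
cleaned = df[df["score"] > 15].assign(
    score_pct=lambda d: d["score"] / d["score"].sum()
)
print(cleaned)
```

In MaxFrame, `execute()` would then trigger the distributed computation; in local pandas the result is materialized immediately.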
To make LLM capabilities universally accessible on the data platform, MaxFrame introduces the AI Function feature. It ships ready‑to‑use models such as Qwen 2.5 and DeepSeek‑R1‑Distill‑Qwen, abstracting away model deployment complexities. Users call a simple generate interface, passing a table and a prompt template, and MaxFrame handles data partitioning, concurrency, and distributed inference on MaxCompute.
Supported Models
Qwen 2.5 series: 7B‑instruct, 3B‑instruct, 1.5B‑instruct, 0.5B‑instruct
DeepSeek‑R1‑Distill‑Qwen series: 14B, 7B, 1.5B
These models are hosted offline within MaxCompute, eliminating the need for users to download, distribute, or manage API rate limits. AI Function can also call external LLM APIs from Alibaba Cloud DashScope for larger or more diverse models.
Usage Example
Below is a minimal example that creates a DataFrame of questions, defines a prompt template, and generates answers using the Qwen 2.5‑1.5B‑instruct model.
import os
from maxframe import new_session
from odps import ODPS
import logging
logging.basicConfig(level=logging.INFO)
# Initialize ODPS client
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point'
)
# Create MaxFrame session
session = new_session(odps_entry=o)
import maxframe.dataframe as md
query_list = [
    "地球距离太阳的平均距离是多少?",    # What is the average distance from Earth to the Sun?
    "美国独立战争是从哪一年开始的?",    # In what year did the American Revolutionary War begin?
    "什么是水的沸点?",                  # What is the boiling point of water?
    "如何快速缓解头痛?",                # How can a headache be relieved quickly?
    "谁是《哈利·波特》系列中的主角?"    # Who is the protagonist of the Harry Potter series?
]
df = md.DataFrame({"query": query_list})
from maxframe.learn.contrib.llm.models.managed import ManagedTextLLM
llm = ManagedTextLLM(name="qwen2.5-1.5b-instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # "Please answer the following question: {query}"
    {"role": "user", "content": "请回答如下问题:{query}"}
]
result_df = llm.generate(df, prompt_template=messages)
print(result_df.execute())
The generate call renders the prompt for each row, runs inference on the selected model, and writes results back to a MaxCompute table.
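The per‑row behavior can be pictured as simple template substitution followed by a JSON response per row. The sketch below is plain Python for illustration, not MaxFrame internals; the response shape matches the `$.choices[0].message.content` path this article extracts later.

```python
import json

# Each row's columns are substituted into the user prompt's
# placeholders (e.g. {query}) before inference.
rows = [{"query": "What is the boiling point of water?"}]
template = "Please answer the following question: {query}"
rendered = [template.format(**row) for row in rows]
print(rendered[0])
# → Please answer the following question: What is the boiling point of water?

# The result column then holds one JSON response per row, with the
# model's answer at choices[0].message.content.
sample_response = json.dumps(
    {"choices": [{"message": {"content": "100 °C at sea level."}}]}
)
answer = json.loads(sample_response)["choices"][0]["message"]["content"]
```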
Application Cases
Text Risk Classification (DeepSeek‑R1‑Distill‑Qwen)
Using the 14B DeepSeek model, the article demonstrates classification of comments from the civil_comments dataset into categories such as political, violent, pornographic, scam, rumor, or safe, while also extracting the model’s reasoning chain.
import numpy as np
df = md.read_odps_table("civil_comments", index_col="index").head(10)
llm = ManagedTextLLM(name="DeepSeek-R1-Distill-Qwen-14B")
# Prompt: "As a text risk reviewer, assess how healthy the following
# internet text is and give your reasons..."
messages = [{"role": "user", "content": "请作为文本风控评估员,对以下互联网文本进行健康程度评估,并给出理由...\n{text}"}]
result_df = llm.generate(df, prompt_template=messages)
merged_df = md.merge(df, result_df, left_index=True, right_index=True)
merged_df["content"] = merged_df.response.mf.flatjson(["$.choices[0].message.content"], dtype=np.str_)
merged_df["reasoning_content"] = merged_df.response.mf.flatjson(["$.choices[0].message.reasoning_content"], dtype=np.str_)
print(merged_df[["text", "content", "reasoning_content"]].execute())
Resume Information Extraction (Qwen 2.5)
By defining a JSON schema and using the response_format parameter, the AI Function extracts structured name, education, and work‑experience fields from synthetic resume texts.
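The article elides the actual schema, but a concrete response_format payload might look like the following. The field names here are illustrative assumptions, chosen to match the name, education, and work‑experience fields described above.

```python
# Hypothetical JSON-Schema-style payload for resume extraction;
# all property names below are assumptions, not the article's schema.
params = {
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "education": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "school": {"type": "string"},
                            "degree": {"type": "string"},
                        },
                    },
                },
                "work_experience": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "company": {"type": "string"},
                            "title": {"type": "string"},
                        },
                    },
                },
            },
            "required": ["name"],
        },
    }
}
```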
llm = ManagedTextLLM(name="qwen2.5-7b-instruct")
messages = [
    # "You are a resume information extraction expert..."
    {"role": "system", "content": "你是一个简历信息提取专家..."},
    # "Extract information from the following resume:"
    {"role": "user", "content": "从如下简历中抽取信息\n```{text}```"}
]
params = {"response_format": {"type": "json_object", "schema": {...}}}
result = llm.generate(df, prompt_template=messages, params=params)
print(result.execute())
Text Summarization (Qwen 2.5)
The model generates a concise summary (within 200 characters) and an outline for long documents, demonstrating its utility for content condensation.
import numpy as np
import pandas as pd
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # "Summarize the full text into an abstract of at most 200
    # characters, and generate an outline for the text."
    {"role": "user", "content": "请概括全文,总结为 200 字以内的摘要,并为文本生成大纲。"}
]
result = llm.generate(df, prompt_template=messages)
summary = result.response.mf.flatjson(["$.output.choices[0].message.content"], dtypes=pd.Series([np.str_], index=["content"]))
print(summary.execute().fetch())
Benefits
Easy to use: import data with read_odps_table, define prompts, and call generate to run distributed inference with a few lines of code.
Low operational cost: models are managed offline on MaxCompute, removing the need for manual deployment, monitoring, and scaling.
Future Outlook
Planned enhancements include multimodal built‑in models, support for user‑uploaded fine‑tuned models, integration with Alibaba Cloud PAI for large‑parameter models, and richer debugging tools for prompt engineering.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.