How to Conduct Platform‑Based Large Model Evaluation with PAI
This guide explains how to use Alibaba Cloud PAI to prepare datasets, select open‑source or fine‑tuned models, create evaluation tasks, configure resources, and view detailed metrics such as ROUGE and BLEU. It also covers comparing results across multiple models on both custom and public datasets.
Background
In the era of large models, systematic evaluation is essential for measuring performance, guiding model selection, and accelerating AI innovation. PAI provides a platform‑based best‑practice for large‑model evaluation.
Best‑Practice Overview
The practice covers:
Preparing and selecting evaluation datasets (public like MMLU, C‑Eval, or custom enterprise data).
Choosing suitable open‑source or fine‑tuned models.
Creating evaluation tasks and selecting appropriate metrics.
Interpreting results in single‑task or multi‑task scenarios.
Platform Highlights
End‑to‑end evaluation chain without code development; supports mainstream open‑source models and fine‑tuned versions.
Custom dataset upload with 10+ built‑in NLP metrics; one‑click result visualization.
Supports many public datasets with official metrics and radar‑chart display.
Multi‑model, multi‑task simultaneous evaluation with chart‑based comparison.
Transparent, reproducible results; the evaluation code is open‑sourced.
Scenario 1: Custom Dataset Evaluation for Enterprise Developers
Prerequisites
Activate PAI and create a default workspace; if using custom datasets, create an OSS bucket for storage.
Prepare Custom Evaluation Set
Provide a JSONL file with one JSON object per line, where question denotes the question column and answer denotes the answer column. Example line:
{"question": "中国发明了造纸术,是否正确?", "answer": "正确"}
(In English: "Is it correct that China invented papermaking?" / "Correct".)
Upload the file to OSS and create a dataset via the AI Asset Management console.
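Before uploading, it can help to generate and sanity-check the file locally. The sketch below is a minimal, self-contained example; the record contents are illustrative, and the actual upload to OSS happens separately via the console or OSS tooling.

```python
import json

# Illustrative records; the field names "question" and "answer"
# match the columns PAI expects in a custom evaluation set.
records = [
    {"question": "中国发明了造纸术,是否正确?", "answer": "正确"},
    {"question": "Water boils at 100 °C at sea level, true or false?", "answer": "True"},
]

def write_jsonl(path, rows):
    """Write one JSON object per line (JSONL), as required for upload."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def validate_jsonl(path):
    """Check that every line parses and contains the two required keys."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            row = json.loads(line)
            missing = {"question", "answer"} - row.keys()
            if missing:
                raise ValueError(f"line {i} is missing keys: {missing}")
    return True

write_jsonl("eval_set.jsonl", records)
print(validate_jsonl("eval_set.jsonl"))  # prints True when the file is well-formed
```

A validation pass like this catches malformed lines before the dataset is registered, which is cheaper than discovering the problem after an evaluation task fails.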
Select a Model
In the PAI console, navigate to Quick Start → Model List, browse model descriptions, and click the Evaluate button for supported models (Hugging Face AutoModelForCausalLM).
Create Evaluation Task
On the model detail page, click Evaluate and configure the task:
Dataset: choose the custom dataset created earlier.
Result output path: specify an OSS location.
Resource type: General Compute.
Resource group: Public Resource.
Task resource: for ~7B models, ecs.gn5-c8g1.4xlarge (16 vCPU, 120 GiB, NVIDIA P100 × 2) is recommended.
After creation, resources are allocated automatically and the task runs to completion; the status then changes to Success.
View Evaluation Results
In Task Management → Model Evaluation, click View Report. The report shows radar charts for metrics such as ROUGE‑1‑F, ROUGE‑2‑P, and BLEU‑1 through BLEU‑4, along with detailed per‑sample scores. Results are also saved to the specified OSS path.
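To interpret numbers like ROUGE‑1‑F in the report, it can help to see how the metric is computed. The sketch below is an illustrative re‑implementation of ROUGE‑1 over whitespace tokens, not PAI's own metric code, which may differ in tokenization and smoothing details.

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1 precision/recall/F1 over whitespace-split unigrams.
    Overlap is the clipped count of unigrams shared by both texts."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    p = overlap / len(cand) if cand else 0.0   # precision: overlap / candidate length
    r = overlap / len(ref) if ref else 0.0     # recall: overlap / reference length
    f = 2 * p * r / (p + r) if p + r else 0.0  # F1: harmonic mean of p and r
    return {"rouge-1-p": p, "rouge-1-r": r, "rouge-1-f": f}

scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(round(scores["rouge-1-f"], 3))  # → 0.833 (5 of 6 unigrams overlap)
```

ROUGE‑2 follows the same pattern over bigrams, and BLEU‑n adds clipped n‑gram precision with a brevity penalty; the per‑sample scores in the report are values of this kind averaged over the evaluation set.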
Scenario 2: Public Dataset Evaluation for Algorithm Researchers
Researchers typically use public datasets to benchmark open‑source or fine‑tuned models. PAI integrates datasets such as MMLU, TriviaQA, HellaSwag, GSM8K, C‑Eval, CMMLU, and TruthfulQA, each with official metrics.
Supported Public Datasets
MMLU (166 MB, 14,042 samples, knowledge)
TriviaQA (14.3 MB, 17,944 samples, knowledge)
C‑Eval (1.55 MB, 12,342 samples, Chinese)
CMMLU (1.08 MB, 11,582 samples, Chinese)
GSM8K (4.17 MB, 1,319 samples, mathematics)
HellaSwag (47.5 MB, 10,042 samples, reasoning)
TruthfulQA (0.284 MB, 816 samples, safety)
Select a Model and Create Task
Follow the same steps as in Scenario 1 to locate a model, click Evaluate, and create a task using a public dataset (e.g., MMLU). Configure resources similarly.
Result Visualization and Comparison
After the task finishes, the report displays radar charts for domain‑level scores (averaged across datasets in the same domain) and individual dataset scores. Multiple tasks can be compared on a single page by selecting tasks in the Model Evaluation list and clicking Compare.
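The per‑sample results written to OSS can also be aggregated offline for a custom comparison. The sketch below assumes a hypothetical file layout, one JSON object of metric values per line per model, which is a stand‑in for whatever format the platform actually writes.

```python
import json

def average_metrics(path):
    """Average per-sample metric dicts from a JSONL results file.
    Assumed (hypothetical) layout: one {"rouge-1-f": ..., ...} per line."""
    totals, n = {}, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            for metric, value in json.loads(line).items():
                totals[metric] = totals.get(metric, 0.0) + value
            n += 1
    return {m: total / n for m, total in totals.items()} if n else {}

# Side-by-side comparison of downloaded result files (hypothetical names):
# for name in ("model_a.jsonl", "model_b.jsonl"):
#     print(name, average_metrics(name))
```

Averaging locally like this is a quick cross-check against the chart-based comparison in the console, and makes it easy to feed the same numbers into a custom report.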
Cost Considerations
PAI Quick Start itself is free, but evaluation tasks may incur DLC charges. Using OSS for custom dataset storage also incurs standard OSS fees. Refer to the respective billing documentation for details.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.