How to Conduct Platform‑Based Large Model Evaluation with PAI

This guide explains how to use Alibaba Cloud PAI to prepare datasets, select open‑source or fine‑tuned models, create evaluation tasks, configure resources, view detailed metrics such as ROUGE and BLEU, and compare results across multiple models for both custom and public datasets.

Alibaba Cloud Big Data AI Platform

Background

In the era of large models, systematic evaluation is essential for measuring performance, guiding model selection, and accelerating AI innovation. PAI provides a platform-based best practice for large-model evaluation.

Best‑Practice Overview

The practice covers:

Preparing and selecting evaluation datasets (public like MMLU, C‑Eval, or custom enterprise data).

Choosing suitable open‑source or fine‑tuned models.

Creating evaluation tasks and selecting appropriate metrics.

Interpreting results in single‑task or multi‑task scenarios.

Platform Highlights

End‑to‑end evaluation chain without code development; supports mainstream open‑source models and fine‑tuned versions.

Custom dataset upload with 10+ built‑in NLP metrics; one‑click result visualization.

Supports many public datasets with official metrics and radar‑chart display.

Multi‑model, multi‑task simultaneous evaluation with chart‑based comparison.

Transparent, reproducible results; the evaluation code is open‑sourced.

Scenario 1: Custom Dataset Evaluation for Enterprise Developers

Prerequisites

Activate PAI and create a default workspace; if using custom datasets, create an OSS bucket for storage.

Prepare Custom Evaluation Set

Provide a JSONL file in which each line is a single JSON object, where the question field holds the question column and the answer field holds the answer column. Example line:

{"question": "China invented papermaking. Is this correct?", "answer": "Correct"}

Upload the file to OSS and create a dataset via the AI Asset Management console.
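As a quick sketch, the JSONL file can be generated and validated locally with a few lines of Python before uploading it to OSS (the records and the `eval_set.jsonl` filename here are illustrative, not a PAI requirement):

```python
import json

# Hypothetical example records; the "question" and "answer" keys must
# match the column names PAI expects in a custom evaluation set.
records = [
    {"question": "China invented papermaking. Is this correct?", "answer": "Correct"},
    {"question": "What is 2 + 2?", "answer": "4"},
]

# JSONL: one JSON object per line, UTF-8 encoded.
with open("eval_set.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Validation pass: every line must parse and contain both columns.
with open("eval_set.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        assert "question" in rec and "answer" in rec
```

A validation loop like this catches malformed lines before the evaluation task fails remotely.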

Select a Model

In the PAI console, navigate to Quick Start → Model List, browse the model descriptions, and click the Evaluate button for supported models (those loadable as a HuggingFace AutoModelForCausalLM).

Create Evaluation Task

On the model detail page, click Evaluate and configure the task:

Dataset: choose the custom dataset created earlier.

Result output path: specify an OSS location.

Resource type: General Compute.

Resource group: Public Resource.

Task resource: for ~7B models, ecs.gn5-c8g1.4xlarge (16 vCPU, 120 GiB, NVIDIA P100 × 2) is recommended.

After creation, resources are allocated automatically and the task runs to completion; its status then changes to Success.
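A rough sizing check explains the two-GPU recommendation above: fp16 weights alone for a 7B-parameter model take about 13 GiB, and inference overhead (activations, KV cache) pushes the total past a single P100's 16 GiB. A back-of-the-envelope sketch, with the 30% headroom factor being an illustrative assumption:

```python
def estimate_weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory needed just for model weights, in GiB.

    bytes_per_param defaults to 2 (fp16/bf16).
    """
    return n_params * bytes_per_param / (1024 ** 3)

# A 7B-parameter model in fp16:
weights = estimate_weight_memory_gib(7e9)   # ~13 GiB for weights alone
# Assumed ~30% headroom for activations and KV cache during inference:
total = weights * 1.3                       # exceeds one 16 GiB P100
```

Since the estimated total exceeds a single P100's 16 GiB, an instance with two P100s is a safer fit for ~7B models.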

View Evaluation Results

In Task Management → Model Evaluation, click View Report. The report shows radar charts for metrics such as ROUGE‑1‑F, ROUGE‑2‑P, and BLEU‑1 through BLEU‑4, along with detailed per‑sample scores. Results are also saved to the specified OSS path.
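To make the ROUGE numbers in the report concrete, here is a minimal, self-contained sketch of ROUGE‑1 F1 (unigram overlap with clipped counts). PAI's open-sourced evaluation code computes the full metric family; this whitespace-tokenized version is only illustrative:

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a reference and a candidate string."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each candidate token counts at most as often
    # as it appears in the reference.
    overlap = sum(min(c, ref_counts[t]) for t, c in cand_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat on the mat", "the cat lay on the mat"))
```

ROUGE‑1‑P and ROUGE‑1‑R in the report correspond to the precision and recall terms above; ROUGE‑2 repeats the same computation over bigrams.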

Scenario 2: Public Dataset Evaluation for Algorithm Researchers

Researchers typically use public datasets to benchmark open‑source or fine‑tuned models. PAI integrates datasets such as MMLU, TriviaQA, HellaSwag, GSM8K, C‑Eval, CMMLU, and TruthfulQA, each with official metrics.

Supported Public Datasets

MMLU (166 MB, 14,042 samples, knowledge)

TriviaQA (14.3 MB, 17,944 samples, knowledge)

C‑Eval (1.55 MB, 12,342 samples, Chinese)

CMMLU (1.08 MB, 11,582 samples, Chinese)

GSM8K (4.17 MB, 1,319 samples, mathematics)

HellaSwag (47.5 MB, 10,042 samples, reasoning)

TruthfulQA (0.284 MB, 816 samples, safety)

Select a Model and Create Task

Follow the same steps as in Scenario 1 to locate a model, click Evaluate, and create a task using a public dataset (e.g., MMLU). Configure resources as before.

Result Visualization and Comparison

After the task finishes, the report displays radar charts for domain‑level scores (averaged across datasets in the same domain) and individual dataset scores. Multiple tasks can be compared on a single page by selecting them in the Model Evaluation list and clicking Compare.
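The domain-level averaging behind the radar chart can be sketched as follows; the per-dataset accuracies below are hypothetical placeholders, not real benchmark results:

```python
# Hypothetical per-dataset accuracies for one evaluated model, grouped
# by the domain each public dataset belongs to.
scores = {
    "knowledge":   {"MMLU": 0.62, "TriviaQA": 0.58},
    "chinese":     {"C-Eval": 0.55, "CMMLU": 0.53},
    "mathematics": {"GSM8K": 0.41},
}

# Domain-level score = mean of the dataset scores in that domain;
# each domain becomes one axis of the radar chart.
domain_scores = {
    domain: sum(ds.values()) / len(ds) for domain, ds in scores.items()
}
print(domain_scores)
```

Comparing several models is then a matter of plotting each model's `domain_scores` on the same radar axes, which is what the Compare view does for selected tasks.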

Cost Considerations

PAI Quick Start itself is free, but evaluation tasks may incur DLC charges. Using OSS for custom dataset storage also incurs standard OSS fees. Refer to the respective billing documentation for details.

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
