How to Systematically Conduct Large Model Evaluation in Real-World Scenarios

This guide walks readers through a complete, business‑oriented workflow for evaluating large language models—from requirement analysis and test‑set design to metric definition, execution, result aggregation, and report generation—while addressing common challenges such as data imbalance, annotation quality, and automation.

Alibaba Cloud Developer

Jun 23, 2025

Overview

This document systematically introduces how to carry out large‑model evaluation in practical business scenarios, helping readers understand and master the end‑to‑end process: requirement analysis, evaluation‑set design and generation, metric setting, task execution, and report output.

Background

Many articles discuss large-model evaluation, but they tend to read like success stories and skip the step-by-step implementation. Existing platforms also rarely show how to upload evaluation sets or add new evaluation dimensions. This guide therefore takes a "user manual" approach.

Definition

Large-model evaluation, as scoped here, means quantitatively and comprehensively assessing model capabilities through well-designed tasks and datasets. The focus is on business-level effectiveness. Raw performance benchmarking is out of scope and is handled by separate stress-testing tools.

Scenarios

Model launch: evaluate capabilities before production to decide whether the model is ready.

Model upgrade or switch: compare old and new models when changing providers, sizes, or fine-tuned versions.

Model optimization: use bad-case analysis to improve knowledge bases, prompts, workflows, or fine-tuning.

Evaluation Challenges

Evaluation dimensions: design dimensions that best reflect model effectiveness and drive improvement.

Evaluation sets: create sets that simulate real-world usage, balancing scenario coverage.

Annotation: address annotator quality variance and high manual cost.

Business changes: adapt to evolving business requirements and model versions.

Evaluation Methodology

The evaluation process is divided into four stages with nine actions. The first seven actions focus on producing an evaluation report; the last two actions support continuous model improvement and eventual deployment decisions.

Requirement Analysis

Answer the key business questions first: which business the model serves, which processes and user roles are involved, what problem the model is meant to solve, and whether it is already in production.

Evaluation‑Set Design & Generation

Identify Data Sources Based on the Evaluation Goal

Typical evaluation‑set types include:

End-to-end set: real online logs (historical or double-run) to reflect actual scenario performance.

Layered set: bad-case regression, functional module tests, and security checks.

Confirm Scenario Scope

Combine business‑scenario analysis and technical‑architecture analysis to ensure all relevant flows are covered.

Confirm Evaluation‑Set Size

For online double‑run sets, consider duration (e.g., 3‑7 days) and traffic patterns; for offline sets, balance coverage, sample size, and cost.

Evaluation‑Set Draft Generation

Methods include extracting historical requests, mining bad-case logs, writing samples manually, and auto-generating samples with a large model (few-shot prompting). The original article weighs the pros and cons of each method.
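As a rough illustration of the few-shot auto-generation route, the sketch below assembles a prompt from a handful of seed samples. The function name `build_few_shot_prompt`, the `question`/`answer` keys, and the prompt wording are all assumptions made for this example, not something the source prescribes; the resulting string would be sent to whichever model is used for drafting.

```python
from typing import Dict, List


def build_few_shot_prompt(scene: str, examples: List[Dict[str, str]], n_new: int) -> str:
    """Assemble a few-shot prompt asking a model to draft new evaluation questions.

    `examples` is a list of seed samples, each with hypothetical keys
    "question" and "answer" (for instance, taken from historical requests).
    """
    if not examples:
        raise ValueError("few-shot prompting needs at least one seed example")

    lines = [
        f"You are drafting evaluation questions for the scenario: {scene}.",
        "Here are example questions with their reference answers:",
    ]
    for i, example in enumerate(examples, start=1):
        lines.append(f"Example {i}")
        lines.append(f"Question: {example['question']}")
        lines.append(f"Reference answer: {example['answer']}")
    lines.append(
        f"Write {n_new} new questions in the same style, one per line, without answers."
    )
    return "\n".join(lines)


seed_examples = [
    {"question": "How do I change my delivery address?", "answer": "Go to Orders and edit the address."},
    {"question": "Can I get a refund after 7 days?", "answer": "Refunds are accepted within 7 days only."},
]
draft_prompt = build_few_shot_prompt("e-commerce customer support", seed_examples, n_new=5)
```

Drafts produced this way are still drafts: they need review before they enter the evaluation set.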

Evaluation‑Set Optimization

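Desensitize the data: mask or remove sensitive fields before the set is shared or annotated.

Correct distribution imbalance so that no single scenario dominates the set.

Iterate continuously by adding newly discovered bad cases.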
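The first two optimization steps can be sketched as a small preprocessing pass. This is a minimal sketch under stated assumptions: the regular expressions, the `<EMAIL>` and `<NUMBER>` placeholders, the `scenario` key, and the per-scenario cap are illustrative choices, and real desensitization rules depend on the business's own compliance requirements.

```python
import random
import re
from collections import defaultdict

# Illustrative patterns only; production rules would be stricter and broader.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LONG_DIGITS_RE = re.compile(r"\d{7,}")  # phone numbers, IDs, card numbers


def desensitize(text: str) -> str:
    """Replace e-mail addresses and long digit runs with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return LONG_DIGITS_RE.sub("<NUMBER>", text)


def rebalance(samples, max_per_scenario, seed=0):
    """Downsample over-represented scenarios to at most `max_per_scenario` samples.

    Each sample is a dict with a hypothetical "scenario" key.
    """
    rng = random.Random(seed)  # fixed seed keeps the evaluation set reproducible
    by_scenario = defaultdict(list)
    for sample in samples:
        by_scenario[sample["scenario"]].append(sample)

    balanced = []
    for scenario in sorted(by_scenario):
        group = by_scenario[scenario]
        if len(group) > max_per_scenario:
            group = rng.sample(group, max_per_scenario)
        balanced.extend(group)
    return balanced


raw_samples = [{"scenario": "refund", "text": f"refund question {i}"} for i in range(5)]
raw_samples.append({"scenario": "shipping", "text": "where is my parcel"})
balanced_samples = rebalance(raw_samples, max_per_scenario=2)
masked = desensitize("mail a.b@example.com or call 13800138000")
```

Capping large groups is only one way to rebalance; upsampling rare scenarios or weighting them at scoring time are alternatives with different trade-offs.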

Evaluation‑Dimension Design

Define whether a single unified metric or multiple dimensions are needed, the number of rating levels, and the criteria for each level. Examples range from binary correctness to multi‑level scoring for translation quality (faithfulness, fluency, adequacy).

Model‑Effect Quantification

Aggregate dimension scores using weighted sums to produce an overall model score, allowing comparison across models or versions.
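A weighted sum of this kind might look like the sketch below. The dimension names, weights, and the choice to normalize by the total weight are assumptions for illustration; the source does not specify concrete values.

```python
def overall_score(dimension_scores, weights):
    """Combine per-dimension scores into one number with a weighted average.

    Dividing by the total weight means the weights do not have to sum to 1,
    and the result stays on the same scale as the dimension scores.
    """
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"no score for dimensions: {sorted(missing)}")

    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    weighted = sum(dimension_scores[dim] * w for dim, w in weights.items())
    return weighted / total_weight


# Hypothetical mean scores per dimension on a 1-5 scale.
scores = {"accuracy": 4.0, "fluency": 5.0, "safety": 3.0}
weights = {"accuracy": 0.5, "fluency": 0.2, "safety": 0.3}
model_score = overall_score(scores, weights)  # roughly 3.9
```

Using the same weights for every model or version under comparison is what makes the resulting scores comparable.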

Evaluation‑Task Design & Execution

Evaluation mode: manual, automatic (AI-based or metric-based), baseline.

Evaluation type: single-model vs. comparative.
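For the comparative type, one simple way to summarize results is a win/tie/loss breakdown over the same evaluation set. The sketch below assumes each model already has one score per sample, in the same order; the function name and the scores are hypothetical.

```python
from collections import Counter


def compare_models(scores_a, scores_b):
    """Return the share of samples where model A wins, ties, or loses against model B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same samples")
    if not scores_a:
        raise ValueError("need at least one scored sample")

    outcomes = Counter()
    for a, b in zip(scores_a, scores_b):
        if a > b:
            outcomes["win"] += 1
        elif a < b:
            outcomes["loss"] += 1
        else:
            outcomes["tie"] += 1

    n = len(scores_a)
    return {key: outcomes[key] / n for key in ("win", "tie", "loss")}


# Hypothetical per-sample scores for a new model (A) and the current one (B).
comparison = compare_models([5, 3, 4, 2], [4, 3, 2, 3])
```

With these example scores, model A wins on two of four samples, ties on one, and loses on one.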

AI Evaluation Prompt Example

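Your task is to score the quality of an AI assistant's reply.
You clearly understand that when a user issues an instruction in the [${scene}] scenario (defined as: ${scene_desc}), an AI assistant's reply should meet the following criteria (listed in descending order of importance):
[Criteria start]
${metric}
[Criteria end]
Scoring uses a ${max_score}-level scale (1-${max_score}). The meaning of each level is as follows:
[Level definitions start]
${score_desc}
[Level definitions end]
... (remainder omitted)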
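The `${...}` placeholders in this template happen to match the syntax of Python's standard `string.Template`, so filling them in can be sketched as follows. The template text here is a shortened English rendering of the visible part of the prompt; the omitted remainder is not reproduced. The function name, scenario, criteria, and level meanings are made up for illustration.

```python
from string import Template

# Shortened English rendering of the visible part of the prompt.
JUDGE_TEMPLATE = Template(
    "Your task is to score the quality of an AI assistant's reply.\n"
    "When a user issues an instruction in the [${scene}] scenario "
    "(defined as: ${scene_desc}), the reply should meet these criteria, "
    "listed from most to least important:\n"
    "[Criteria start]\n${metric}\n[Criteria end]\n"
    "Scoring uses a ${max_score}-level scale (1-${max_score}):\n"
    "[Level definitions start]\n${score_desc}\n[Level definitions end]\n"
)


def render_judge_prompt(scene, scene_desc, criteria, level_meanings):
    """Fill the template; `level_meanings` maps each integer level to its description."""
    if not criteria or not level_meanings:
        raise ValueError("criteria and level meanings must not be empty")

    metric = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    score_desc = "\n".join(
        f"{level}: {level_meanings[level]}" for level in sorted(level_meanings)
    )
    # substitute() raises KeyError if any placeholder is left unfilled.
    return JUDGE_TEMPLATE.substitute(
        scene=scene,
        scene_desc=scene_desc,
        metric=metric,
        max_score=max(level_meanings),
        score_desc=score_desc,
    )


judge_prompt = render_judge_prompt(
    scene="customer support",
    scene_desc="answering questions about existing orders",
    criteria=["Factually correct", "Polite"],
    level_meanings={1: "unusable", 2: "partly useful", 3: "fully correct"},
)
```

Keeping the criteria and level definitions as data rather than hard-coded text makes it easier to reuse one template across scenarios.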

Evaluation Report

The report summarizes the evaluation environment, the results (as tables and figures), the conclusions, and the next steps, such as refining the evaluation plan, remediating bad cases, and making deployment decisions.

Next Actions

Evaluation-plan optimization: continuously improve datasets, dimensions, and AI-evaluation prompts.

Bad-case optimization: address identified failure cases.

Decision on model launch or switch: use the aggregated score and business thresholds.
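The launch-or-switch decision can be expressed as a simple gate. This is a sketch under assumptions: the source only mentions an aggregated score and business thresholds, so the per-dimension floors, the function name, and all numbers below are hypothetical additions.

```python
def launch_decision(overall, dimension_scores, overall_threshold, dimension_floors):
    """Return (approved, reasons): approved only if every threshold is met.

    `dimension_floors` is an assumed extra safeguard: a minimum score for
    dimensions that must not be traded away by a high overall average.
    """
    reasons = []
    if overall < overall_threshold:
        reasons.append(f"overall score {overall} is below threshold {overall_threshold}")

    for dim, floor in dimension_floors.items():
        score = dimension_scores.get(dim)
        if score is None or score < floor:
            reasons.append(f"dimension '{dim}' score {score} is below floor {floor}")

    return (not reasons, reasons)


# Hypothetical numbers on a 1-5 scale.
approved, reasons = launch_decision(
    overall=3.9,
    dimension_scores={"accuracy": 4.0, "safety": 3.0},
    overall_threshold=3.5,
    dimension_floors={"safety": 4.0},
)
```

In this example the overall score clears its threshold, but the model is still held back because the safety dimension falls below its floor.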
