Hands‑On Large‑Model Evaluation: Dataset and Automated Scoring with EvalScope

This article walks through practical large‑model evaluation using the EvalScope platform, covering dataset‑based testing, multi‑dataset aggregation, custom data creation, the BLEU and ROUGE metrics, and how to employ a judge LLM for automated, quantifiable scoring.

Fun with Large Models

May 28, 2026

Introduction

The previous article introduced the full evaluation framework for large models. This installment focuses on the practical side: how to run dataset‑based evaluations and automated scoring in real production scenarios.

EvalScope: an end‑to‑end evaluation platform

EvalScope, released by the ModelScope community, integrates model loading, dataset handling, and result visualization. Its core features are:

Comprehensive coverage : built‑in benchmarks such as MMLU, CMMLU, C‑Eval, GSM8K, HumanEval, supporting LLMs, multimodal models, embeddings, rerankers, CLIP, and AIGC models.

Ease of use : both command‑line and Python APIs; a single command launches a standard dataset evaluation.

Rich functionality : besides accuracy, it provides throughput/latency profiling and visual reports.

Architecture

EvalScope consists of three layers:

Input side : configure the target model (an API endpoint or a checkpoint loaded locally with transformers) and select a dataset (built‑in or custom).

Component side :

Model Adapter – normalizes outputs from different model back‑ends.

Data Adapter – converts raw inputs to the format required by each benchmark.

Evaluation Backend – supports four modes:

Native – the default engine with single‑model, arena, and baseline comparison modes.

OpenCompass – an integrated third‑party framework.

VLMEvalKit – for multimodal tasks.

ThirdParty – e.g., ToolBench, RAGEval.

Output side : generates a report with accuracy metrics and visual charts, storing predictions, reports, and an HTML file for easy comparison.

Environment setup

Using the Lab4AI environment, create a Conda environment and install EvalScope with all optional features:

# Create conda environment
conda create -n evalscope python=3.12
conda activate evalscope
# Install EvalScope with full functionality
pip install "evalscope[all]"
# Verify installation
evalscope --help

Start the model service (this example uses vLLM to serve Qwen2.5-0.5B-Instruct on port 6666):

vllm serve ./Qwen2_5_0_5/ \
  --served-model-name Qwen2.5-0.5B \
  --max-model-len 8048 \
  --gpu-memory-utilization 0.9 \
  --port 6666
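Before launching an evaluation it can help to confirm that the endpoint answers. The following is a minimal sketch, assuming the server exposes the OpenAI-style model listing at /v1/models; the helper names models_url, extract_model_ids, and list_served_models are hypothetical and not part of vLLM or EvalScope.

```python
import json
import urllib.request
from typing import List


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def extract_model_ids(payload: dict) -> List[str]:
    """Pull the model ids out of an OpenAI-style listing response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str, timeout: float = 5.0) -> List[str]:
    """Query the server; raises URLError if the service is not reachable."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))


# Usage (requires the service started above to be running):
# print(list_served_models("http://127.0.0.1:6666/v1"))
```

If the returned list contains Qwen2.5-0.5B, the name passed as model in the evaluation configs should resolve on the server side.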

1.3.1 Single‑dataset evaluation

To assess the mathematical ability of Qwen2.5-0.5B-Instruct on the GSM8K benchmark (accuracy metric), run:

from evalscope.run import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='Qwen2.5-0.5B',
    api_url='http://127.0.0.1:6666/v1',
    api_key='EMPTY',
    datasets=['gsm8k'],
    limit=50  # sample 50 items for quick demo
)
run_task(task_cfg=task_cfg)

The script downloads the dataset to ~/.cache/modelscope/hub/datasets, runs the evaluation, and reports an accuracy of 50% on this 50‑item sample. Results are saved under outputs, with predictions and reports subfolders and a visual report.html.
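To pick up the saved results programmatically, a small helper along these lines may be enough. It is a sketch under two assumptions that you should check against your own outputs folder: each run gets its own subdirectory whose name sorts chronologically (for example a timestamp), and the reports subfolder holds JSON files. The function names are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def latest_run_dir(outputs: Path) -> Optional[Path]:
    """Return the last run directory by name (assumes sortable timestamp names)."""
    runs = sorted(p for p in outputs.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def report_files(run_dir: Path) -> List[Path]:
    """List JSON files stored anywhere under the run's reports folder."""
    return sorted((run_dir / "reports").rglob("*.json"))


# Usage:
# run = latest_run_dir(Path("outputs"))
# if run is not None:
#     for path in report_files(run):
#         print(path)
```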

Evaluating a subset of a multi‑category dataset

When a dataset such as ceval contains many sub‑categories, you can select only the desired ones via the dataset_args parameter:

task_cfg = TaskConfig(
    model='Qwen2.5-0.5B',
    api_url='http://127.0.0.1:6666/v1',
    api_key='EMPTY',
    datasets=['ceval'],
    limit=10,
    dataset_args={
        'ceval': {
            'subset_list': ['computer_network', 'operating_system']
        }
    }
)
run_task(task_cfg=task_cfg)

The resulting report shows accuracy per sub‑category; for example, the model performs poorly on computer_network.

1.3.2 Multi‑dataset aggregation

To check that a model's strength is not confined to a single domain, combine several benchmarks (e.g., ceval and gsm8k) in one run:

task_cfg = TaskConfig(
    model='Qwen2.5-0.5B',
    api_url='http://127.0.0.1:6666/v1',
    api_key='EMPTY',
    datasets=['ceval', 'gsm8k'],
    limit=10,
    dataset_args={
        'ceval': {'subset_list': ['computer_network', 'operating_system']},
        'gsm8k': {'few_shot_num': 0}
    },
    generation_config={
        'max_tokens': 2048,
        'temperature': 0.0,
        'top_p': 1.0,
        'do_sample': False
    }
)
run_task(task_cfg=task_cfg)

The combined report visualizes each dataset’s accuracy side‑by‑side.
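When several datasets are reported together, a single headline number depends on how the per-dataset accuracies are averaged. The sketch below contrasts a macro average (each dataset counts equally) with a micro average (each sample counts equally). The function names and the figures in the usage comment are illustrative only; they are not EvalScope output.

```python
from typing import Dict


def macro_average(accuracy: Dict[str, float]) -> float:
    """Unweighted mean: every dataset contributes equally."""
    return sum(accuracy.values()) / len(accuracy)


def micro_average(accuracy: Dict[str, float], counts: Dict[str, int]) -> float:
    """Sample-weighted mean: larger datasets contribute more."""
    total = sum(counts[name] for name in accuracy)
    return sum(accuracy[name] * counts[name] for name in accuracy) / total


# Usage with made-up numbers:
# acc = {"ceval": 0.5, "gsm8k": 1.0}
# n = {"ceval": 20, "gsm8k": 10}
# macro_average(acc)      # 0.75
# micro_average(acc, n)   # (0.5*20 + 1.0*10) / 30
```

The two averages diverge whenever dataset sizes differ, so it is worth stating which one a report uses.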

1.3.3 Custom dataset evaluation

For tasks without a predefined benchmark (e.g., domain‑specific QA), you can supply a jsonl file. Two formats are supported:

Multiple‑choice : fields question, A ‑ D, answer. Example:

{"id": "1", "question": "Generally speaking, the number of amino acid types that make up animal proteins is ____", "A": "4", "B": "22", "C": "20", "D": "19", "answer": "C"}

Open‑ended QA : fields question (or query) and response. Example:

{"query": "Which mountain is the highest in the world?", "response": "It is Mount Everest"}

Place the file under a directory (e.g., mcq/example_val.jsonl) and reference it via dataset_args:

task_cfg = TaskConfig(
    model='Qwen2.5-0.5B',
    api_url='http://127.0.0.1:6666/v1',
    datasets=['general_mcq'],
    dataset_args={
        'general_mcq': {
            'local_path': './mcq',
            'subset_list': ['example']
        }
    }
)
run_task(task_cfg=task_cfg)

The same approach works for open‑ended QA datasets.
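Malformed records are a common reason for a custom evaluation to fail, so it can be useful to validate the file while writing it. The sketch below checks the multiple-choice fields described in this section and writes one JSON object per line. validate_mcq and write_jsonl are hypothetical helpers, and the sample record is invented for illustration.

```python
import json
from pathlib import Path
from typing import Iterable

MCQ_FIELDS = {"question", "A", "B", "C", "D", "answer"}


def validate_mcq(record: dict) -> None:
    """Raise ValueError if a multiple-choice record is incomplete."""
    missing = MCQ_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["answer"] not in {"A", "B", "C", "D"}:
        raise ValueError(f"answer must be A-D, got {record['answer']!r}")


def write_jsonl(records: Iterable[dict], path: str) -> int:
    """Validate and write records, one JSON object per line; return the count."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with target.open("w", encoding="utf-8") as handle:
        for record in records:
            validate_mcq(record)
            # ensure_ascii=False keeps non-English text readable in the file
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


# Usage:
# write_jsonl([{"question": "2 + 2 = ____", "A": "3", "B": "4",
#               "C": "5", "D": "6", "answer": "B"}], "mcq/example_val.jsonl")
```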

2 BLEU and ROUGE: automatic judges for generated text

BLEU originates from machine translation and ROUGE from summarization. BLEU measures precision (the share of generated n‑grams that appear in the reference) and applies a brevity penalty to discourage overly short outputs. ROUGE measures recall (the share of reference n‑grams covered by the generation).

2.1 n‑gram basics

An n‑gram is a contiguous sequence of n tokens. For the sentence cat sits on mat:

1‑gram: cat, sits, on, mat
2‑gram: cat sits, sits on, on mat
3‑gram: cat sits on, sits on mat
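The same enumeration can be written in a few lines. This is a minimal sketch that assumes whitespace tokenization; real evaluations usually apply a proper tokenizer first.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "cat sits on mat".split()
print(ngrams(tokens, 1))  # four 1-grams
print(ngrams(tokens, 2))  # three 2-grams
print(ngrams(tokens, 3))  # two 3-grams
```

A sentence of k tokens yields k - n + 1 n‑grams, and none once n exceeds k.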

2.2 BLEU (precision‑oriented)

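BLEU computes the geometric mean of clipped n‑gram precisions (usually up to 4‑gram) and multiplies it by a brevity penalty (BP). Example: reference the cat is on the mat, model output the the the the the. Because the occurs only twice in the reference, only 2 of the 5 output tokens count as matches, so p₁ = 2/5 = 0.4. The output has 5 tokens against 6 in the reference, so BP = exp(1 − 6/5) ≈ 0.82, and the unigram‑only score is BLEU‑1 ≈ 0.82 × 0.4 ≈ 0.33. Clipping penalizes the repetition and BP penalizes the short length. With the standard 4‑gram setting the score would be 0, because no higher‑order n‑gram of the output appears in the reference.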
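The figures in this example can be reproduced with a short sketch. It implements clipped n‑gram precision and the brevity penalty for a single reference, assuming whitespace tokenization; the 0.33 value corresponds to the unigram-only score. This is an illustration, not a replacement for an established BLEU implementation.

```python
import math
from collections import Counter
from typing import Sequence


def clipped_precision(candidate: Sequence[str], reference: Sequence[str], n: int = 1) -> float:
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """1.0 for long-enough outputs, exp(1 - r/c) for shorter ones."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


reference = "the cat is on the mat".split()
candidate = "the the the the the".split()
p1 = clipped_precision(candidate, reference, 1)          # 0.4
bp = brevity_penalty(len(candidate), len(reference))     # about 0.819
print(f"p1={p1:.2f} BP={bp:.3f} BLEU-1={bp * p1:.3f}")   # BLEU-1 is about 0.327
print(clipped_precision(candidate, reference, 2))        # 0.0, no bigram matches
```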

2.3 ROUGE (recall‑oriented)

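ROUGE‑N = (matched n‑grams) / (total n‑grams in reference). Taking the sentence from section 2.1, cat sits on mat, as the reference: if the generated text is mat cat sits, all three generated tokens appear among the four reference tokens, so ROUGE‑1 = 3/4 = 0.75. Only one of the three reference 2‑grams (cat sits) is reproduced, so ROUGE‑2 = 1/3 ≈ 0.33.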
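A minimal sketch of ROUGE‑N recall for a single reference follows, again assuming whitespace tokenization. It counts how many reference n‑grams are reproduced, clipping by the number of times each appears in the generated text.

```python
from collections import Counter
from typing import Sequence


def rouge_n(candidate: Sequence[str], reference: Sequence[str], n: int = 1) -> float:
    """Recall: matched reference n-grams divided by all reference n-grams."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "cat sits on mat".split()
candidate = "mat cat sits".split()
print(rouge_n(candidate, reference, 1))  # 0.75
print(rouge_n(candidate, reference, 2))  # one of three reference bigrams matched
```

Note that word order does not affect ROUGE‑1 here, which is why ROUGE‑2 is reported alongside it.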

In practice, BLEU emphasizes precision (avoiding hallucination), while ROUGE emphasizes coverage (avoiding omission). Combining both gives a balanced view.

3 Using a judge LLM for automated scoring

Beyond fixed benchmarks, you can let another LLM act as a “referee”. The workflow consists of four steps:

Define scoring dimensions (e.g., relevance, accuracy, completeness) and a 1‑5 rubric.

Craft a structured prompt that includes the user question, the gold answer, the model answer, and the rubric.

Call the judge model (e.g., DeepSeek) with low temperature and request JSON output.

Aggregate scores (average, distribution) and generate a report.
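The final aggregation step can be as simple as a mean plus a histogram of the scores. The sketch below assumes integer scores on the 1‑5 rubric mentioned above; summarize is a hypothetical helper name.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List


def summarize(scores: List[int]) -> Dict[str, object]:
    """Return the count, mean, and per-score distribution of judge scores."""
    if not scores:
        raise ValueError("no scores to summarize")
    distribution = dict(sorted(Counter(scores).items()))
    return {"count": len(scores), "mean": mean(scores), "distribution": distribution}


print(summarize([5, 4, 4, 2]))
# {'count': 4, 'mean': 3.75, 'distribution': {2: 1, 4: 2, 5: 1}}
```

Reporting the distribution alongside the mean makes it visible when an acceptable average hides a cluster of very low scores.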

3.2 Prompt design tips

Clear task description : state that the model must compare the expected answer with the generated answer and assign scores per dimension.

Explicit scoring rules : avoid vague “good/bad”; provide concrete criteria (e.g., “Accuracy 5 pts: no factual errors”).

Few‑shot examples : include 1‑2 annotated examples to guide the judge.
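These three tips can be combined in a small prompt builder. The sketch below is one possible layout, not a prescribed template: the section labels, the wording of the task description, and the function name build_judge_prompt are all assumptions made for illustration.

```python
from typing import Dict, Mapping, Sequence


def build_judge_prompt(
    question: str,
    expected: str,
    generated: str,
    rubric: Mapping[str, str],
    examples: Sequence[Dict[str, object]] = (),
) -> str:
    """Assemble task description, scoring rules, few-shot cases, then the case to score."""
    parts = [
        "You are a strict grader. Compare the expected answer with the "
        "generated answer and assign a score for each dimension.",
        "[Scoring rules]",
    ]
    parts += [f"- {name}: {rule}" for name, rule in rubric.items()]
    for index, example in enumerate(examples, start=1):
        parts += [
            f"[Example {index}]",
            f"Question: {example['question']}",
            f"Expected answer: {example['expected']}",
            f"Generated answer: {example['generated']}",
            f"Scores: {example['scores']}",
        ]
    parts += [
        "[Case to score]",
        f"Question: {question}",
        f"Expected answer: {expected}",
        f"Generated answer: {generated}",
        "Return the scores as a JSON object.",
    ]
    return "\n".join(parts)
```

Building the prompt by joining strings, instead of calling str.format on a template, avoids clashes with literal braces in any JSON shown to the judge.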

Domain‑specific example (Traditional Chinese Medicine)

The author defines three dimensions: syndrome accuracy (0‑4), terminology correctness (0‑3), and medication safety (0‑3), for a maximum total of 10. The sample prompt includes detailed scoring rubrics and two illustrative cases (one high‑quality, one low‑quality). The code below calls DeepSeek and parses the JSON result:

import json
import os

import pandas as pd
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint, so the OpenAI client is used as-is.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

# Literal braces in the JSON template are doubled so str.format() leaves them intact.
PROMPT = """
You are an expert TCM specialist tasked with evaluating another model's answer.

[Evaluation Task]
Compare the "gold answer" with the "model answer" on three dimensions:
1. Syndrome accuracy (0-4)
2. Terminology correctness (0-3)
3. Medication safety (0-3)

[Scoring Rules]
... (rules omitted for brevity) ...

[Question]
{question}

[Gold answer]
{expected}

[Model answer]
{generated}

[Output format]
Return a single JSON object:
{{ "syndrome_score": int, "term_score": int, "safety_score": int, "total_score": int, "brief_reason": "..." }}
"""

def score(question, expected, generated):
    """Ask the judge model to score one answer and return the parsed JSON."""
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": PROMPT.format(question=question, expected=expected, generated=generated)}],
        temperature=0.1,  # low temperature keeps scoring stable across runs
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# Batch evaluation example: test.json holds records with fields q, exp, and gen
df = pd.read_json("test.json")
scores = [score(row["q"], row["exp"], row["gen"]) for _, row in df.iterrows()]
average = sum(s["total_score"] for s in scores) / len(scores)
print(f"Average total score: {average:.2f}")

This approach yields a multi‑dimensional, quantifiable assessment of generated answers.
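Judge models occasionally return scores outside the rubric or a total that does not match the parts, so it can be worth validating each reply before aggregating. The sketch below checks the three dimension scores against the ranges given above and recomputes the total instead of trusting the judge's arithmetic. validate_judgement and MAX_SCORES are hypothetical names introduced for this example.

```python
import json
from typing import Dict

# Upper bounds taken from the three dimensions described above.
MAX_SCORES: Dict[str, int] = {"syndrome_score": 4, "term_score": 3, "safety_score": 3}


def validate_judgement(raw: str) -> dict:
    """Parse a judge reply, check score ranges, and recompute total_score."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    for key, upper in MAX_SCORES.items():
        value = data.get(key)
        # bool is a subclass of int in Python, so it is rejected explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{key} must be an integer, got {value!r}")
        if not 0 <= value <= upper:
            raise ValueError(f"{key}={value} is outside 0-{upper}")
    data["total_score"] = sum(data[key] for key in MAX_SCORES)
    return data


reply = '{"syndrome_score": 3, "term_score": 2, "safety_score": 3, "total_score": 9, "brief_reason": "ok"}'
print(validate_judgement(reply)["total_score"])  # 8, recomputed from the parts
```

Replies that fail validation can be retried or logged separately so that they do not silently distort the average.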

Conclusion

The guide covered the full practical workflow for large‑model evaluation: using EvalScope for dataset‑based and custom evaluations, understanding BLEU and ROUGE, and building an automated judge LLM pipeline. The next article will demonstrate a complete end‑to‑end security‑domain LLM case study.


Written by

Fun with Large Models

Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
