EvalScope: The Ultimate Large‑Model Evaluation Framework You Control

This article introduces EvalScope, an open‑source framework for evaluating large language models. It covers the framework's architecture, built‑in benchmarks, and installation, then walks step by step through both performance stress testing and dataset‑based capability assessment, so that users can verify model quality for themselves instead of relying on official documentation.


EvalScope Overview

EvalScope is a framework for large‑model evaluation and performance benchmarking. It integrates ModelScope datasets and supports large language models, multimodal models, embedding models, reranker models, and AIGC models.

Core features

Built‑in support for benchmarks and metrics such as MMLU, CMMLU, C‑Eval, and GSM8K.

Supports a wide range of model types, including LLMs, multimodal models, embedding models, rerankers, CLIP, and text‑to‑image/video generation models.

Provides performance stress testing with throughput and latency measurements, extending capabilities beyond OpenCompass.
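
For quick checks, the stress tester is also exposed as a command‑line entry point. The sketch below is illustrative: the flag names mirror the Python configuration keys used later in this article, but treat them as assumptions and confirm with evalscope perf --help.

evalscope perf \
  --url https://api.siliconflow.cn/v1/chat/completions \
  --model Qwen/Qwen3-8B \
  --api openai \
  --dataset openqa \
  --parallel 1 \
  --number 10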

Architecture

Model Adapter: Converts model outputs to the format required by the framework; supports API‑based and locally deployed models.

Data Adapter: Transforms input data to match the requirements of different evaluation tasks.

Evaluation Backend options (a configuration sketch showing backend selection follows this list)

Native: Default backend with modes such as single‑model evaluation, arena mode, and baseline comparison.

OpenCompass: Wrapped version of OpenCompass for simplified task submission.

VLMEvalKit: Enables multimodal evaluation tasks.

ThirdParty: Supports additional backends such as ToolBench.

RAGEval: Provides RAG evaluation using MTEB/CMTEB for embedding and reranker models and RAGAS for end‑to‑end assessment.

Performance Evaluator: Measures inference‑service performance, conducts stress tests, and generates visual reports.

Evaluation Report: Aggregates results into a report for decision‑making and model optimization.

Visualization: Presents results in an intuitive UI for easy comparison.
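
To make the backend choice concrete, here is a minimal sketch of routing a task through a non‑default backend via the task configuration. It assumes TaskConfig exposes eval_backend and eval_config fields; the eval_config payload is backend‑specific, and the keys below are illustrative placeholders, so check the EvalScope docs for the exact schema.

from evalscope import TaskConfig, run_task

# Hedged sketch: select the OpenCompass backend instead of Native.
# The eval_config contents depend on the chosen backend; these keys
# are placeholders, not a verified schema.
task_cfg = TaskConfig(
    eval_backend='OpenCompass',
    eval_config={
        'datasets': ['gsm8k'],
        'models': [{'path': 'Qwen/Qwen3-8B'}],
    },
)
run_task(task_cfg=task_cfg)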

Installation

Create an isolated Conda environment and install EvalScope and optional backends:

conda create -n evalscope python=3.11
conda activate evalscope
pip install evalscope
# optional extras
pip install 'evalscope[opencompass]'
pip install 'evalscope[vlmeval]'
pip install 'evalscope[rag]'
pip install 'evalscope[perf]'
pip install 'evalscope[app]'
pip install 'evalscope[all]'
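
A quick sanity check that the package resolved correctly (this queries pip's own metadata, so it works regardless of EvalScope's internals):

pip show evalscope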

Stress‑testing example

Python code to benchmark Qwen3‑8B on SiliconFlow:

from evalscope.perf.main import run_perf_benchmark

def run_perf(parallel):
    task_cfg = {
        'url': 'https://api.siliconflow.cn/v1/chat/completions',  # OpenAI-compatible endpoint
        'parallel': parallel,            # number of concurrent requests
        'model': 'Qwen/Qwen3-8B',
        'number': 10,                    # total number of requests to send
        'api': 'openai',                 # API schema of the endpoint
        'dataset': 'openqa',             # prompt source for the benchmark
        'stream': False,
        'debug': False,
        'headers': {'Authorization': 'Bearer '},  # append your API key after 'Bearer '
        'connect_timeout': 6000,
        'read_timeout': 6000,
        'max_tokens': 512,               # cap on generated tokens per request
    }
    run_perf_benchmark(task_cfg)

run_perf(parallel=1)
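
Because the helper takes the concurrency level as a parameter, sweeping several levels for comparison is a one‑line loop; a small usage sketch (the values are illustrative):

for parallel in (1, 5, 10):
    run_perf(parallel)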

After execution, EvalScope downloads the selected dataset, sends the configured number of requests (10 here, at a concurrency of 1) to the Qwen3‑8B endpoint, and stores the results in the outputs directory. The most important result files are benchmark_summary.json and benchmark_percentile.json.
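
Both files are plain JSON, so they can be inspected programmatically. A minimal sketch follows; the outputs path layout is an assumption, so adjust the glob to your actual run directory, and note that the metric field names inside the file are not spelled out here.

import glob
import json

# Pick the newest benchmark_summary.json under outputs/ (path layout assumed).
paths = sorted(glob.glob('outputs/**/benchmark_summary.json', recursive=True))
with open(paths[-1]) as f:
    summary = json.load(f)
print(json.dumps(summary, indent=2, ensure_ascii=False))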

Capability evaluation example

Configuration for evaluating Qwen3‑8B on GSM8K and MMLU:

from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType

task_cfg = TaskConfig(
    model='Qwen/Qwen3-8B',
    api_url='https://api.siliconflow.cn/v1/chat/completions',
    api_key='',                  # fill in your API key
    eval_type=EvalType.SERVICE,  # evaluate a remote API service rather than a local model
    datasets=['gsm8k', 'mmlu'],  # benchmarks to run
    limit=2,                     # only the first 2 samples per dataset (quick smoke test)
    timeout=60000,
    stream=True,
)
run_task(task_cfg=task_cfg)

The run produces per‑dataset accuracy scores saved under the outputs directory; with limit=2, a score of 1.0 means both sampled questions were answered correctly, 0.5 means one of the two, and 0 means neither. The built‑in visualization app can be launched with evalscope app, which serves a UI at http://localhost:7860 showing overall scores and detailed per‑question predictions.
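
For completeness, launching the viewer is a single command (it requires the app extra installed earlier; 7860 is the default port, which may differ in your environment):

evalscope app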

Supported evaluation datasets

MMLU: 57 subjects covering math, physics, law, medicine, and more.

C‑Eval / CMMLU: Chinese‑language knowledge tests; C‑Eval spans 52 disciplines and CMMLU 67 subjects.

GSM8K: 8.5K grade‑school math word problems requiring step‑by‑step reasoning.

HumanEval: 164 programming tasks for code‑generation evaluation.

TruthfulQA: 817 questions designed to expose hallucinations.

GAOKAO‑Bench: Chinese college‑entrance exam questions for logical reasoning.
