EvalScope: The Ultimate Large‑Model Evaluation Framework You Control
This article introduces EvalScope, an open‑source framework for evaluating large language models. It covers the framework's architecture, built‑in benchmarks, and installation, then walks through two hands‑on workflows: performance stress testing and dataset‑based capability assessment, so you can verify model quality yourself rather than taking official benchmark numbers on faith.
EvalScope Overview
EvalScope is a framework for large‑model evaluation and performance benchmarking. It integrates ModelScope datasets and supports large language models, multimodal models, embedding models, reranker models, and AIGC models.
Core features
Built‑in support for benchmarks and metrics such as MMLU, CMMLU, C‑Eval, GSM8K, etc.
Supports a wide range of model types including LLMs, multimodal, embedding, reranker, CLIP, and image‑to‑text/video generation models.
Provides performance stress testing with throughput and latency measurements, extending capabilities beyond OpenCompass.
Architecture
Model Adapter: Converts model outputs to the format required by the framework; supports API‑based and locally deployed models.
Data Adapter: Transforms input data to match the requirements of different evaluation tasks.
Evaluation Backend options:
Native: Default backend with modes such as single‑model, arena, and baseline comparison.
OpenCompass: A wrapped version of OpenCompass for simplified task submission.
VLMEvalKit: Enables multimodal evaluation tasks.
ThirdParty: Supports additional backends such as ToolBench.
RAGEval: Provides RAG evaluation, using MTEB/CMTEB for embedding and reranker models and RAGAS for end‑to‑end assessment.
Performance Evaluator: Measures inference‑service performance, runs stress tests, and generates visual reports.
Evaluation Report: Aggregates results into a report for decision‑making and model optimization.
Visualization: Presents results in an intuitive UI for easy comparison.
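To make the backend choice concrete, here is a minimal configuration sketch. It assumes TaskConfig accepts an eval_backend field taking the backend names listed above; the exact accepted values and defaults may vary across EvalScope versions, so treat this as an illustration rather than a definitive API reference.

```python
from evalscope import TaskConfig, run_task

# Sketch only: routing an evaluation through a non-default backend.
# 'eval_backend' values shown are assumptions based on the backend
# names above; check your installed version's documentation.
task_cfg = TaskConfig(
    model='Qwen/Qwen3-8B',
    datasets=['gsm8k'],
    eval_backend='OpenCompass',  # or 'Native', 'VLMEvalKit', 'RAGEval'
)
# run_task(task_cfg=task_cfg)  # uncomment to actually launch the task
```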
Installation
Create an isolated Conda environment and install EvalScope and optional backends:
conda create -n evalscope python=3.11
conda activate evalscope
pip install evalscope
# optional extras
pip install 'evalscope[opencompass]'
pip install 'evalscope[vlmeval]'
pip install 'evalscope[rag]'
pip install 'evalscope[perf]'
pip install 'evalscope[app]'
pip install 'evalscope[all]'
Stress‑testing example
Python code to benchmark Qwen3‑8B on SiliconFlow:
from evalscope.perf.main import run_perf_benchmark

def run_perf(parallel):
    task_cfg = {
        'url': 'https://api.siliconflow.cn/v1/chat/completions',
        'parallel': parallel,
        'model': 'Qwen/Qwen3-8B',
        'number': 10,
        'api': 'openai',
        'dataset': 'openqa',
        'stream': False,
        'debug': False,
        'headers': {'Authorization': 'Bearer '},  # append your API key
        'connect_timeout': 6000,
        'read_timeout': 6000,
        'max_tokens': 512,
    }
    run_perf_benchmark(task_cfg)

run_perf(parallel=1)

After execution, EvalScope downloads the openqa dataset, issues the configured 10 requests ('number': 10) at the given concurrency ('parallel': 1) against the Qwen3‑8B endpoint, and stores results in the outputs directory. The key result files are benchmark_summary.json and benchmark_percentile.json.
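To sanity‑check a run programmatically, you can load benchmark_summary.json and compute a quick digest. The field names below are illustrative placeholders, not guaranteed EvalScope keys; adjust them to whatever your installed version actually writes to the file.

```python
import json

# Hypothetical excerpt of benchmark_summary.json; real key names
# in EvalScope's output may differ from these placeholders.
summary = {
    "Total requests": 10,
    "Succeed requests": 10,
    "Average latency (s)": 1.84,
    "Output token throughput (tok/s)": 278.3,
}

def report(s):
    """Condense a summary dict into a one-line health check."""
    ok = s["Succeed requests"] / s["Total requests"]
    return f"success rate {ok:.0%}, avg latency {s['Average latency (s)']}s"

# In practice: summary = json.load(open('outputs/.../benchmark_summary.json'))
print(report(summary))  # success rate 100%, avg latency 1.84s
```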
Capability evaluation example
Configuration for evaluating Qwen3‑8B on GSM8K and MMLU:
from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType

task_cfg = TaskConfig(
    model='Qwen/Qwen3-8B',
    api_url='https://api.siliconflow.cn/v1/chat/completions',
    api_key='',
    eval_type=EvalType.SERVICE,
    datasets=['gsm8k', 'mmlu'],
    limit=2,
    timeout=60000,
    stream=True,
)
run_task(task_cfg=task_cfg)

The run produces per‑dataset scores (1.0 = all correct, 0.5 = half correct, 0 = none correct), saved under outputs. The built‑in visualization app can be launched with evalscope app, which serves a UI at http://localhost:7860 showing overall scores and detailed predictions.
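The scoring rule described above is simply per‑sample accuracy. A minimal sketch (with made‑up correctness flags, not real EvalScope output) shows how limit=2 can only yield per‑dataset scores of 0, 0.5, or 1.0:

```python
# Sketch of how a per-dataset score is derived: the mean of binary
# per-sample correctness flags. Sample data here is illustrative only.
def dataset_score(correct_flags):
    """Accuracy over evaluated samples: 1.0 = all correct, 0.0 = none."""
    return sum(correct_flags) / len(correct_flags)

# With limit=2, exactly two samples per dataset are scored:
print(dataset_score([True, False]))  # 0.5
print(dataset_score([True, True]))   # 1.0
```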
Supported evaluation datasets
MMLU: 57 subjects covering math, physics, law, medicine, and more.
C‑Eval / CMMLU: Chinese‑language knowledge tests spanning dozens of subjects.
GSM8K: 8.5K grade‑school math word problems requiring step‑by‑step reasoning.
HumanEval: 164 programming tasks for code‑generation evaluation.
TruthfulQA: 817 questions designed to expose hallucinations.
GAOKAO‑Bench: Chinese college‑entrance exam questions for logical reasoning.
Fun with Large Models
A master's graduate from Beijing Institute of Technology with four papers in top journals, formerly a developer at ByteDance and Alibaba, now researching large models at a major state‑owned enterprise. Committed to sharing concise, practical experience in AI large‑model development, in the belief that large models will become as essential as PCs. Let's start experimenting now!