Evaluating Open-Source LLMs with Alibaba Cloud's Themis Judge Model
This guide explains how to use Alibaba Cloud's PAI platform and the Themis judge model to efficiently evaluate large language models on custom or public datasets, covering data preparation, task submission, result analysis, multi‑model comparison, and API integration.
Why LLM Evaluation Matters
In the era of large models, the rapid improvement of model performance makes evaluation increasingly critical. Traditional benchmarks such as MMLU, CMMLU, and GSM8K focus on deterministic questions, leaving open‑ended scenarios like chat assistants difficult to assess.
Judge‑Model Approach
To address this gap, the industry uses a stronger LLM as a judge to score other models on open‑ended tasks, providing a metric closer to human preferences. Alibaba Cloud's PAI model evaluation platform offers this capability through the Themis judge model, which is fine‑tuned from the Qwen series on extensive evaluation data. Themis matches GPT‑4 on many benchmarks and even outperforms it in certain scenarios. This feature is currently free for a limited time.
PAI Large‑Model Evaluation Platform Overview
The platform enables scientific, efficient evaluation, helping developers compare model performance, select the best model, and accelerate AI innovation. It supports two evaluation dimensions: custom datasets and public datasets.
Custom‑Dataset Evaluation
Rule‑based scoring using ROUGE, BLEU, etc.
Judge‑model scoring that rates each Q&A pair and aggregates results.
Public‑Dataset Evaluation
Runs models on various public datasets and reports standard industry metrics.
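To make the rule-based option concrete, here is a minimal sketch of one deterministic metric in that family, ROUGE‑1 recall (the fraction of unique reference tokens that appear in the candidate). It assumes simple whitespace tokenization and is only an illustration of reference-based scoring, not the platform's implementation:

```python
# Hand-rolled ROUGE-1 recall: fraction of unique reference tokens
# that also appear in the candidate. Illustrative only; production
# evaluation would use a full ROUGE/BLEU implementation.

def rouge1_recall(candidate: str, reference: str) -> float:
    cand_tokens = candidate.split()
    ref_tokens = set(reference.split())
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for tok in ref_tokens if tok in cand_tokens)
    return overlap / len(ref_tokens)

print(rouge1_recall("the cat sat on the mat", "the cat is on the mat"))  # → 0.8
```

Metrics like this are cheap and reproducible, but they only measure surface overlap, which is exactly why open‑ended answers additionally need a judge model.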
Practical Development Steps
1. Data Preparation
The judge‑model evaluation requires a JSONL file where each line is a dictionary with a question field and an optional answer field.
{"question": "在守望先锋中,碰到外挂该怎么办?", "answer": "如果在游戏中遇到使用作弊工具或外挂的玩家,你可以使用内置的举报系统来报告可疑行为。"}
You can optionally upload the file to OSS and create a dataset (links provided in the original text).
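A short sketch of producing such a file in Python, writing one JSON object per line as the JSONL format requires (the second sample shows that the answer field may be omitted; the output filename is arbitrary):

```python
import json

# Build an evaluation file in JSONL format: one JSON object per line,
# each with a "question" field and an optional "answer" field.
samples = [
    {"question": "在守望先锋中,碰到外挂该怎么办?",
     "answer": "如果在游戏中遇到使用作弊工具或外挂的玩家,你可以使用内置的举报系统来报告可疑行为。"},
    {"question": "如何举报可疑行为?"},  # "answer" omitted: also valid
]

with open("eval_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        # ensure_ascii=False keeps Chinese text readable in the file
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```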
2. Submit a Judge‑Model Evaluation Task
Access the evaluation page via two routes: PAI console → Model Gallery → LLM model card → Evaluate, or the fine‑tuning task detail page → Evaluate. Switch to expert mode, fill in task details, select the judge‑model option, agree to the free service (which automatically provides a token), choose or upload a custom dataset, select resources, and configure inference hyper‑parameters.
3. Evaluation Result Analysis
After completion, the platform displays metrics such as Mean, Median, Standard Deviation, and Skewness, along with detailed per‑item scoring reasons.
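These aggregates are standard descriptive statistics over the per-item judge scores. A sketch of computing them yourself (the score list is illustrative, not real platform output; the skewness formula is the common adjusted Fisher–Pearson sample skewness, which may differ slightly from the platform's definition):

```python
import statistics

# Illustrative per-item judge scores (not real output).
scores = [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0]

mean = statistics.mean(scores)
median = statistics.median(scores)
stdev = statistics.stdev(scores)  # sample standard deviation

# Adjusted Fisher-Pearson sample skewness: n/((n-1)(n-2)) * sum(z_i^3),
# positive when the score distribution has a longer right tail.
n = len(scores)
skew = (n / ((n - 1) * (n - 2))) * sum(((x - mean) / stdev) ** 3 for x in scores)

print(f"mean={mean:.3f} median={median} stdev={stdev:.3f} skew={skew:.3f}")
```

A strongly negative skew, for example, suggests most answers score well with a tail of badly judged outliers worth inspecting via the per-item scoring reasons.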
4. Multi‑Task Comparison
Select multiple evaluation tasks in the Model Gallery → Task Management page and click “Compare” to view side‑by‑side results.
5. Themis Model API Call
The model can be tried online at modelscope.cn/studios/PAI/PAI-Themis or invoked via HTTP. Example using curl:
$ curl -X POST http://ai-service.example.com/v1/chat/completions \
-H "Authorization: Bearer ${THEMIS_TOKEN}" -H "Content-Type: application/json" \
-d '{"model":"themis-turbo","messages":[{"role":"user","content":[{"mode":"single","type":"json","json":{"question":"9.9和9.11哪个大?","answer":"..."}}]}],"temperature":0.2}'
The response includes a comprehensive score and a detailed justification.
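The same call can be made from Python. This sketch mirrors the curl example above using only the standard library; the endpoint, model name, and payload shape are copied from that example and should be replaced with your actual deployment's values:

```python
import json
import os
import urllib.request

# Placeholder endpoint from the curl example; substitute your own.
ENDPOINT = "http://ai-service.example.com/v1/chat/completions"

def build_payload(question: str, answer: str) -> dict:
    """Wrap a Q&A pair in the message format shown in the curl example."""
    return {
        "model": "themis-turbo",
        "messages": [{
            "role": "user",
            "content": [{
                "mode": "single",
                "type": "json",
                "json": {"question": question, "answer": answer},
            }],
        }],
        "temperature": 0.2,
    }

def judge(question: str, answer: str) -> dict:
    """POST the payload with the bearer token and return the parsed JSON reply."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(question, answer)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['THEMIS_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```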
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
