How to Leverage TLM Platform for Comprehensive Large Model Evaluation
This guide explains how to use the TianJi Large Model (TLM) platform to create evaluation tasks, choose effectiveness or performance modes, work with built‑in datasets, interpret detailed reports, and understand the underlying metrics and judge‑model techniques for large‑model assessment.
1. Introduction to TLM Platform
The TianJi Large Model (TLM) development platform integrates recent AI technologies and offers a model marketplace, a data marketplace, fine‑tuning, deployment, and model evaluation, providing a full LLM‑Ops solution for building industry‑specific models on top of general‑purpose LLMs.
2. Prerequisites
1. A Zhihui Cloud account (register and complete real‑name verification if you do not have one).
2. Access to the TLM product.
3. An existing resource group (create one if needed).
4. Appropriate permissions for the account (resource‑group management or TLM management), as described in the Zhihui Cloud help documentation.
3. Using Model Evaluation in TLM
3.1 Evaluation Tasks
Users can create evaluation tasks directly in the platform. The model to be evaluated may be a service deployed on the platform or an external model service. Both single‑model evaluation and pairwise comparison of two model services are supported.
3.2 Evaluation Modes
The platform provides two modes: Effectiveness evaluation and Performance evaluation.
Effectiveness Evaluation
Measures capabilities such as comprehensive ability, knowledge, comprehension, reasoning, coding, mathematics, safety, and ethics using well‑known datasets such as C‑Eval, MMLU, CMMLU, and GSM8K. Each dimension is scored from 0 to 100, and a judge model can be used for scoring.
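To make the 0‑100 dimension score concrete, here is a minimal sketch of scoring a math dimension by exact‑match accuracy, in the spirit of GSM8K grading. The record layout and the answer‑extraction rule are illustrative assumptions, not the platform's internal format.

```python
# Minimal sketch: derive a 0-100 dimension score from exact-match accuracy
# on a math dataset such as GSM8K. Record layout and answer extraction are
# illustrative assumptions, not TLM internals.
import re

def extract_final_answer(text: str) -> str:
    """Take the last number in the model output as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else ""

def dimension_score(samples: list[dict]) -> float:
    """Score a dimension 0-100 as the percentage of exact matches.

    Each sample is assumed to look like:
        {"prediction": "<model output>", "reference": "18"}
    """
    if not samples:
        return 0.0
    correct = sum(
        extract_final_answer(s["prediction"]) == s["reference"].strip()
        for s in samples
    )
    return 100.0 * correct / len(samples)
```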
Performance Evaluation
Metrics include time to first token (TTFT), time per output token (TPOT), output tokens per second (OTPS), request latency, input token count, output token count, and overall throughput, all evaluated under varying concurrency levels.
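As a rough illustration of how these metrics relate to one another, the sketch below computes them from per‑request timing records. The dataclass and field names are assumptions made for illustration; the platform's own bookkeeping may differ.

```python
# Sketch of the metric arithmetic for one benchmark run. Field names and the
# dataclass are illustrative; the platform's internal bookkeeping may differ.
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float         # time the request was sent (seconds)
    first_token: float   # time the first output token arrived
    end: float           # time the last output token arrived
    output_tokens: int   # number of generated tokens

def summarize(timings: list[RequestTiming]) -> dict:
    if not timings:
        return {}
    n = len(timings)
    total_tokens = sum(t.output_tokens for t in timings)
    wall_clock = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        # Time to first token (TTFT): delay before the first token appears.
        "ttft_avg": sum(t.first_token - t.start for t in timings) / n,
        # Time per output token (TPOT): decoding time spread over later tokens.
        "tpot_avg": sum(
            (t.end - t.first_token) / max(t.output_tokens - 1, 1) for t in timings
        ) / n,
        # Per-request output tokens per second (OTPS).
        "otps_avg": sum(t.output_tokens / (t.end - t.start) for t in timings) / n,
        # End-to-end request latency.
        "latency_avg": sum(t.end - t.start for t in timings) / n,
        # Overall throughput: all generated tokens over the run's wall-clock time.
        "total_throughput": total_tokens / wall_clock,
    }
```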
3.3 Evaluation Datasets
TLM includes the OpenCompass dataset collection, covering roughly 50 datasets across the same capability dimensions.
3.4 Evaluation Reports
After a task completes, the platform generates a report. Effectiveness reports display scores per dimension and sample data; performance reports present tables and charts (line and bar) for each metric under different concurrency settings.
4. Large‑Model Evaluation Techniques
4.1 Evaluation Scope
Evaluation assesses whether a model’s output meets expectations, thereby measuring its abilities. Currently, TLM supports language‑model evaluation (vision and multimodal models are planned).
4.2 Evaluation Types
Benchmark lists include OpenCompass, the ReLE Chinese LLM benchmark, and domain‑specific lists such as LiveCodeBench (for code reasoning) and C‑Eval.
4.3 Effectiveness Evaluation Principles
Two approaches are used: rule‑based scoring with predefined metrics (e.g., ROUGE, BLEU) and judge‑model scoring for tasks where n‑gram metrics are insufficient, such as sentence or paragraph generation.
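For intuition, here is a stripped‑down sketch of the two n‑gram metrics. Real ROUGE and BLEU implementations add details such as count clipping, multiple n‑gram orders, smoothing, and BLEU's brevity penalty; this only shows the recall‑versus‑precision distinction expanded on in Section 4.5.

```python
# Simplified view of rule-based n-gram scoring. Full ROUGE/BLEU add clipping,
# multiple n-gram orders, smoothing, and a brevity penalty; this sketch only
# shows where recall and precision differ.
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())
    return overlap / max(sum(ref.values()), 1)    # recall: divide by reference n-grams

def bleu_n_precision(reference: str, candidate: str, n: int = 1) -> float:
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())
    return overlap / max(sum(cand.values()), 1)   # precision: divide by candidate n-grams
```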
4.4 Performance Evaluation Principles
Concurrency is controlled via asyncio.Semaphore. Metrics such as TTFT, TPOT, request latency, output throughput, and total throughput are then calculated from the timing of each request, following the definitions in Section 3.2.
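A minimal sketch of that concurrency control is shown below. The send_request coroutine is a placeholder for whatever client actually streams tokens from the service under test and records its timing.

```python
# Minimal sketch of bounding concurrency with asyncio.Semaphore. The
# send_request coroutine is a placeholder for the real streaming client.
import asyncio

async def send_request(prompt: str) -> dict:
    ...  # stream tokens from the service, record start/first-token/end times

async def run_benchmark(prompts: list[str], concurrency: int) -> list:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(prompt: str):
        async with semaphore:          # at most `concurrency` requests in flight
            return await send_request(prompt)

    return await asyncio.gather(*(bounded(p) for p in prompts))

# results = asyncio.run(run_benchmark(prompts, concurrency=8))
```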
4.5 Metric Details
Code‑related metrics evaluate generated code against test cases by concatenating the code and test strings, executing the result, and checking for errors. ROUGE‑N measures n‑gram recall against the reference text, while BLEU measures n‑gram precision of the generated text.
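The following sketch illustrates the code metric under the assumption that each generated solution plus its tests can run as a standalone script; a production evaluator would need proper sandboxing, which is omitted here.

```python
# Rough illustration of the code metric: concatenate the generated code with
# its test cases, run the result, and count it as passed if nothing raised.
# Running in a subprocess with a timeout is an assumption here; a real
# evaluator would need stronger sandboxing.
import subprocess
import sys

def passes_tests(generated_code: str, test_code: str, timeout: float = 10.0) -> bool:
    program = generated_code + "\n\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0    # non-zero exit means an assertion or error
```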
4.6 Judge Model Evaluation
Effectiveness evaluation can also employ a judge model to score outputs when traditional metrics are inadequate.
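A judge‑model call might look like the sketch below, which sends the question, reference answer, and model answer to an OpenAI‑compatible chat endpoint and asks for a 0‑100 score. The prompt wording, endpoint, and model name are illustrative assumptions, not the platform's actual judge configuration.

```python
# Sketch of judge-model scoring through an OpenAI-compatible chat endpoint.
# The prompt wording, endpoint URL, and model name are illustrative choices,
# not the platform's actual judge configuration.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="...")  # placeholder endpoint

JUDGE_PROMPT = """You are grading a model's answer.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with a single integer from 0 to 100 for correctness and completeness."""

def judge_score(question: str, reference: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="judge-model",    # placeholder judge model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```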
