How to Leverage TLM Platform for Comprehensive Large Model Evaluation
This guide explains how to use the TianJi Large Model (TLM) platform to create evaluation tasks, choose effectiveness or performance modes, work with built‑in datasets, interpret detailed reports, and understand the underlying metrics and judge‑model techniques for large‑model assessment.
1. Introduction to TLM Platform
The TianJi Large Model (TLM) development platform integrates recent AI technologies and offers a model marketplace, a data marketplace, fine‑tuning, deployment, and model evaluation, providing a full LLM‑Ops solution for building industry‑specific models on top of general‑purpose LLMs.
2. Prerequisites
1. A Zhihui Cloud account (register and complete real‑name verification if you do not have one).
2. Access to the TLM product.
3. An existing resource group (create one if needed).
4. Appropriate permissions for the account (resource‑group management or TLM management), as described in the Zhihui Cloud help documentation.
3. Using Model Evaluation in TLM
3.1 Evaluation Tasks
Users can create evaluation tasks directly in the platform. The model to be evaluated may be a service deployed on the platform or an external model service. Both single‑model evaluation and pairwise comparison of two model services are supported.
3.2 Evaluation Modes
The platform provides two modes: Effectiveness evaluation and Performance evaluation.
Effectiveness Evaluation
Measures capabilities such as comprehensive ability, knowledge, comprehension, reasoning, coding, mathematics, safety, and ethics using well‑known datasets such as C‑Eval, MMLU, CMMLU, and GSM8K. Each dimension is scored from 0 to 100, and a judge model can be used for scoring.
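To make the 0‑100 dimension score concrete, here is a minimal sketch of scoring a math dimension by exact‑match accuracy, in the spirit of GSM8K grading. The record layout and the answer‑extraction rule are illustrative assumptions, not the platform's internal format.

```python
# Minimal sketch: derive a 0-100 dimension score from exact-match accuracy
# on a math dataset such as GSM8K. Record layout and answer extraction are
# illustrative assumptions, not TLM internals.
import re

def extract_final_answer(text: str) -> str:
    """Take the last number in the model output as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else ""

def dimension_score(samples: list[dict]) -> float:
    """Score a dimension 0-100 as the percentage of exact matches.

    Each sample is assumed to look like:
        {"prediction": "<model output>", "reference": "18"}
    """
    if not samples:
        return 0.0
    correct = sum(
        extract_final_answer(s["prediction"]) == s["reference"].strip()
        for s in samples
    )
    return 100.0 * correct / len(samples)
```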
Performance Evaluation
Metrics include time to first token (TTFT), time per output token (TPOT), output tokens per second (OTPS), request latency, input token count, output token count, and overall throughput, all evaluated under varying concurrency levels.
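As a rough illustration of how these metrics relate to one another, the sketch below computes them from per‑request timing records. The dataclass and field names are assumptions made for illustration; the platform's own bookkeeping may differ.

```python
# Sketch of the metric arithmetic for one benchmark run. Field names and the
# dataclass are illustrative; the platform's internal bookkeeping may differ.
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float         # time the request was sent (seconds)
    first_token: float   # time the first output token arrived
    end: float           # time the last output token arrived
    output_tokens: int   # number of generated tokens

def summarize(timings: list[RequestTiming]) -> dict:
    if not timings:
        return {}
    n = len(timings)
    total_tokens = sum(t.output_tokens for t in timings)
    wall_clock = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        # Time to first token (TTFT): delay before the first token appears.
        "ttft_avg": sum(t.first_token - t.start for t in timings) / n,
        # Time per output token (TPOT): decoding time spread over later tokens.
        "tpot_avg": sum(
            (t.end - t.first_token) / max(t.output_tokens - 1, 1) for t in timings
        ) / n,
        # Per-request output tokens per second (OTPS).
        "otps_avg": sum(t.output_tokens / (t.end - t.start) for t in timings) / n,
        # End-to-end request latency.
        "latency_avg": sum(t.end - t.start for t in timings) / n,
        # Overall throughput: all generated tokens over the run's wall-clock time.
        "total_throughput": total_tokens / wall_clock,
    }
```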
3.3 Evaluation Datasets
TLM includes the OpenCompass dataset collection, covering roughly 50 datasets across the same capability dimensions.
3.4 Evaluation Reports
After a task completes, the platform generates a report. Effectiveness reports display scores per dimension and sample data; performance reports present tables and charts (line and bar) for each metric under different concurrency settings.
4. Large‑Model Evaluation Techniques
4.1 Evaluation Scope
Evaluation assesses whether a model’s output meets expectations, thereby measuring its abilities. Currently, TLM supports language‑model evaluation (vision and multimodal models are planned).
4.2 Evaluation Types
Benchmark lists include OpenCompass, the ReLE Chinese LLM benchmark, and domain‑specific lists such as LiveCodeBench (for code reasoning) and C‑Eval.
4.3 Effectiveness Evaluation Principles
Two approaches are used: rule‑based scoring with predefined metrics (e.g., ROUGE, BLEU) and judge‑model scoring for tasks where n‑gram metrics are insufficient, such as sentence or paragraph generation.
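For intuition, here is a stripped‑down sketch of the two n‑gram metrics. Real ROUGE and BLEU implementations add details such as count clipping, multiple n‑gram orders, smoothing, and BLEU's brevity penalty; this only shows the recall‑versus‑precision distinction expanded on in Section 4.5.

```python
# Simplified view of rule-based n-gram scoring. Full ROUGE/BLEU add clipping,
# multiple n-gram orders, smoothing, and a brevity penalty; this sketch only
# shows where recall and precision differ.
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())
    return overlap / max(sum(ref.values()), 1)    # recall: divide by reference n-grams

def bleu_n_precision(reference: str, candidate: str, n: int = 1) -> float:
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())
    return overlap / max(sum(cand.values()), 1)   # precision: divide by candidate n-grams
```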
4.4 Performance Evaluation Principles
Concurrency is controlled via asyncio.Semaphore. Metrics such as TTFT, TPOT, request latency, output throughput, and total throughput are then calculated from the timing of each request, following the definitions in Section 3.2.
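A minimal sketch of that concurrency control is shown below. The send_request coroutine is a placeholder for whatever client actually streams tokens from the service under test and records its timing.

```python
# Minimal sketch of bounding concurrency with asyncio.Semaphore. The
# send_request coroutine is a placeholder for the real streaming client.
import asyncio

async def send_request(prompt: str) -> dict:
    ...  # stream tokens from the service, record start/first-token/end times

async def run_benchmark(prompts: list[str], concurrency: int) -> list:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(prompt: str):
        async with semaphore:          # at most `concurrency` requests in flight
            return await send_request(prompt)

    return await asyncio.gather(*(bounded(p) for p in prompts))

# results = asyncio.run(run_benchmark(prompts, concurrency=8))
```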
4.5 Metric Details
Code‑related metrics evaluate generated code against test cases by concatenating the code and test strings, executing the result, and checking for errors. ROUGE‑N measures n‑gram recall against the reference text, while BLEU measures n‑gram precision of the generated text.
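The following sketch illustrates the code metric under the assumption that each generated solution plus its tests can run as a standalone script; a production evaluator would need proper sandboxing, which is omitted here.

```python
# Rough illustration of the code metric: concatenate the generated code with
# its test cases, run the result, and count it as passed if nothing raised.
# Running in a subprocess with a timeout is an assumption here; a real
# evaluator would need stronger sandboxing.
import subprocess
import sys

def passes_tests(generated_code: str, test_code: str, timeout: float = 10.0) -> bool:
    program = generated_code + "\n\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0    # non-zero exit means an assertion or error
```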
4.6 Judge Model Evaluation
Effectiveness evaluation can also employ a judge model to score outputs when traditional metrics are inadequate.
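A judge‑model call might look like the sketch below, which sends the question, reference answer, and model answer to an OpenAI‑compatible chat endpoint and asks for a 0‑100 score. The prompt wording, endpoint, and model name are illustrative assumptions, not the platform's actual judge configuration.

```python
# Sketch of judge-model scoring through an OpenAI-compatible chat endpoint.
# The prompt wording, endpoint URL, and model name are illustrative choices,
# not the platform's actual judge configuration.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="...")  # placeholder endpoint

JUDGE_PROMPT = """You are grading a model's answer.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with a single integer from 0 to 100 for correctness and completeness."""

def judge_score(question: str, reference: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="judge-model",    # placeholder judge model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```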
