HeartBench: Building the First Chinese AI Humanization Benchmark

This article details the creation of HeartBench, a Chinese benchmark for evaluating large language models' emotional and social intelligence, describing its background, design principles, data pipeline, evaluation methods, multi‑stage versioning, blind‑test validation, and lessons for building transferable AI assessment frameworks.


Background

By 2025, the scaling of large language models (LLMs) is showing diminishing returns, prompting a shift from purely compute-driven development to definition-driven AI. Models excel at knowledge and reasoning but still lack robust emotional, social, and cultural understanding, especially in non-English contexts. This motivates a systematic way to define and measure the "human-like" abilities of LLMs.

HeartBench Overview

HeartBench is an open‑source benchmark that evaluates LLMs on psychological and social‑science abilities. It targets five primary dimensions—personality, emotion, social interaction, morality, and motivation—sub‑divided into fifteen fine‑grained abilities. The current pool contains 1,126 curated questions covering 33 real‑world scenarios; 296 of them are released publicly. Rubrics were created from 10,772 raw items and refined to 2,818 actionable scoring criteria.

Evaluation Framework

Goal

The goal is to assess whether a model truly understands humans rather than merely appearing human, focusing on universally positive traits such as high emotional intelligence, deep empathy, and appropriate boundaries.

Principles

Real-world alignment: Scores must reflect impact in realistic situations (e.g., a joke that is funny in one culture may be offensive in another).

Consistency: Model scores should align with human judgments across diverse contexts.

Challenge: Tasks must be difficult enough to expose model limits.

Systematic coverage: The benchmark should comprehensively cover the defined abilities without blind spots.

Diversity: Content spans multiple disciplines, difficulty levels, and cultural backgrounds.

Dimensions

The dimensions are derived from psychology: nine undergraduate psychology students mapped more than 5,000 hours of clinical experience and more than 100 hours of AI-user interaction into a hierarchy of five primary and fifteen secondary abilities.

Question Types and Scoring

Multiple-choice: Choose the correct answer (e.g., MMLU, HellaSwag). Scored by accuracy.

Open-ended static: Single-turn responses evaluated with rubrics (e.g., HealthBench, MultiChallenge).

Open-ended dynamic: Multi-turn dialogues scored with rubrics (e.g., PersonaLens).

Pairwise comparison: Judges compare two model outputs (e.g., LitBench, SuperCLUE) and report a win rate.

Rating/ordering: Models rate the emotional intensity of multiple emotions (e.g., EQBench) and are scored by normalized averages. (How each type's score could be aggregated is sketched below.)
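Each scheme reduces to a simple aggregation. Below is a minimal Python sketch of one way to compute each type of score; the article does not specify HeartBench's exact weighting or normalization, so the formulas here are illustrative assumptions.

```python
# Illustrative aggregation helpers for the four scoring schemes above.
# Weighting and normalization choices are assumptions, not HeartBench's published formulas.
from statistics import mean


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Multiple-choice: fraction of questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def rubric_score(criteria_met: list[bool], weights: list[float] | None = None) -> float:
    """Open-ended (static or dynamic): weighted share of rubric criteria satisfied."""
    weights = weights or [1.0] * len(criteria_met)
    return sum(w for met, w in zip(criteria_met, weights) if met) / sum(weights)


def win_rate(outcomes: list[str]) -> float:
    """Pairwise comparison: wins count 1, ties 0.5, losses 0."""
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return mean(points[o] for o in outcomes)


def rating_score(model_ratings: list[float], reference: list[float]) -> float:
    """Rating/ordering: closeness of the model's intensity ratings to a reference
    profile after rescaling both to [0, 1] (one plausible normalization)."""
    def rescale(xs: list[float]) -> list[float]:
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    return 1.0 - mean(abs(a - b) for a, b in zip(rescale(model_ratings), rescale(reference)))
```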

Human Blind‑Test Validation

Because open-ended social-science tasks lack objective answers, a double-blind human evaluation was performed: 40% of the dataset was sampled and rated by more than 20 psychology experts across 14 mainstream LLMs. Each question received three independent ratings, and a majority vote (at least two in agreement) defined the human consensus. The LLM-as-judge system reached 86% agreement with this consensus.
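A minimal sketch of the consensus and agreement computation described above; the rating labels and data layout are assumptions, since the article only states the three-rater majority rule and the 86% agreement figure.

```python
# Majority-vote consensus from three expert ratings, and judge-human agreement.
from collections import Counter


def human_consensus(ratings: list[str]) -> str | None:
    """Return the label chosen by at least two of the three expert raters, else None."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= 2 else None


def judge_agreement(expert_ratings: list[list[str]], judge_labels: list[str]) -> float:
    """Share of items where the LLM judge matches the human consensus.
    Items without a consensus are excluded from the denominator."""
    matched = total = 0
    for ratings, judge in zip(expert_ratings, judge_labels):
        consensus = human_consensus(ratings)
        if consensus is None:
            continue
        total += 1
        matched += int(judge == consensus)
    return matched / total if total else 0.0


# Example: two items, each rated by three experts, compared with the judge's labels.
experts = [["pass", "pass", "fail"], ["fail", "fail", "fail"]]
judge = ["pass", "pass"]
print(judge_agreement(experts, judge))  # 0.5
```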

Iterative Development (V0.1 → V1.0)

V0.1 – Exploration: Static multi-turn scripts yielded less than 20% discrimination, and rubrics were inconsistent (expert agreement 36%).

V0.2 – Small-sample consensus: Manual authoring from real counseling dialogues raised both discrimination and consistency.

V0.5 – Scale-up via human-machine collaboration: Prompt-driven LLM generation followed by expert refinement produced 1,126 items, but quality gaps remained.

V1.0 – Refined filtering: Automated compliance checks and expert re-review reduced the set to roughly 560 items and, after further legal and ethical review, to the final 296 high-quality questions.

Final Construction & Evaluation Pipeline

Data source: Collect raw text from web pages, authored dialogues, and books; clean, label, and cluster it to create structured material.

Construction pipeline (an end-to-end skeleton is sketched after this list):

LLM‑assisted rewriting for privacy and fluency.

Expert review and rubric generation.

Problem synthesis from refined material.

LLM and human answer generation.

Iterative rubric refinement based on answers.
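The five steps above can be read as a small data pipeline. The skeleton below mirrors that flow; every function name is hypothetical and each stage body is a placeholder, since the article does not publish the actual prompts or tooling.

```python
# Structural skeleton of the five construction steps; names and bodies are placeholders.
from dataclasses import dataclass, field


@dataclass
class BenchmarkItem:
    material: str                                      # cleaned, clustered source material
    question: str = ""                                 # synthesized problem statement
    rubric: list[str] = field(default_factory=list)    # actionable scoring criteria
    answers: list[str] = field(default_factory=list)   # LLM and human reference answers


def llm_rewrite(text: str) -> str:
    # Step 1: LLM-assisted rewrite for privacy and fluency (placeholder).
    return text

def review_and_draft_rubric(text: str) -> list[str]:
    # Step 2: expert review and initial rubric generation (placeholder).
    return ["acknowledges the user's stated emotion", "keeps an appropriate boundary"]

def synthesize_question(text: str) -> str:
    # Step 3: problem synthesis from the refined material (placeholder).
    return f"How should an assistant respond to the following situation?\n{text}"

def generate_answers(question: str) -> list[str]:
    # Step 4: collect candidate answers from LLMs and human writers (placeholder).
    return ["candidate answer"]

def refine_rubric(rubric: list[str], answers: list[str]) -> list[str]:
    # Step 5: tighten criteria that real answers reveal to be ambiguous (placeholder).
    return rubric


def build_item(raw_material: str, refinement_rounds: int = 2) -> BenchmarkItem:
    """Run the five stages in order, looping over answer generation and rubric refinement."""
    item = BenchmarkItem(material=llm_rewrite(raw_material))
    item.rubric = review_and_draft_rubric(item.material)
    item.question = synthesize_question(item.material)
    for _ in range(refinement_rounds):
        item.answers = generate_answers(item.question)
        item.rubric = refine_rubric(item.rubric, item.answers)
    return item
```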

Evaluation system:

Apply case‑specific rubrics.

Separate normal and hard difficulty sets.

Use an LLM‑as‑judge for large‑scale scoring.

Validate with periodic human blind-tests (at least 75% agreement with experts required; see the judge sketch below).
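A hedged sketch of the LLM-as-judge step and the blind-test gate. It assumes an OpenAI-compatible chat API, a specific judge model, and a MET/NOT_MET verdict format; the real judge prompts, model, and parsing are not disclosed in this article.

```python
# LLM-as-judge scoring against case-specific rubrics, plus the release gate.
# Assumes the openai Python client (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a model response against one rubric criterion.\n"
    "Question:\n{question}\n\nResponse:\n{response}\n\n"
    "Criterion: {criterion}\n"
    "Answer with exactly MET or NOT_MET."
)


def judge_response(question: str, response: str, rubric: list[str],
                   model: str = "gpt-4o-mini") -> float:
    """Score one response as the fraction of rubric criteria the judge marks MET.
    The model name is an assumption for illustration."""
    met = 0
    for criterion in rubric:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                question=question, response=response, criterion=criterion)}],
            temperature=0,
        )
        verdict = reply.choices[0].message.content.strip().upper()
        met += verdict.startswith("MET")
    return met / len(rubric)


def blind_test_gate(agreement_with_experts: float, threshold: float = 0.75) -> bool:
    """Release gate: the automated judge must agree with the expert blind-test
    consensus on at least 75% of sampled items."""
    return agreement_with_experts >= threshold
```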

Transferable Methodology

A six‑step, repeatable workflow for building scientific benchmarks in any domain:

Domain research: Survey 20–30 papers, then deep-dive into 5–10 core works to build a knowledge map.

Framework design: Define what to evaluate, how, and who will evaluate; co-create rubrics with experts.

Seed data collection: Gather raw sources; filter for compliance, representativeness, and evaluative power.

Small-scale pilot: Run quick experiments to validate difficulty and consistency.

Scaling: Combine LLM-generated drafts with expert refinement; enforce quality gates.

Effectiveness validation: Ensure metrics are explainable, reproducible, and verified by human blind-tests before release.

Expert Annotation Platform Insights

Ability tiering & intelligent matching: Profile experts by domain, past quality, and specialty; auto-assign tasks based on difficulty and expertise (a matching sketch follows this list).

Incentive structures: Piece-rate pay, certification (e.g., "AI Evaluation Expert"), co-authorship on papers, and community events.

Quality assurance: Pre-task calibration tests, dynamic cross-validation (three-person overlap), and expert consultation for divergent cases.

Project management tools: Visual dashboards for progress, quality trends, and risk alerts; flexible task routing; integrated chat and voting linked to specific items.

Knowledge capture: Structured case libraries, searchable FAQs, and rubric versioning with change logs.
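A simplified sketch of ability tiering, intelligent matching, and three-person overlap assignment. The profile fields, scoring weights, and overlap size are assumptions inferred from the description above, not the platform's actual logic.

```python
# Toy expert-task matching with three-person overlap for cross-validation.
from dataclasses import dataclass


@dataclass
class Expert:
    name: str
    domains: set[str]      # specialties, e.g. {"emotion", "morality"}
    past_quality: float    # 0-1 historical acceptance rate


@dataclass
class Task:
    item_id: str
    domain: str
    difficulty: float      # 0-1; harder tasks demand stronger experts


def match_score(expert: Expert, task: Task) -> float:
    """Prefer in-domain experts; weight past quality more heavily on harder tasks."""
    domain_fit = 1.0 if task.domain in expert.domains else 0.3
    return domain_fit * (0.5 + 0.5 * task.difficulty * expert.past_quality)


def assign(task: Task, experts: list[Expert], overlap: int = 3) -> list[Expert]:
    """Dynamic cross-validation: route each item to the top-N matching experts."""
    return sorted(experts, key=lambda e: match_score(e, task), reverse=True)[:overlap]


experts = [
    Expert("A", {"emotion"}, 0.92),
    Expert("B", {"morality"}, 0.88),
    Expert("C", {"emotion", "social"}, 0.80),
    Expert("D", {"personality"}, 0.95),
]
task = Task("q-017", "emotion", difficulty=0.7)
print([e.name for e in assign(task, experts)])  # ['A', 'C', 'D']
```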

Resources

GitHub repository: https://github.com/inclusionAI/HeartBench

Paper (arXiv): https://arxiv.org/abs/2512.21849
Written by Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.
