Tagged articles

human evaluation

3 articles · Page 1 of 1

Machine Learning Algorithms & Natural Language Processing

Jul 9, 2026 · Artificial Intelligence

OpenAI Unveils GPT‑Live: Toward Human‑like Real‑time Voice Conversations

OpenAI’s new GPT‑Live combines a full‑duplex architecture with a mechanism that delegates to the latest GPT‑5.5 backend. It enables continuous, natural‑sounding voice interactions and outperforms the previous Advanced Voice Mode in human‑centered evaluations across preference, turn‑taking, and interruption metrics.

ChatGPT Voice · GPT‑Live · delegation

9 min read

ThinkingAgent

Jun 16, 2026 · Artificial Intelligence

A Systematic Approach to AI Evaluation: From Benchmarks to Real‑World Scenarios

This article outlines a comprehensive methodology for evaluating large language models, covering classic benchmarks, human and multimodal assessments, common pitfalls such as data contamination and benchmark overfitting, and practical guidelines for building a scientific, multi‑layered AI evaluation framework.

AI evaluation · LLM benchmarks · LLM-as-Judge

27 min read

Fun with Large Models

May 22, 2026 · Artificial Intelligence

How to Rigorously Evaluate Large Models: Methods and Key Benchmark Datasets

This guide explains why systematic evaluation is essential for large models and outlines three core evaluation approaches: human assessment, benchmark‑dataset testing, and automated judge models. It then introduces the most widely used benchmark suites and shows how to use the open‑source EvalScope framework and prompt‑design techniques to conduct reliable model assessments.

EvalScope · Prompt Design · automated judge

17 min read
