Machine Heart
Jun 11, 2026 · Artificial Intelligence
Do Large Language Models Truly Grasp Phrase Semantics? Findings from ACL 2026 Oral
The SemanticQA benchmark breaks phrase-level semantic understanding into three tasks (extraction, categorization, and interpretation) and evaluates more than ten models, including GPT-5, Claude Sonnet and Gemini 2.5 Pro. The results reveal systematic gaps, performance that drops as categories become finer, and error propagation in multi-step pipelines.
SemanticQA · evaluation benchmark · large language models
