Do Large Language Models Truly Grasp Phrase Semantics? Findings from ACL 2026 Oral
The SemanticQA benchmark breaks phrase‑level semantic understanding into extraction, classification, and interpretation tasks and evaluates more than ten models, including GPT‑5, Claude Sonnet, and Gemini 2.5 Pro. It reveals systematic gaps between operations, performance drops as categories become finer, and error propagation in multi‑step pipelines.
SemanticQA Benchmark
SemanticQA is an operation‑aligned diagnostic benchmark that decomposes phrase‑level semantic understanding into three atomic operations:
Extraction: locate the target phrase in a sentence, scored by exact span‑level matching.
Classification: assign the phrase to one of four semantic types: idiomatic expression, lexical collocation, noun compound, or verbal multiword expression (MWE).
Interpretation: generate a natural‑language definition of the phrase in its context.
All three tasks share a unified prompt template, so the same phrase instance is presented identically across operations. This design helps separate semantic ability from prompt‑format effects.
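A minimal sketch of what such a shared template could look like. The actual SemanticQA prompt wording is not reproduced in this article, so `TEMPLATE`, `TASK_INSTRUCTIONS`, `build_prompt`, and the field labels are all illustrative assumptions, as is the choice to hide the phrase in the extraction prompt.

```python
from typing import Optional

# Hypothetical unified template: only the task instruction changes between
# operations; the sentence is rendered identically every time.
TEMPLATE = (
    "Task: {instruction}\n"
    "Sentence: {sentence}\n"
    "{phrase_line}"
    "Answer:"
)

# Assumed instruction wording, not taken from the benchmark.
TASK_INSTRUCTIONS = {
    "extraction": "Copy the target expression exactly as it appears in the sentence.",
    "classification": (
        "Label the expression as one of: idiomatic expression, "
        "lexical collocation, noun compound, verbal MWE."
    ),
    "interpretation": "Define the expression as it is used in this sentence.",
}


def build_prompt(operation: str, sentence: str, phrase: Optional[str] = None) -> str:
    """Render one phrase instance for the given operation."""
    if operation not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    if operation != "extraction" and phrase is None:
        raise ValueError("classification and interpretation need the phrase")
    # Assumption: extraction must not reveal the phrase it is asked to find.
    phrase_line = "" if operation == "extraction" else f"Phrase: {phrase}\n"
    return TEMPLATE.format(
        instruction=TASK_INSTRUCTIONS[operation],
        sentence=sentence,
        phrase_line=phrase_line,
    )
```

Because every operation goes through the same function, a score difference between two operations on the same item cannot be blamed on differently formatted inputs.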
Dataset Construction
The benchmark aggregates several existing semantic annotation resources, covering thousands of test items across the four MWE categories. Each item provides a standardized input sentence, a required output format, and a fixed prompt. The data retain the heterogeneity of the source corpora (different annotation protocols, difficulty distributions, and granularity) while enforcing a common interface.
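One way to picture the common interface is a single record type that every source corpus is adapted into. The sketch below is an assumption for illustration only: `BenchmarkItem`, `from_span_record`, and the input keys (`text`, `span`, `label`, `gloss`) are hypothetical names, not the repository's schema.

```python
from dataclasses import dataclass
from typing import Optional

# The four semantic types named in the benchmark description.
CATEGORIES = frozenset({
    "idiomatic expression",
    "lexical collocation",
    "noun compound",
    "verbal MWE",
})


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical common record shared by all source corpora."""
    sentence: str
    phrase: str
    category: str
    source: str                            # which corpus the item came from
    gold_definition: Optional[str] = None  # only some sources provide one

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def from_span_record(record: dict, source: str) -> BenchmarkItem:
    """Adapt a hypothetical source that marks the phrase by character offsets."""
    text = record["text"]
    start, end = record["span"]
    return BenchmarkItem(
        sentence=text,
        phrase=text[start:end],
        category=record["label"],
        source=source,
        gold_definition=record.get("gloss"),
    )
```

Keeping the `source` field on each item preserves the link back to the original annotation protocol, so results can still be broken down per corpus.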
Evaluated Models
More than ten models are evaluated, ranging from classic architectures (BERT, T5) to recent frontier systems (GPT‑5, Claude Sonnet, DeepSeek‑R1, Gemini 2.5 Pro), covering both open‑source and closed‑source offerings.
Key Findings
Systematic Gaps: No model excels across all three operations. Performance gaps are especially pronounced between extraction and the other two tasks.
Classification: Models achieve reasonable accuracy on coarse‑grained categories but struggle with fine‑grained semantic relations.
Extraction: Exact‑match extraction rates remain low even when classification scores are decent, indicating reliance on surface patterns rather than true syntactic‑semantic understanding.
Interpretation: Generated definitions often receive high similarity scores (e.g., BERTScore) yet contain factual deviations; models can “sound right” without being correct.
Example: GPT‑5 attains 85.4% accuracy on idiom classification (5‑shot) but only 78.7% exact‑match extraction and a METEOR score of 22.5% for idiom interpretation.
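To make the exact‑match criterion concrete, here is a minimal scoring sketch. The normalization (case folding and whitespace collapsing) is an assumption, and `normalize` and `exact_match_rate` are illustrative names; the benchmark's own scorer may differ.

```python
from typing import Sequence


def normalize(span: str) -> str:
    """Assumed minimal normalization: collapse whitespace and lowercase."""
    return " ".join(span.split()).lower()


def exact_match_rate(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predicted spans identical to the gold span after normalization."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)


golds = ["kicked the bucket", "kicked the bucket", "spill the beans"]
preds = ["kicked the bucket", "the bucket", "Spill  the beans"]
# The partial span "the bucket" earns no credit, so two of three items match.
print(exact_match_rate(preds, golds))
```

The all‑or‑nothing credit is what makes extraction a stricter test than classification: a prediction that overlaps the gold span but misses one token counts as a full error.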
When the number of semantic categories increases from 2–4 to 16, performance degrades sharply. DeepSeek‑R1’s classification accuracy drops from 81.7% to 35.4% (a 46.3‑point decline); GPT‑5 remains more stable but still loses accuracy in the 16‑class setting.
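The size of that decline can be read in two ways, in absolute points and relative to the starting accuracy. The short calculation below uses only the two DeepSeek‑R1 figures quoted above; the helper names are illustrative, and the chance baseline assumes balanced classes, which the article does not state.

```python
def point_drop(before: float, after: float) -> float:
    """Absolute decline in percentage points."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Decline as a fraction of the starting accuracy."""
    return (before - after) / before


def uniform_chance(num_classes: int) -> float:
    """Accuracy (in %) of random guessing, assuming balanced classes."""
    return 100.0 / num_classes


before, after = 81.7, 35.4  # DeepSeek-R1 accuracies quoted in the text
print(round(point_drop(before, after), 1))           # 46.3 points
print(round(100 * relative_drop(before, after), 1))  # 56.7 percent of the original accuracy
print(uniform_chance(4), uniform_chance(16))         # 25.0 and 6.25
```

Under the balanced‑class assumption, the random‑guess baseline itself falls from 25% to 6.25% when moving from 4 to 16 classes, so raw accuracies across the two settings are not directly comparable without that context.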
Multi‑Step Error Propagation
SemanticQA includes compositional tasks that mimic real‑world pipelines (extract → interpret or extract → classify). Errors in upstream extraction sharply reduce downstream quality. In an “extract + interpret” chain, GPT‑5’s extraction accuracy is 41.3%, while the final METEOR score falls to 17.3%.
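The cascade can be illustrated with a toy chain that scores the interpreter twice, once on the extractor's output and once on the gold span. Everything here is a stand‑in: `chain_report`, the three‑word extractor, the glossary interpreter, and the 0/1 scorer are hypothetical and unrelated to the models or metrics in the paper.

```python
from typing import Callable, Dict, Iterable, Tuple

Item = Tuple[str, str, str]  # (sentence, gold phrase, gold definition)


def chain_report(
    items: Iterable[Item],
    extract: Callable[[str], str],
    interpret: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Compare interpretation quality on predicted spans versus gold spans."""
    n, em, piped, oracle = 0, 0, 0.0, 0.0
    for sentence, gold_phrase, gold_def in items:
        predicted = extract(sentence)
        em += predicted == gold_phrase
        piped += score(interpret(sentence, predicted), gold_def)     # real pipeline
        oracle += score(interpret(sentence, gold_phrase), gold_def)  # perfect upstream
        n += 1
    if n == 0:
        raise ValueError("no items to evaluate")
    return {"extraction_em": em / n, "piped_score": piped / n, "oracle_score": oracle / n}


# Toy stand-ins for the two stages and the metric.
GLOSSARY = {"kicked the bucket": "died", "spilled the beans": "revealed a secret"}


def toy_extract(sentence: str) -> str:
    return " ".join(sentence.split()[-3:])  # naive: always take the last three words


def toy_interpret(sentence: str, phrase: str) -> str:
    return GLOSSARY.get(phrase, "unknown")


def toy_score(candidate: str, reference: str) -> float:
    return float(candidate == reference)


items = [
    ("He finally kicked the bucket", "kicked the bucket", "died"),
    ("She spilled the beans yesterday", "spilled the beans", "revealed a secret"),
]
report = chain_report(items, toy_extract, toy_interpret, toy_score)
```

The gap between `oracle_score` and `piped_score` is the share of downstream loss attributable to extraction alone, which is the quantity a compositional task is designed to expose.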
Analysis of Model Behavior
Across OpenAI models spanning roughly three years (GPT‑3.5‑Turbo, GPT‑4, o3, GPT‑5), a consistent ranking emerges (o3 > GPT‑4 > GPT‑3.5‑Turbo), and few‑shot prompting generally outperforms zero‑shot.
Scaling model size does not guarantee finer‑grained semantic understanding. In the 16‑class setting, large models exhibit larger performance drops than smaller, domain‑supervised models, suggesting reliance on statistical co‑occurrence rather than structured semantic representations.
Practical Implications
Single‑metric, single‑task evaluations miss critical failure modes; multi‑operation benchmarks are essential for revealing true semantic capability.
Few‑shot prompting benefits classification more than extraction; example quality matters more than quantity for extraction.
Model performance is tightly coupled to task format; high scores on similarity metrics do not imply robust semantic reasoning.
In multi‑step pipelines, upstream extraction errors cascade, causing downstream interpretation or classification to collapse.
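The point about similarity metrics can be seen with a deliberately simple surface measure. The sketch below uses unigram‑overlap F1 as a stand‑in; it is neither BERTScore nor METEOR, and `unigram_f1` is an illustrative helper, not part of the benchmark.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two definitions (a crude surface similarity)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "to reveal a secret"
wrong_but_similar = "to keep a secret"         # opposite meaning, 3 of 4 tokens shared
right_but_different = "disclosing hidden information"

print(unigram_f1(wrong_but_similar, reference))    # 0.75
print(unigram_f1(right_but_different, reference))  # 0.0
```

A definition with the opposite meaning outscores a correct paraphrase here, which is why a similarity score on its own is weak evidence of correct interpretation.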
Limitations
The benchmark was designed for static, single‑turn evaluation (2023 design, 2025 revision). In agent‑centric, long‑running scenarios, static tests cannot capture temporal error accumulation, highlighting the need for dynamic, adaptive evaluation frameworks.
Resources
Project homepage: https://semanticqa.github.io
Paper: https://arxiv.org/pdf/2604.16593
Implementation repository: https://github.com/jacklanda/SemanticQA