Cognitive Technology Team
Oct 16, 2024 · Artificial Intelligence
Large Language Models Lack Formal Reasoning Ability: Five Pieces of Evidence from the GSM‑Symbolic Benchmark
Recent research from Iman Mirzadeh's team at Apple introduces the GSM‑Symbolic benchmark. It shows that large language models, despite high scores on GSM8K, suffer significant performance drops when a problem's numbers or names are changed or when extra clauses are added, indicating a lack of true formal reasoning ability.
AI Safety · Benchmark · GSM‑Symbolic
9 min read