Can AI Ace the Gaokao Math Test? Surprising Results from Six Top LLMs
A recent evaluation had six leading large‑language‑model products (GPT‑4o, GLM‑4, Wenxin 4.0, Doubao, Baichuan 4, and Qwen‑2.5) answer the first 14 objective questions of the new Gaokao mathematics I paper, revealing that only GLM‑4 surpassed the 60% passing threshold while the others performed far below expectations.
Background
To assess the advanced reasoning and problem‑solving abilities of large language models (LLMs), researchers selected the first 14 objective questions from the new Gaokao mathematics I exam (total score 73, passing line 43.8) and asked six top Chinese LLMs to answer them without any system prompts or external search.
Test Setup
The models evaluated were GPT‑4o, GLM‑4, Wenxin 4.0, Doubao, Baichuan 4, and Qwen‑2.5. Each model received the questions directly and returned its answers. Scoring followed the official Gaokao rules: eight single‑choice questions worth 5 points each, three multiple‑choice questions worth 6 points each (full credit only for completely correct answers), and three fill‑in‑the‑blank questions worth 5 points each, for a maximum of 73 points.
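The scoring scheme above can be sketched as a small scorer. This is a minimal illustration, not the evaluators' actual code; the per-question results passed in are hypothetical, while the point values, question split, and 43.8 passing line come from the article.

```python
# Point values per question type, per the official Gaokao rules cited above.
SINGLE, MULTI, FILL = 5, 6, 5

# Point value of each of the 14 questions, in exam order:
# questions 1-8 single-choice, 9-11 multiple-choice, 12-14 fill-in-the-blank.
POINTS = [SINGLE] * 8 + [MULTI] * 3 + [FILL] * 3

MAX_SCORE = sum(POINTS)          # 8*5 + 3*6 + 3*5 = 73
PASS_LINE = 0.6 * MAX_SCORE      # 60% of 73 = 43.8

def score(results):
    """results: 14 booleans, True if the answer was fully correct.
    Multiple-choice earns full credit only when completely correct,
    so a partially correct answer counts as False here."""
    return sum(p for p, ok in zip(POINTS, results) if ok)

# A perfect paper scores the full 73 points.
print(score([True] * 14))  # 73
```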
Overall Scores
The aggregated results are shown in the figure below.
GLM‑4 achieved 63 of 73 points, clearing the 43.8 passing line. GPT‑4o followed with 41, Doubao with 40, Wenxin 4.0 and Baichuan 4 with 30 each, and Qwen‑2.5 with 29. The gap between the top model and the rest was striking.
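Checking the reported scores against the passing line makes the gap concrete. A minimal sketch, using the scores and the 43.8 threshold stated in the article:

```python
# Reported scores (out of 73) from the evaluation.
scores = {
    "GLM-4": 63,
    "GPT-4o": 41,
    "Doubao": 40,
    "Wenxin 4.0": 30,
    "Baichuan 4": 30,
    "Qwen-2.5": 29,
}
MAX_SCORE = 73
PASS_LINE = 43.8  # 60% of 73

# Print each model's percentage and pass/fail status, best first.
for model, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    status = "pass" if s >= PASS_LINE else "fail"
    print(f"{model}: {s}/{MAX_SCORE} ({s / MAX_SCORE:.0%}) {status}")
```

Only GLM-4 clears the line (63/73 is about 86%); the runner-up, GPT-4o at 41, falls just under 3 points short.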
Detailed Question Performance
For each question, the correctness of each model's answer was recorded. Notably, the hardest single‑choice question (question 8) was answered incorrectly by every model. On several multiple‑choice and fill‑in‑the‑blank items, only GLM‑4 (and occasionally one or two other models) answered correctly, while the rest gave partially correct or entirely wrong responses.
Key Observations
LLMs still struggle with high‑level mathematical reasoning required by Gaokao problems.
Even the best‑performing model (GLM‑4), while clearing the passing line comfortably, still fell well short of a perfect score, indicating room for improvement.
Performance varied widely across question types, with multiple‑choice and fill‑in‑the‑blank items exposing the greatest weaknesses.
Conclusion
The experiment demonstrates that current LLMs, despite impressive language capabilities, are not yet reliable for solving challenging mathematics problems like those on the Chinese college entrance exam. Further research is needed to enhance logical reasoning and abstract problem‑solving in AI systems.