Can AI Ace the Gaokao Math Test? Surprising Results from Six Top LLMs

A recent evaluation had six leading large language models (GPT‑4o, GLM‑4, Wenxin 4.0, Doubao, Baichuan 4, and Qwen‑2.5) answer the first 14 objective questions of the new Gaokao mathematics I paper. Only GLM‑4 surpassed the 60% passing threshold; the others fell well short of it.


Background

To assess the advanced reasoning and problem‑solving abilities of large language models (LLMs), researchers selected the first 14 objective questions from the new Gaokao mathematics I exam (total score 73, passing line 43.8) and asked six top Chinese LLMs to answer them without any system prompts or external search.

Test Setup

The models evaluated were GPT‑4o, GLM‑4, Wenxin 4.0, Doubao, Baichuan 4, and Qwen‑2.5. Each model received the questions directly and returned its answers. Scoring followed the official Gaokao rules: eight single‑choice questions worth 5 points each, three multiple‑choice questions worth 6 points each (full credit only for completely correct answers), and three fill‑in‑the‑blank questions worth 5 points each.
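The scoring scheme above can be sanity‑checked with a few lines. This is a minimal sketch (the helper names are my own, not from the article); it confirms that the three sections add up to the stated 73‑point total.

```python
# Scoring scheme described in the article: 8 single-choice x 5 pts,
# 3 multiple-choice x 6 pts (all-or-nothing), 3 fill-in-the-blank x 5 pts.
SECTIONS = {
    "single_choice": (8, 5),   # (number of questions, points each)
    "multiple_choice": (3, 6),
    "fill_in_blank": (3, 5),
}

def max_score() -> int:
    """Total points available across the 14 objective questions."""
    return sum(n * pts for n, pts in SECTIONS.values())

def score(fully_correct: dict) -> int:
    """Score a model given the count of fully correct answers per section."""
    return sum(fully_correct.get(name, 0) * pts
               for name, (_, pts) in SECTIONS.items())

print(max_score())  # 73, matching the article's stated total (pass line = 60% = 43.8)
```

A model that answered every question fully correctly would score `score({"single_choice": 8, "multiple_choice": 3, "fill_in_blank": 3})`, i.e. the full 73 points.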

Overall Scores

The aggregated results are shown in the figure below.

Figure: LLM score summary

GLM‑4 achieved 63 points, the only score above the passing line. GPT‑4o followed with 41, then Doubao with 40, Wenxin 4.0 and Baichuan 4 with 30 each, and Qwen‑2.5 with 29. The gap between the top model and the rest was striking.
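These totals can be checked against the 60% passing line in a few lines (a minimal sketch; the scores are taken from the figure as reported):

```python
# Reported totals from the evaluation, compared against the passing line.
TOTAL = 73
PASS_LINE = 0.6 * TOTAL  # 43.8 points

scores = {"GLM-4": 63, "GPT-4o": 41, "Doubao": 40,
          "Wenxin 4.0": 30, "Baichuan 4": 30, "Qwen-2.5": 29}

for model, pts in sorted(scores.items(), key=lambda kv: -kv[1]):
    status = "PASS" if pts >= PASS_LINE else "fail"
    print(f"{model:<12} {pts:>2}/{TOTAL} ({pts / TOTAL:.0%}) {status}")
```

Running this shows only GLM‑4 clearing the 43.8‑point line, with GPT‑4o and Doubao closest behind it.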

Detailed Question Performance

For each question, the correctness of each model was recorded. Notably, the hardest single‑choice question (question 8) was answered incorrectly by every model. On several multiple‑choice and fill‑in‑the‑blank items, only GLM‑4 (and occasionally one or two other models) answered correctly, while the rest gave partially correct or entirely wrong responses.

Key Observations

LLMs still struggle with high‑level mathematical reasoning required by Gaokao problems.

Even the best‑performing model, GLM‑4, still dropped 10 of the 73 available points, indicating clear room for improvement.

Performance varied widely across question types, with multiple‑choice and fill‑in‑the‑blank items exposing the greatest weaknesses.

Conclusion

The experiment demonstrates that current LLMs, despite impressive language capabilities, are not yet reliable for solving challenging mathematics problems like those on the Chinese college entrance exam. Further research is needed to enhance logical reasoning and abstract problem‑solving in AI systems.

Tags: AI, large language models, model evaluation, Gaokao, math exam, GLM-4
Written by

Java High-Performance Architecture

Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
