Can AI Ace the Gaokao Math Test? Surprising Results from Six Top LLMs
A recent evaluation had six leading large‑language‑model products (GPT‑4o, GLM‑4, Wenxin 4.0, Doubao, Baichuan 4, and Qwen‑2.5) answer the first 14 objective questions of the new Gaokao mathematics I paper, revealing that only GLM‑4 surpassed the 60% passing threshold while the others performed far below expectations.
Background
To assess the advanced reasoning and problem‑solving abilities of large language models (LLMs), researchers selected the first 14 objective questions from the new Gaokao mathematics I exam (total score 73, passing line 43.8) and asked six top Chinese LLMs to answer them without any system prompts or external search.
Test Setup
The models evaluated were GPT‑4o, GLM‑4, Wenxin 4.0, Doubao, Baichuan 4, and Qwen‑2.5. Each model received the questions directly and returned its answers. Scoring followed the official Gaokao rules: eight single‑choice questions worth 5 points each, three multiple‑choice questions worth 6 points each (full credit only for completely correct answers), and three fill‑in‑the‑blank questions worth 5 points each, for a maximum of 73 points.
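The scoring scheme above can be sketched as a small scorer. This is a minimal illustration, not the evaluators' actual code; the per-question results passed in are hypothetical, while the point values, question split, and 43.8 passing line come from the article.

```python
# Point values per question type, per the official Gaokao rules cited above.
SINGLE, MULTI, FILL = 5, 6, 5

# Point value of each of the 14 questions, in exam order:
# questions 1-8 single-choice, 9-11 multiple-choice, 12-14 fill-in-the-blank.
POINTS = [SINGLE] * 8 + [MULTI] * 3 + [FILL] * 3

MAX_SCORE = sum(POINTS)          # 8*5 + 3*6 + 3*5 = 73
PASS_LINE = 0.6 * MAX_SCORE      # 60% of 73 = 43.8

def score(results):
    """results: 14 booleans, True if the answer was fully correct.
    Multiple-choice earns full credit only when completely correct,
    so a partially correct answer counts as False here."""
    return sum(p for p, ok in zip(POINTS, results) if ok)

# A perfect paper scores the full 73 points.
print(score([True] * 14))  # 73
```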
Overall Scores
The aggregated results are shown in the figure below.
GLM‑4 achieved 63 of 73 points, clearing the 43.8 passing line. GPT‑4o followed with 41, Doubao with 40, Wenxin 4.0 and Baichuan 4 with 30 each, and Qwen‑2.5 with 29. The gap between the top model and the rest was striking.
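Checking the reported scores against the passing line makes the gap concrete. A minimal sketch, using the scores and the 43.8 threshold stated in the article:

```python
# Reported scores (out of 73) from the evaluation.
scores = {
    "GLM-4": 63,
    "GPT-4o": 41,
    "Doubao": 40,
    "Wenxin 4.0": 30,
    "Baichuan 4": 30,
    "Qwen-2.5": 29,
}
MAX_SCORE = 73
PASS_LINE = 43.8  # 60% of 73

# Print each model's percentage and pass/fail status, best first.
for model, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    status = "pass" if s >= PASS_LINE else "fail"
    print(f"{model}: {s}/{MAX_SCORE} ({s / MAX_SCORE:.0%}) {status}")
```

Only GLM-4 clears the line (63/73 is about 86%); the runner-up, GPT-4o at 41, falls just under 3 points short.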
Detailed Question Performance
For each question, the correctness of each model's answer was recorded. Notably, the hardest single‑choice question (question 8) was answered incorrectly by every model. On several multiple‑choice and fill‑in‑the‑blank items, only GLM‑4 (and occasionally one or two other models) answered correctly, while the rest gave partially correct or entirely wrong responses.
Key Observations
LLMs still struggle with high‑level mathematical reasoning required by Gaokao problems.
Even the best‑performing model (GLM‑4), while clearing the passing line comfortably, still fell well short of a perfect score, indicating room for improvement.
Performance varied widely across question types, with multiple‑choice and fill‑in‑the‑blank items exposing the greatest weaknesses.
Conclusion
The experiment demonstrates that current LLMs, despite impressive language capabilities, are not yet reliable for solving challenging mathematics problems like those on the Chinese college entrance exam. Further research is needed to enhance logical reasoning and abstract problem‑solving in AI systems.