Grok 4: The ‘Problem‑Solving Champion’ That Falters in Real‑World Use – Detailed Evaluation
The article reviews Grok 4’s flashy launch and claimed first‑principles advantage, then presents benchmark results—showing strong reasoning, multimodal and agent scores but disappointing coding performance versus DeepSeek‑R1—concluding that the model’s real‑world capabilities fall short of its hype.
Introduction
On July 9, Elon Musk unveiled Grok 4 in a concise live demo that highlighted the model’s claimed superiority. The author sets out to compare the launch performance with actual usage across several dimensions.
Claims Made at the Launch
First‑principles reasoning: Grok 4 is said to solve problems by reasoning from the most fundamental principles rather than applying preset solutions.
Humanity's Last Exam (HLE): Tested on a PhD‑level dataset spanning 100+ fields, where most answers are not publicly available. Gemini 2.5 Pro achieved 26% accuracy, while Grok 4 reached 35‑45%.
Demo Performance Highlights
The model impressed the audience with:
Reasoning: Top scores on mathematics, logic and scientific benchmarks, even achieving a perfect score on the AIME exam.
Programming: Outperformed Claude 3.5 Sonnet in a simulated black‑hole‑collision coding demo and is slated to launch a dedicated Grok Coding model that allegedly surpasses Claude 4 Opus on SWE‑Bench.
Multimodal: Supports image input/output and real‑time voice interaction, holding its own against GPT‑4o in a voice battle.
DeepResearch: Can synthesize large amounts of web information into long documents.
Learning: Ranked second (behind OpenAI o3 pro) on the ARC‑AGI learning benchmark.
Agent capability: On the Vending‑Bench dataset, Grok 4 scored three times higher than Claude Opus 4, making it a strong candidate for building autonomous agents.
Practical Availability
Grok 4 is accessible via the Grok and X homepages. A $30/month “SuperGrok” subscription is required for basic use, while a $300/month “SuperGrok Heavy” plan unlocks a Heavy mode with a built‑in multi‑agent system.
Real‑World Evaluation
3.1 Programming Test
The author reproduced a classic “ball‑rolling” HTML/CSS/JS task using the following prompt:
Please generate a complete HTML file (with the HTML, CSS, and JavaScript all embedded in a single file) that animates a small red ball bouncing inside a regular pentagon rotating slowly clockwise. Requirements:
- The ball should be affected by gravity and bounce when it hits a boundary;
- Collision detection between the ball and the polygon should be realistic;
- All code should be contained within the <html> file, with no external libraries or files; the animation should be smooth, and the page layout should be adaptive.
Grok 4’s generated code contained syntax errors and failed to run, requiring manual correction. In contrast, DeepSeek‑R1 produced correct, runnable code, as shown in the side‑by‑side screenshots.
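For context, the physics core the prompt asks for is compact. The sketch below is my own illustration, not Grok 4’s or DeepSeek‑R1’s output; function names such as `pentagonVertices`, `reflect`, and `step` are assumed for the example:

```javascript
// Vertices of a regular pentagon centered at (cx, cy) with circumradius r,
// rotated by `angle` radians (animating `angle` gives the slow clockwise spin).
function pentagonVertices(cx, cy, r, angle) {
  return Array.from({ length: 5 }, (_, i) => {
    const a = angle + (i * 2 * Math.PI) / 5 - Math.PI / 2;
    return { x: cx + r * Math.cos(a), y: cy + r * Math.sin(a) };
  });
}

// Reflect a velocity vector about a unit edge normal (nx, ny):
// v' = v - 2 (v · n) n  — this is the "bounce" the prompt requires.
function reflect(v, nx, ny) {
  const d = v.x * nx + v.y * ny;
  return { x: v.x - 2 * d * nx, y: v.y - 2 * d * ny };
}

// Semi-implicit Euler step: apply gravity to velocity, then advance position.
function step(ball, dt, g = 980) {
  ball.vy += g * dt;
  ball.x += ball.vx * dt;
  ball.y += ball.vy * dt;
}
```

Per frame, the animation loop would rotate the pentagon, step the ball, find any edge whose distance to the ball’s center is below the ball radius, and reflect the velocity about that edge’s inward normal; multiplying the reflected velocity by a damping factor below 1 keeps the bounce from gaining energy.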
3.2 Reasoning Test
A classic pirate‑division logic puzzle was posed. After ten minutes of thinking, Grok 4 gave an incorrect answer, while DeepSeek‑R1 responded quickly and correctly.
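The article does not reproduce the exact puzzle, but in its standard form (five pirates splitting 100 gold coins, the most senior pirate proposes a split, a tie vote passes, and pirates prefer survival first and gold second) the answer falls out of backward induction. A minimal sketch, assuming that standard variant:

```javascript
// Backward-induction solver for the classic pirate gold puzzle.
// Assumptions: `coins` to split, tie votes pass, pirates are perfectly
// rational and prefer survival > gold. Index 0 is the proposer (most senior).
function pirateSplit(n, coins = 100) {
  if (n === 1) return [coins]; // a lone pirate keeps everything
  const prev = pirateSplit(n - 1, coins); // outcome if the proposer is thrown overboard
  const bribesNeeded = Math.ceil(n / 2) - 1; // votes needed beyond the proposer's own
  // A junior pirate votes yes only if offered strictly more than in `prev`,
  // so the proposer bribes the cheapest pirates first.
  const cheapest = prev
    .map((amt, i) => ({ amt, i }))
    .sort((a, b) => a.amt - b.amt)
    .slice(0, bribesNeeded);
  const alloc = new Array(n).fill(0);
  let spent = 0;
  for (const { amt, i } of cheapest) {
    alloc[i + 1] = amt + 1; // pirate i in the (n-1)-game sits at position i+1 here
    spent += amt + 1;
  }
  alloc[0] = coins - spent; // the proposer keeps the rest
  return alloc;
}

console.log(pirateSplit(5)); // → [98, 0, 1, 0, 1]
```

Under these rules the senior pirate keeps 98 coins and buys the two cheapest votes with one coin each, which is the answer DeepSeek‑R1 is reported to have reached quickly.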
3.3 DeepResearch Test
A complex espionage‑network query requiring synthesis of scattered clues was submitted. Grok 4 searched 238 webpages over ten minutes and produced the correct answer.
3.4 Long‑Form Writing
Grok 4 demonstrated the ability to generate coherent analysis reports and mid‑length fiction without noticeable issues.
Summary and Outlook
Overall, Grok 4 shows impressive scores in reasoning, multimodal interaction, deep research, and agent tasks, but its coding ability lags behind competitors such as DeepSeek‑R1. The model’s rapid development—four months from Grok 3.5 to Grok 4—suggests a strong reinforcement‑learning pipeline backed by substantial compute resources (≈200 k H100 GPUs). Upcoming releases of Grok Coding, multi‑agent systems, and video‑generation models may address current shortcomings.
Fun with Large Models
A master's graduate of Beijing Institute of Technology with four top‑journal papers, formerly a developer at ByteDance and Alibaba, now researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, in the belief that large models will become as essential as the PC. Let's start experimenting now!
