Gemini 3.1 Pro Doubles Reasoning Power and Outperforms Claude Opus 4.6

Google's Gemini 3.1 Pro achieves a 77.1% ARC‑AGI‑2 score—more than double its predecessor—leads in multiple benchmark categories, cuts inference cost by half compared to top rivals, and demonstrates advanced multimodal and programming capabilities through real‑world demos.

AI Engineering

Google launched Gemini 3.1 Pro, which scored 77.1% on the ARC‑AGI‑2 benchmark, more than twice the 31.1% achieved by Gemini 3 Pro. ARC‑AGI‑2 tests novel logical reasoning rather than memorized questions, and community members note that the jump suggests genuine problem understanding.

Benchmark Performance

Detailed results show breakthroughs across several key areas:

Tool‑use ability rose 82%, with APEX‑Agents increasing from 18.4% to 33.5%.

Gemini 3.1 Pro ranked first on MCP Atlas (69.2%) and BrowseComp (85.9%).

Programming tests: SWE‑Bench Verified 80.6%, Terminal‑Bench 2.0 68.5%.

According to Artificial Analysis v4.0, the model scored 57 on a composite index of ten metrics, reclaiming the top spot and surpassing Claude Opus 4.6 by four points.

In the CritPt physical‑reasoning test, Gemini 3.1 Pro Preview achieved 18%, five percentage points ahead of the runner‑up.

The model also ranked first on Terminal‑Bench Hard (54%) and SciCode (59%).

Knowledge and hallucination control improved markedly: AA‑Omniscience hallucination rate dropped from 88% to 50% while accuracy stayed at 53%.

Performance Metrics

Average output speed is 114 tokens/second, about 10 tokens/second slower than the previous generation, yet still among the fastest of the top‑10 models on the Intelligence Index.

The model retains a 1 million‑token context window, supports tool calls, structured output, and JSON mode. In multimodal understanding, Gemini 3.1 Pro Preview ranked first on MMMU‑Pro, ahead of Gemini 3 Pro Preview and Gemini 3 Flash.

On the GDPval‑AA real‑world task, the Elo score rose over 100 points to 1316, though it remains behind Claude Sonnet 4.6, Opus 4.6, GPT‑5.2 (xhigh), and GLM‑5.

Cost‑Efficiency Breakthrough

Running the full Intelligence Index benchmark suite costs $892, less than half the expense of Claude Opus 4.6 (max) and GPT‑5.2 (xhigh), though still about twice the cost of the open‑source GLM‑5.

Despite higher performance, token usage increased only from 56M to 57M, adding $72 in cost. Pricing stays at $2 per million input tokens and $12 per million output tokens, with a 1M‑token context window and a 64K‑token output limit; the knowledge cutoff is January 2025.
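The cost arithmetic at the listed rates is straightforward. A minimal sketch (the function name `gemini_cost_usd` is illustrative, not part of any SDK), using the article's stated rates of $2 per million input tokens and $12 per million output tokens:

```python
def gemini_cost_usd(input_tokens, output_tokens,
                    input_rate=2.0, output_rate=12.0):
    """Cost in USD at the listed rates:
    $2 per 1M input tokens, $12 per 1M output tokens."""
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

# e.g. a 10k-token prompt with a 2k-token response
print(gemini_cost_usd(10_000, 2_000))  # 0.044
```

At these rates, output tokens dominate cost for generation-heavy workloads, which is why benchmark suites that elicit long reasoning traces can run to hundreds of dollars.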

Real‑World Demonstrations

Real‑time ISS tracking dashboard: The model ingests live telemetry via public APIs, builds a responsive UI, and applies physical principles for accurate day‑night cycles.
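The day‑night rendering mentioned above can be approximated with basic solar geometry. The sketch below is a hypothetical illustration, not the demo's actual code: it uses a simplified declination formula and ignores the equation of time (~±16 min error), classifying a ground point as sunlit when its angular distance to the subsolar point is under 90°.

```python
import math

def subsolar_point(day_of_year, utc_hour):
    """Approximate subsolar latitude/longitude in degrees.
    Simplified model: ignores the equation of time."""
    decl = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
    lon = (12.0 - utc_hour) * 15.0  # the Sun moves 15 deg/hour westward
    return decl, lon

def is_daylight(lat, lon, day_of_year, utc_hour):
    """True when the point's angular distance to the subsolar point < 90 deg."""
    s_lat, s_lon = subsolar_point(day_of_year, utc_hour)
    lat1, lat2 = math.radians(lat), math.radians(s_lat)
    dlon = math.radians(lon - s_lon)
    cos_c = (math.sin(lat1) * math.sin(lat2)
             + math.cos(lat1) * math.cos(lat2) * math.cos(dlon))
    return cos_c > 0  # positive cosine of the central angle => within 90 deg

# Around the March equinox at 12:00 UTC, (0°, 0°) is in daylight
print(is_daylight(0, 0, 80, 12))    # True
print(is_daylight(0, 180, 80, 12))  # False
```

A dashboard would evaluate this per map pixel (or shade a terminator polygon) and refresh it alongside the telemetry feed.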

Code‑generated animation: From a text prompt the model produces SVG animation code that scales without pixelation and results in much smaller file sizes than video.

Example comparison: a user prompted “Create a SVG in HTML of a red Ferrari supercar”. Gemini 3.1 Pro generated smooth, modern‑supercar lines, while Claude Opus 4.6 produced a rounder design that users likened to a Homer‑style car.

Google engineers disclosed a dedicated post‑training optimization for SVG generation.
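To illustrate why SVG output stays crisp at any resolution and weighs a few hundred bytes rather than megabytes of video, here is a minimal hand‑written example (not model output; the helper name is hypothetical) that assembles an SVG with a SMIL `<animate>` element:

```python
def make_pulsing_circle_svg(radius=40, color="red"):
    """Build a tiny animated SVG: a circle whose radius pulses via SMIL.
    Vector instructions scale losslessly to any display resolution."""
    return f"""<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 200">
  <circle cx="100" cy="100" r="{radius}" fill="{color}">
    <animate attributeName="r" values="{radius};{radius * 1.5};{radius}"
             dur="2s" repeatCount="indefinite"/>
  </circle>
</svg>"""

svg = make_pulsing_circle_svg()
print(len(svg.encode()), "bytes")  # a few hundred bytes for a looping animation
```

Because the file encodes shapes and timing rather than pixels, zooming or re-rendering at 4K costs nothing extra, which is the property the demo highlights.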

Interactive 3D simulation: The model simulated a murmuration of starlings, understanding the physics of flocking, responding to hand‑tracking, and generating adaptive music.

Creative coding: Using “Wuthering Heights” as inspiration, the model designed a personal portfolio site, reasoning about the novel’s mood to produce a modern UI and code that captures the characters’ essence.

Community Reaction

Technical communities are impressed by the 82% tool‑use boost and a 2.5× improvement in abstract reasoning, describing the gains as a fundamental capability unlock rather than incremental progress. Some argue that the previous generation had fundamental flaws.

Observers note that Google has gone from follower to leader, topping 13 of 16 benchmarks within a year, but question whether benchmark dominance translates into stable real‑world performance.

Pricing strategy sparks debate; users claim Google’s rates undercut the high‑price models of OpenAI and Anthropic, suggesting a disruptive shift as AI becomes a commoditized service.

Availability

Gemini 3.1 Pro began rolling out on 3 May. Developers can access it via Google AI Studio, Antigravity, Gemini CLI, and the preview version of Android Studio. Consumer access is limited to the Gemini app and NotebookLM for Google AI Pro and Ultra users with higher quotas.

Industry commentators observe that the AI race is shifting from sheer parameter count to genuine reasoning ability, arguing that practical utility begins once models truly grasp complex system logic.

Tags: cost efficiency, AI benchmarks, multimodal reasoning, Gemini 3.1 Pro, Claude Opus 4.6, ARC-AGI-2
Written by

AI Engineering

Focused on cutting‑edge product and technology information and practical experience sharing in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).
