Gemini 3.1 Pro Doubles Reasoning Scores, Beats Claude and GPT on ARC‑AGI‑2

Google's Gemini 3.1 Pro posts a 148% jump to 77.1% on the ARC-AGI-2 benchmark, scores a perfect 100% on AIME 2025, and outperforms Claude Opus 4.6 and GPT-5.2 on abstract reasoning, all while offering a 1M-token context window, real-time code demos, and an immediate platform rollout.


Benchmark Highlights

Google DeepMind released official numbers showing Gemini 3.1 Pro reaching 77.1% on ARC-AGI-2, a 148% relative increase over the previous version's 31.1%. It also scores 100% on the AIME 2025 mathematics competition, 80.6% on SWE-Bench Verified, a 2887 Elo on LiveCodeBench Pro, and 94.3% on GPQA Diamond.
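(For reference, the 148% figure is a relative gain, not a percentage-point difference: (77.1 − 31.1) / 31.1 ≈ 1.48, i.e., roughly 148%, which is why the headline says the reasoning score more than doubled.)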

Comparison with Competitors

Against Claude Opus 4.6 (68.8% on ARC-AGI-2) and GPT-5.2 (52.9%), Gemini 3.1 Pro leads in abstract reasoning. On the "Humanity's Last Exam" benchmark, Claude slightly edges out Gemini (53.1% vs. 51.4%). Coding ability is within one percentage point across the three models, while Gemini leads terminal programming by about three points.

Direct Visual Comparison

Google posted a side-by-side video showing faster responses and higher-quality output from 3.1 Pro than from 3.0 on the same tasks.

Practical Demos

City Planner Application

The model generated a full city‑planning app handling complex terrain, road and infrastructure layout, traffic flow simulation, and 3D visualization, integrating constraints and multi‑objective optimization without requiring the user to write code.

3‑D Flocking Simulation

A starling-flocking demo produced code implementing the boids algorithm, hand-gesture interaction, dynamic audio, and 3-D rendering, illustrating "creative programming" that goes beyond business logic.
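For readers unfamiliar with the technique: the boids model steers each bird by three local rules (separation, alignment, cohesion). Below is a minimal 2-D sketch of one update step; it is an illustration of the classic algorithm, not the demo's actual (unpublished) code, and all parameter values are placeholders.

```python
import numpy as np

def boids_step(pos, vel, dt=0.1, radius=2.0,
               w_sep=1.5, w_ali=1.0, w_coh=1.0, max_speed=3.0):
    """One update of the classic boids rules: separation, alignment, cohesion."""
    new_vel = vel.copy()
    for i in range(len(pos)):
        offsets = pos - pos[i]                      # vectors from bird i to all birds
        dist = np.linalg.norm(offsets, axis=1)
        neighbors = (dist < radius) & (dist > 0)    # nearby birds, excluding self
        if not neighbors.any():
            continue
        sep = -offsets[neighbors].sum(axis=0)             # steer away from close birds
        ali = vel[neighbors].mean(axis=0) - vel[i]        # match neighbors' velocity
        coh = pos[neighbors].mean(axis=0) - pos[i]        # steer toward their center
        new_vel[i] += dt * (w_sep * sep + w_ali * ali + w_coh * coh)
        speed = np.linalg.norm(new_vel[i])
        if speed > max_speed:                             # clamp speed
            new_vel[i] *= max_speed / speed
    return pos + dt * new_vel, new_vel

# Usage: 50 birds wandering a 2-D plane
rng = np.random.default_rng(0)
pos = rng.uniform(0, 10, size=(50, 2))
vel = rng.normal(0, 1, size=(50, 2))
for _ in range(100):
    pos, vel = boids_step(pos, vel)
```

The 3-D version is the same math with three-component vectors; the demo's gesture input and audio sit on top of this core loop.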

Real‑time Code Verification

During a code-review session, Gemini 3.1 Pro flagged its own uncertainty about a SQL query and automatically launched a PostgreSQL container to test it, demonstrating a "try-and-verify" reasoning style.
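The underlying pattern is easy to reproduce by hand. Here is a hedged sketch of the same "try-and-verify" loop driven through the Docker CLI; it mirrors the behavior described above but is not Gemini's internal tooling, and the query is a placeholder.

```python
import subprocess, time

QUERY = "SELECT 1 AS sanity_check;"  # placeholder for the query under review

# Start a throwaway PostgreSQL container.
cid = subprocess.run(
    ["docker", "run", "-d", "--rm", "-e", "POSTGRES_PASSWORD=test", "postgres:16"],
    capture_output=True, text=True, check=True,
).stdout.strip()

try:
    # Wait until the server accepts connections.
    for _ in range(30):
        ready = subprocess.run(["docker", "exec", cid, "pg_isready", "-U", "postgres"],
                               capture_output=True)
        if ready.returncode == 0:
            break
        time.sleep(1)
    # Run the query and inspect the real result instead of guessing.
    result = subprocess.run(
        ["docker", "exec", cid, "psql", "-U", "postgres", "-c", QUERY],
        capture_output=True, text=True,
    )
    print(result.stdout or result.stderr)
finally:
    subprocess.run(["docker", "stop", cid])  # --rm removes the container on stop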

Long‑Context Performance

While the model accepts up to 1M tokens of input and emits up to 64k tokens of output, measured accuracy drops sharply as the window fills: 84.9% at 128k context but only 26.3% at 1M, indicating current limits for ultra-long texts.
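Given that cliff, it is worth measuring prompt size before leaning on the full window. A sketch using the google-genai Python SDK's count_tokens call; the model id "gemini-3.1-pro" and the input file are assumptions here, so check the API documentation in the references for the actual identifier.

```python
from google import genai

client = genai.Client()   # reads the API key from the environment
MODEL = "gemini-3.1-pro"  # assumed id; confirm against the API docs

with open("big_corpus.txt") as f:  # placeholder input
    prompt = f.read()

count = client.models.count_tokens(model=MODEL, contents=prompt)
if count.total_tokens > 128_000:
    print(f"{count.total_tokens} tokens: past the ~128k range where measured "
          "accuracy was still 84.9%; consider chunking the input.")
```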

Availability

Gemini 3.1 Pro is already live on Gemini App (free trial), Google AI Studio, Vertex AI, and NotebookLM (Pro/Ultra subscription). API pricing matches the previous Gemini 3 Pro.
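For developers picking it up via the API, a minimal call with the google-genai SDK looks like the following; again, the exact model string is an assumption, so verify it against the API documentation linked below.

```python
from google import genai

client = genai.Client()  # API key from the environment
response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed id; check ai.google.dev for the real one
    contents="Summarize the ARC-AGI-2 benchmark in two sentences.",
)
print(response.text)
```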

References

Google DeepMind model page: https://deepmind.google/models/gemini/pro/

Sundar Pichai announcement: https://x.com/sundarpichai/status/2024516418855981298

Google tweet: https://x.com/google/status/2024519455389192204

3.0 vs 3.1 comparison video: https://x.com/google/status/2024598510499426433

Gemini API documentation: https://ai.google.dev/gemini-api/docs/


Tags: large language models, model comparison, AI benchmarks, Google DeepMind, Gemini 3.1 Pro, AIME 2025, ARC-AGI-2
Written by

ShiZhen AI

Tech blogger with over 10 years of experience at leading tech firms; an AI efficiency and delivery expert focused on AI productivity. Covers tech gadgets, AI-driven efficiency, and the AI leisure community. 🛰 szzdzhp001
