GPT-5.3 Codex vs Claude Opus 4.6: Late‑Night Showdown for the Programming Champion

Anthropic and OpenAI released Claude Opus 4.6 and GPT‑5.3‑Codex within minutes of each other, prompting this side‑by‑side look at their programming abilities, long‑context windows, agentic features, benchmark scores, pricing, and recommended real‑world use cases.


Claude Opus 4.6 – Core Capabilities

Programming improvements: more cautious planning (thinks before coding), greater persistence (maintains state across longer agentic tasks), stability on large codebases (less likely to get lost in complex projects), and stronger self‑correction (better at code review and debugging).

Anthropic engineers report that Opus 4.6 spends more time on the hardest parts of a task and moves quickly through the easy ones.

Agent Teams – Multi‑Agent Collaboration

Claude Code now lets users launch multiple agents that work in parallel and coordinate among themselves.

Anthropic built a 100 k‑line C compiler with 16 Claude agents working simultaneously, using git for task locking and code synchronization. The project took roughly 2 000 Claude sessions and about $20 k; the resulting compiler targets x86, ARM, and RISC‑V, compiles Linux 6.9, QEMU, FFmpeg, SQLite, PostgreSQL, and Redis, and can even run Doom.

Claude's C compiler repository
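
The git‑based task locking Anthropic describes can be approximated with a claim‑by‑push protocol: an agent claims a task by committing a lock file and letting the remote arbitrate the race. Below is a minimal Python sketch of that idea; the `locks/` layout and the `claim_task` helper are illustrative assumptions, not Anthropic's actual harness.

```python
import os
import subprocess

def git(*args: str) -> None:
    """Run a git command in the shared repo, raising on failure."""
    subprocess.run(["git", *args], check=True, capture_output=True, text=True)

def claim_task(task_id: str, agent_id: str) -> bool:
    """Try to claim a task by committing and pushing a lock file.

    Whichever agent's push lands first owns the task; everyone else
    sees a rejected push, rolls back, and picks a different task.
    """
    git("pull", "--rebase")  # sync with the other agents first
    os.makedirs("locks", exist_ok=True)
    lock_path = f"locks/{task_id}.lock"
    try:
        with open(lock_path, "x") as f:  # 'x' fails if the lock already exists locally
            f.write(agent_id + "\n")
    except FileExistsError:
        return False
    git("add", lock_path)
    git("commit", "-m", f"{agent_id} claims {task_id}")
    try:
        git("push")  # the remote acts as the atomic arbiter
        return True
    except subprocess.CalledProcessError:
        # Lost the race (or hit an unrelated non-fast-forward); a real harness
        # would rebase and retry rather than simply giving up the claim.
        git("reset", "--hard", "@{upstream}")
        return False
```

The push works like a compare‑and‑swap: the remote accepts exactly one first claim, which is what lets 16 agents divide work without a central scheduler.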

Million‑Token Context (Beta)

Opus 4.6 is the first Opus model with a 1 M‑token context window. In the MRCR v2 long‑context retrieval test it scored 76 % versus 18.5 % for the previous Sonnet 4.5, a qualitative shift in usable context.

"This is a qualitative shift in how much context a model can actually use while maintaining peak performance."

New API Features

Adaptive Thinking: the model decides for itself when to think deeply and when to skim.

Effort control: four levels (low, medium, high, max) let developers trade off speed, cost, and intelligence (see the sketch after this list).

Context Compaction (Beta): old dialogue is automatically summarised so long sessions don't hit the context wall.

128k output tokens: longer outputs from a single request.
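
As a rough illustration of the effort dial, here is a minimal sketch using the Anthropic Python SDK's escape hatch for extra request fields. The `effort` field name, its accepted values, and the `claude-opus-4-6` model id are assumptions on my part; confirm all three against Anthropic's Messages API documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical parameter: the release notes describe low/medium/high/max effort
# levels, but the exact field name and any required beta flag may differ.
response = client.messages.create(
    model="claude-opus-4-6",          # assumed model id
    max_tokens=2048,
    extra_body={"effort": "medium"},  # dial intelligence down for faster, cheaper runs
    messages=[{"role": "user", "content": "Refactor this parser for readability."}],
)
print(response.content[0].text)
```

Dropping from max to medium is also the lever Anthropic suggests later in this piece when deep thinking makes hard problems slow and expensive.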

Office‑Suite Integration

Claude in Excel: higher performance, automatic planning, unstructured‑data handling, and multi‑step operations in a single request.

Claude in PowerPoint (preview): reads templates, fonts, and slide masters, and preserves brand consistency.

Core Benchmark Scores

Terminal‑Bench 2.0 – highest industry score on complex CLI tasks.

Humanity’s Last Exam – leads all frontier models on multidisciplinary reasoning.

GDPval‑AA – +144 Elo over GPT‑5.2 on finance/legal knowledge work.

BrowseComp – top score on online information retrieval.

MRCR v2 (1M) – 76 % on long‑context retrieval.

ARC‑AGI‑2 – 68.8 % (max effort, 120k thinking tokens) on the general‑intelligence test.

Safety

Low rates of misalignment behaviors such as deception, sycophancy, and compliance with harmful requests.

The lowest over‑refusal rate of any Claude generation to date.

Six new security probes added to assess network‑security capability.

Claude assists in discovering and patching open‑source vulnerabilities (see Anthropic security blog).

GPT‑5.3‑Codex – Speed‑First Code Specialist

Positioned as the fastest and most accurate coding model.

Faster and Cheaper

25 % faster than GPT‑5.2‑Codex while using fewer tokens, reducing cost and latency for high‑frequency API calls.
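
If GPT‑5.3‑Codex ships the way its predecessors did, those high‑frequency calls go through the OpenAI Responses API. A minimal sketch; the `gpt-5.3-codex` model id is an assumption until OpenAI publishes it.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical model id: earlier Codex models are served via the Responses
# API, so "gpt-5.3-codex" is a guess pending OpenAI's official docs.
response = client.responses.create(
    model="gpt-5.3-codex",
    input="Write a Python function that deduplicates a list while preserving order.",
)
print(response.output_text)
```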

Mid‑Task Steering

Allows users to intervene and adjust the task direction mid‑execution without losing context.

Self‑Participating in Training

Early versions of the model were used to debug its own training code, manage deployments, and diagnose test results.

Core Benchmark Scores

Terminal‑Bench 2.0 – 77.3 % (13 points higher than GPT‑5.2).

SWE‑bench Pro – 56.8 % on public code‑fix tasks.

OSWorld‑Verified – 64.7 % on computer‑use agents.

CVEBench – 90 % on security‑vulnerability discovery.

GDPval – 70.9 % on 44 professional‑knowledge tasks.

Terminal‑Bench, created by Stanford, evaluates CLI agents on complex real‑world tasks; GPT‑5.3‑Codex's 77.3 % is a strong result.

The model also passed three high‑threshold network‑security assessments, earning a “high network capability” rating.

Pricing Comparison

Claude Opus 4.6 – $25 per million input tokens for requests within a 200 k context window, $37.50 per million beyond that.

GPT‑5.3‑Codex – pricing not yet published; GPT‑5.2 offered $14/M for standard mode and $28/M for high‑priority mode.
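
To make the tiered Claude pricing concrete, here is a quick input‑cost calculation using the figures above. One caveat: I'm assuming the long‑context rate applies to the whole request once it crosses 200k tokens, which is the convention Anthropic used for earlier long‑context betas; verify against the current pricing page.

```python
def opus_input_cost(input_tokens: int) -> float:
    """USD input cost under the tiered Opus 4.6 pricing quoted above.

    Assumption: the $37.50/M long-context rate applies to the entire
    request once it exceeds 200k input tokens, as in earlier betas.
    """
    rate = 25.0 if input_tokens <= 200_000 else 37.5
    return input_tokens / 1e6 * rate

print(f"${opus_input_cost(150_000):.2f}")  # 150k-token prompt -> $3.75
print(f"${opus_input_cost(300_000):.2f}")  # 300k-token prompt -> $11.25
```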

Speed Perception

Community tests report that GPT‑5.3‑Codex still feels slower than Claude Opus 4.6, even with the claimed 25 % speed boost.

Anthropic notes that Opus 4.6 may incur higher latency and cost on difficult problems due to deeper thinking; the effort parameter can be lowered to medium to reduce this.

Agent Teams Development Insights

Tests must be extremely high quality; otherwise Claude will solve the wrong problem.

Design tool output from Claude's perspective: logs should be concise, grep‑able, and summarised to avoid polluting the context.

Time‑blindness: Claude does not perceive elapsed time, so the harness needs automatic timeouts and progress sampling (see the sketch after this list).

Parallelism needs careful design; if all agents hit the same bug, parallelism fails.

Clear role division among agents (refactoring, performance, documentation) improves efficiency.
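
The time‑blindness point above translates directly into harness code: since the model never notices elapsed time, the orchestrator has to own the clock. Here is a minimal sketch of such a watchdog; the `claude -p` invocation in the usage comment is just one illustrative way to launch an agent task.

```python
import subprocess
import time

def run_with_watchdog(cmd: list[str], timeout_s: float, sample_every_s: float = 30.0) -> int:
    """Run one agent task under a hard deadline, emitting progress samples.

    The agent has no sense of elapsed time, so the harness polls, logs a
    heartbeat, and kills the process when the deadline passes.
    """
    with open("agent.log", "w") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        deadline = time.monotonic() + timeout_s
        while proc.poll() is None:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                proc.kill()
                proc.wait()
                raise TimeoutError(f"agent exceeded {timeout_s}s; retry or split the task")
            print(f"[watchdog] agent alive, {remaining:.0f}s left")  # progress sample
            time.sleep(min(sample_every_s, remaining))
    return proc.returncode

# Example (hypothetical task): run_with_watchdog(["claude", "-p", "fix the failing test"], 600)
```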

"Building this compiler has been some of the most fun I've had recently, but I did not expect this to be anywhere near possible so early in 2026."
Tags: code generation, benchmarking, agentic AI, AI model comparison, large context, Claude Opus 4.6, GPT‑5.3 Codex
Written by Old Zhang's AI Learning, an AI practitioner specializing in large‑model evaluation and on‑premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.