DeepSeek V4 Pro vs GPT‑5.3 Codex High: Direct Code‑Generation Test Reveals the Gap

A two‑stage evaluation compares DeepSeek V4 Pro and GPT‑5.3 Codex High on a TypeScript LRU‑cache task and a Markdown‑inspection CLI project. DeepSeek leads on basic code correctness, while GPT‑5.3 delivers the more complete engineering solution; detailed scores and analysis follow.

Test Design

The evaluation uses a two‑layer approach. The first layer tests basic TypeScript coding ability with an LRU‑Cache implementation. The second layer tests end‑to‑end agent engineering ability by building a small CLI called md‑inspector that recursively scans Markdown files and produces a quality report.

Round 1 – LRU Cache

Requirements:

O(1) get / put operations

Configurable capacity, including handling capacity = 0

Full source code

Five test cases covering edge conditions

Scoring after three interaction rounds:

DeepSeek V4 Pro: first attempt 8.2 → final 9.0

GPT‑5.3 Codex High: first attempt 7.8 → final 8.6

DeepSeek V4 Pro's initial solution used a Map plus a doubly‑linked list, then added:

Generic type parameters

Constructor validation for non‑negative integer capacity

Additional API methods: size, has, clear

Vitest test suite covering boundary cases

Explicit complexity notes

Separated link‑node and data‑node types to avoid the unsafe cast null as unknown as K

API semantics: get returns V | undefined; tryGet returns an object with found: true/false to distinguish a cache miss from a cached undefined

GPT‑5.3 Codex High also started with a canonical Map + linked‑list design and then upgraded:

Adopted a circular sentinel design for the linked list

Removed the unsafe null as unknown cast

Added tests for illegal capacities (NaN, Infinity, negative numbers, floating‑point values)

Structured the hit result as {hit, value}

Weaknesses noted for GPT‑5.3 were fewer regression tests, less detailed design rationale, and slightly weaker engineering narration.
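
To ground the Round 1 discussion, here is a minimal sketch of the Map + doubly‑linked‑list design both models converged on, folding in the refinements described above: capacity validation, a sentinel node instead of null handling, and a tryGet that separates a miss from a cached undefined. This is an illustrative reconstruction, not either model's actual output; every identifier beyond those named in the article is an assumption.

```typescript
// Illustrative reconstruction, not either model's verbatim output.
interface CacheNode<K, V> {
  key: K;
  value: V;
  prev: CacheNode<K, V>;
  next: CacheNode<K, V>;
}

class LRUCache<K, V> {
  private map = new Map<K, CacheNode<K, V>>();
  // Circular sentinel: head.next is most recent, head.prev is least recent.
  private head: CacheNode<K, V>;

  constructor(private capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 0) {
      throw new RangeError(`capacity must be a non-negative integer, got ${capacity}`);
    }
    // One localized cast builds the sentinel; the separated link/data node
    // types described in the article avoid even this.
    const sentinel = {} as CacheNode<K, V>;
    sentinel.prev = sentinel;
    sentinel.next = sentinel;
    this.head = sentinel;
  }

  get size(): number {
    return this.map.size;
  }

  has(key: K): boolean {
    return this.map.has(key);
  }

  // Distinguishes a cache miss from a stored undefined value.
  tryGet(key: K): { found: true; value: V } | { found: false } {
    const node = this.map.get(key);
    if (!node) return { found: false };
    this.moveToFront(node);
    return { found: true, value: node.value };
  }

  get(key: K): V | undefined {
    const result = this.tryGet(key);
    return result.found ? result.value : undefined;
  }

  put(key: K, value: V): void {
    if (this.capacity === 0) return; // capacity 0: store nothing, evict nothing
    const existing = this.map.get(key);
    if (existing) {
      existing.value = value;
      this.moveToFront(existing);
      return;
    }
    if (this.map.size >= this.capacity) {
      const lru = this.head.prev; // least recently used sits just before the sentinel
      this.unlink(lru);
      this.map.delete(lru.key);
    }
    const node: CacheNode<K, V> = { key, value, prev: this.head, next: this.head.next };
    this.head.next.prev = node;
    this.head.next = node;
    this.map.set(key, node);
  }

  clear(): void {
    this.map.clear();
    this.head.prev = this.head;
    this.head.next = this.head;
  }

  private unlink(node: CacheNode<K, V>): void {
    node.prev.next = node.next;
    node.next.prev = node.prev;
  }

  private moveToFront(node: CacheNode<K, V>): void {
    this.unlink(node);
    node.prev = this.head;
    node.next = this.head.next;
    this.head.next.prev = node;
    this.head.next = node;
  }
}
```

The circular sentinel is what removes the null handling: insertion and eviction never touch a null pointer, so no null as unknown as K cast is needed, and every operation stays O(1).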

Round 2 – Markdown CLI (md‑inspector)

Task: implement a TypeScript CLI that recursively scans a directory of Markdown files and outputs a quality report. Real‑world edge cases required handling (the link rules are sketched just after this list):

Empty directories and non‑existent directories

Missing or multiple H1 headings

Image links that should not be counted as normal links

Links inside fenced code blocks that should be ignored

Cross‑platform path handling (Windows vs macOS/Linux)

File‑read failures that must produce warnings instead of crashing
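
To make the link rules concrete, here is a hypothetical sketch of a line‑oriented extractor that skips fenced code blocks and counts image links separately from normal links. It is deliberately regex‑based, so it shares the AST weakness noted later; the function name and return shape are assumptions, not either model's output.

```typescript
// Hypothetical sketch of the link-counting rules; not from either model's output.
export interface LinkStats {
  links: number;  // normal [text](url) links
  images: number; // ![alt](url) image links, counted separately
}

export function countLinks(markdown: string): LinkStats {
  const stats: LinkStats = { links: 0, images: 0 };
  let inFence = false;

  for (const line of markdown.split(/\r?\n/)) { // tolerate Windows line endings
    if (/^\s*(```|~~~)/.test(line)) {
      inFence = !inFence; // toggle on fence delimiters
      continue;
    }
    if (inFence) continue; // links inside fenced code blocks are ignored

    for (const match of line.matchAll(/(!?)\[[^\]]*\]\([^)]*\)/g)) {
      if (match[1] === "!") stats.images += 1; // image link: not a normal link
      else stats.links += 1;
    }
  }
  return stats;
}
```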

Constraints:

Use only Node built‑in modules

Reasonable file splitting

At least eight Vitest tests

Clear execution and verification instructions

Self‑review step at the end
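
As an illustration of how the edge cases translate into the required Vitest coverage, here are a few hypothetical test cases against the countLinks sketch above; the module path and test names are assumptions.

```typescript
// Hypothetical Vitest cases; module path and helper come from the sketch above.
import { describe, expect, it } from "vitest";
import { countLinks } from "../src/analyzer";

describe("countLinks", () => {
  it("ignores links inside fenced code blocks", () => {
    const md = "```\n[hidden](https://example.com)\n```\n[seen](https://example.com)";
    expect(countLinks(md)).toEqual({ links: 1, images: 0 });
  });

  it("does not count image links as normal links", () => {
    const md = "![logo](logo.png) and [docs](README.md)";
    expect(countLinks(md)).toEqual({ links: 1, images: 1 });
  });

  it("returns zeros for empty input", () => {
    expect(countLinks("")).toEqual({ links: 0, images: 0 });
  });
});
```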

Scoring:

GPT‑5.3 Codex High: 8.7 (rank 1) – described as the most mature code agent

DeepSeek V4 Pro: 8.0 (rank 2) – a usable initial project but less stable in finalization

GPT‑5.3 strengths:

Explicitly stated requirement assumptions and implementation plan (requirements → initialization → module decomposition → error handling → testing → self‑review)

Modular project structure (scanner, analyzer, path handling, report generator, entry point)

Test coverage exceeded the minimum with ten Vitest tests

All npm test and npx tsc --noEmit checks passed

CLI error semantics matched the specification: missing directories produce JSON‑formatted warnings rather than crashes (sketched below)
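
A minimal sketch of what that warn‑don't‑crash contract could look like, assuming Node 18.17+ for the recursive readdir option; the warning shape and field names are assumptions, not GPT‑5.3's actual output.

```typescript
// Hypothetical scanner module (one slice of the scanner/analyzer/report split).
// Warning shape and field names are assumptions, not GPT-5.3's actual output.
import { readdir } from "node:fs/promises";

interface ScanResult {
  files: string[];
  warnings: { path: string; message: string }[];
}

// The recursive readdir option requires Node 18.17+.
export async function scanDirectory(dir: string): Promise<ScanResult> {
  try {
    const entries = await readdir(dir, { recursive: true });
    return {
      files: entries.filter((name) => name.endsWith(".md")),
      warnings: [],
    };
  } catch (err) {
    // Missing or unreadable directory: record a warning instead of crashing.
    const message = err instanceof Error ? err.message : String(err);
    return { files: [], warnings: [{ path: dir, message }] };
  }
}

// The entry point then prints warnings as JSON and exits normally:
// console.log(JSON.stringify({ warnings: result.warnings }, null, 2));
```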

GPT‑5.3 weaknesses:

Markdown parsing based on regular expressions instead of an AST

Custom word‑count assumptions without external justification

Limited cross‑platform failure testing

Coarse error handling during the scanning phase

DeepSeek V4 Pro weaknesses:

TypeScript compilation failed (npx tsc --noEmit) due to a missing @types/node dev dependency, causing type‑resolution errors for node:fs/promises and process (the conventional fix is shown after this list)

Error semantics used stderr + exit instead of the required JSON warning format

Insufficient tolerance for scanning‑phase failures

End‑to‑end CLI behavior tests were limited, focusing more on internal modules than on actual command‑line execution
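
For context on the compilation failure: the conventional fix is installing the Node type definitions as a dev dependency (and, if tsconfig.json pins a types array, listing node in it). This is standard TypeScript tooling practice, not a detail reported in the article; the snippet below shows the imports that then resolve under npx tsc --noEmit.

```typescript
// Conventional fix (standard tooling, not from the article):
//   npm install --save-dev @types/node
// If tsconfig.json restricts ambient types, include "node" explicitly:
//   { "compilerOptions": { "types": ["node"] } }
// With @types/node installed, these imports type-check under `npx tsc --noEmit`:
import { readFile } from "node:fs/promises";
import process from "node:process";

console.log(`Node ${process.version}; readFile is a ${typeof readFile}`);
```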

Combined Findings

Basic code generation and first‑answer correctness: DeepSeek V4 Pro performed better.

Engineering closure, test completeness, and delivery stability: GPT‑5.3 Codex High outperformed.

Overall ranking: GPT‑5.3 Codex High > DeepSeek V4 Pro, with the difference attributable to engineering finish quality rather than a fundamental capability gap.

Claude Code Context

Claude Code is a code agent capable of reading and writing project files, executing commands, running tests and builds, iteratively fixing issues, and maintaining goal consistency across multi‑step task chains. This capability explains why single‑answer correctness is insufficient: stable multi‑step progress determines the real development experience.
