DeepSeek‑R1 Upgrade: Does Its Coding Ability Match Claude 4? – In‑Depth Model Evaluation
The DeepSeek‑R1‑0528 model released on May 28 2025 shows major gains in coding, function calling and long‑text generation, with benchmark scores that surpass Qwen3‑235B and approach Claude 4 in programming; this review includes detailed hands‑on prompts and results.
On May 28 2025 DeepSeek announced a minor‑version upgrade to its R1 model, now labeled DeepSeek‑R1‑0528. The new version can be accessed through the official website, app or mini‑program. The release follows a wave of major AI model announcements, including Qwen3, Google Gemini 2.5 Pro, and Anthropic Claude 4, and continues DeepSeek's post‑training reinforcement‑learning pipeline.
Official report: DeepSeek states that R1‑0528 improves inference, coding, QA and long‑text writing. It adapts its reasoning‑chain length to the task, answering simple questions in a few steps while spending up to ~20 minutes on complex problems. The model adds native function‑calling and MCP capability, though it does not invoke functions within the reasoning chain. Benchmark figures show R1‑0528 outperforming Qwen3‑235B across all metrics and exceeding Gemini‑2.5‑Pro‑0506 in programming, while trailing OpenAI‑o3 slightly.
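For readers unfamiliar with what "native function calling" means in practice: the client declares a set of callable tools in the request and lets the model decide when to use them. The sketch below assembles such a request in the OpenAI‑compatible style DeepSeek's API follows; the `get_weather` tool and the exact model identifier are illustrative assumptions, not taken from official documentation.

```python
import json

def build_function_call_request(user_message: str) -> dict:
    """Assemble a chat-completion payload that declares one callable tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool name, for illustration
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
    return {
        "model": "deepseek-reasoner",  # assumed model identifier
        "messages": [{"role": "user", "content": user_message}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }

request = build_function_call_request("What's the weather in Beijing?")
print(json.dumps(request, ensure_ascii=False, indent=2))
```

The payload would be sent to the chat‑completions endpoint; when the model chooses to call the tool, the response carries the function name and JSON arguments instead of plain text.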
Community evaluation: Independent testers ran OpenCompass and EvalScope over more than 40 datasets (Math500, AIME25, GPQA, etc.). Results indicate that DeepSeek‑R1‑0528 approaches Claude 4 in coding performance, with a slight gap in overall scores.
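At their core, harness runs like these reduce each dataset to scoring model answers against references. A minimal exact‑match accuracy scorer in that spirit (the normalization and data format here are simplifying assumptions; real harnesses use per‑dataset extraction and matching rules) might look like:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching their reference after normalization."""
    def normalize(s: str) -> str:
        # Real harnesses apply dataset-specific answer extraction here.
        return s.strip().lower()

    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references) if references else 0.0

# Toy illustration on three math-style answers
preds = ["42", " 7 ", "3.14"]
refs = ["42", "7", "2.71"]
print(exact_match_accuracy(preds, refs))  # → 0.6666666666666666
```

Aggregating such per‑dataset scores over 40+ benchmarks is what produces the comparison tables the community published.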
Function‑calling performance: On the ifeval dataset R1‑0528 scored 0.8795, surpassing DeepSeek‑V3‑0324 and Qwen3‑235B and even edging out Claude 4, demonstrating stronger multi‑tool orchestration and accurate task decomposition.
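ifeval scores instruction following with programmatically verifiable constraints rather than a judge model. A toy checker in that spirit (the specific constraint types are invented for illustration; the real benchmark defines its own instruction taxonomy) could be:

```python
import re

def check_instructions(response: str, constraints: dict) -> bool:
    """Verify a response against simple, programmatically checkable constraints."""
    if "max_words" in constraints:
        if len(response.split()) > constraints["max_words"]:
            return False
    if "must_contain" in constraints:
        if constraints["must_contain"] not in response:
            return False
    if "bullet_count" in constraints:
        # Count lines that start with "- " as bullets.
        bullets = re.findall(r"^- ", response, flags=re.MULTILINE)
        if len(bullets) != constraints["bullet_count"]:
            return False
    return True

resp = "- fast\n- cheap\n- reliable"
print(check_instructions(resp, {"bullet_count": 3, "max_words": 10}))  # → True
```

Because every constraint is machine‑checkable, the 0.8795 figure reflects the fraction of instructions the model satisfied exactly, with no grading ambiguity.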
Practical test cases:
Prompt: "Create a responsive corporate website for ‘糖糖科技’" – the model generated thousands of lines of HTML/CSS/JS that ran without errors and matched the intended design.
Prompt: "Write a three.js simulation of the solar system" – the model produced a complete interactive 3D page, allowing camera view switches.
Prompt: "Build a complex visual effect page with a particle galaxy, black‑hole simulation, quantum entanglement and interstellar travel" – the model output over 1,200 lines of code, executed bug‑free and fulfilled the visual requirements.
MCP (Model Context Protocol) multi‑tool ability: By adapting Qwen‑Agent examples, the author replaced the Qwen API calls with DeepSeek‑R1‑0528 endpoints. The model seamlessly selected and invoked multiple tools across dozens of external functions, producing a coherent report after reasoning.
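Once the endpoint is swapped, the agent‑side plumbing reduces to routing each tool call the model emits to a local function. A minimal dispatch sketch (the tool names and the OpenAI‑style call format are illustrative assumptions, not the Qwen‑Agent internals) could be:

```python
import json

# Local tool implementations the agent exposes; names are hypothetical.
def search_docs(query: str) -> str:
    return f"top hit for '{query}'"

def run_calculator(expression: str) -> str:
    # Demo only: eval with stripped builtins is still unsafe in production.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"search_docs": search_docs, "run_calculator": run_calculator}

def dispatch(tool_call: dict) -> str:
    """Route one model-emitted tool call (OpenAI-style dict) to a local function."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name not in TOOLS:
        return f"error: unknown tool {name}"
    return TOOLS[name](**args)

# Simulated tool call, shaped as the model might emit it
call = {"function": {"name": "run_calculator",
                     "arguments": json.dumps({"expression": "6 * 7"})}}
print(dispatch(call))  # → 42
```

The dispatch result is appended to the conversation as a tool message, and the loop repeats until the model emits a final answer instead of another call.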
Text generation: R1‑0528 produces longer, more coherent outputs with better logical flow. In a long‑form report on the male facial‑mask market, it delivered a smoother narrative and richer content, reducing the typical "AI‑generated" feel.
Conclusion: DeepSeek‑R1‑0528 delivers strong performance across coding, function calling and text generation, rivaling top‑tier models such as Claude 4 and Gemini. Its steady technical progress suggests DeepSeek will remain a significant player in the evolving AI‑agent landscape.
Fun with Large Models
A master's graduate of Beijing Institute of Technology with four papers in top journals and a former developer at ByteDance and Alibaba, the author now researches large models at a major state‑owned enterprise, sharing concise, practical experience in large‑model development in the belief that AI large models will become as essential as the PC. Let's start experimenting now!
