How GPT‑5, DeepSeek‑V3.1 and SQLShift Stack Up in the August 2025 SQL LLM Benchmark

The August 2025 SCALE benchmark evaluates new AI models—including the GPT‑5 family, DeepSeek‑V3.1, and the SQLShift tool—across SQL understanding, optimization, and dialect conversion, revealing distinct strengths, weaknesses, and the growing advantage of specialized tools over generic large language models.

Aikesheng Open Source Community

1. Monthly Overview and Key Highlights

In August 2025, the SCALE benchmark continues to track the AI frontier, introducing several high‑profile new models and, for the first time, a dedicated dialect‑conversion tool—SQLShift—into the evaluation baseline, giving developers and decision‑makers a more comprehensive, practice‑oriented reference.

GPT‑5 series capability divergence: The new GPT‑5 family shows distinct characteristics. gpt‑5‑mini balances accuracy, reliability, and complex‑task handling, leading overall; gpt‑5‑nano excels as a code generator with extremely high syntax correctness; the flagship gpt‑5‑chat boasts rich theoretical knowledge but falls short on basic execution accuracy, reinforcing the view that “model value is defined by scenario.”

Domestic new‑model performance: DeepSeek‑V3.1 demonstrates balanced strength in SQL understanding, optimization, and dialect conversion, performing well when converting to domestic (Chinese) databases but still needing improvement on ultra‑long queries and deep optimization.

First evaluation of a professional tool: The specialized dialect‑conversion application SQLShift is evaluated for the first time, performing strongly in its core domain and opening a head‑to‑head contest between general LLMs and specialized tools.

2. Evaluation Criteria

To ensure fairness, depth, and reproducibility, the three‑dimensional SCALE evaluation framework is used, with all models and tools tested in a standardized environment.

SQL Understanding: assesses whether a model can accurately parse complex query logic and user intent.

SQL Optimization: evaluates a model’s awareness and ability to improve query efficiency and performance.

Dialect Conversion: tests a model’s accuracy in migrating syntax between mainstream databases.
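As a concrete illustration of what the dialect‑conversion dimension measures, the sketch below (a toy example, not part of the SCALE harness) rewrites two common MySQL‑isms into their PostgreSQL equivalents; real converters such as SQLShift must handle far more cases:

```python
import re

def mysql_to_postgres(sql: str) -> str:
    """Toy dialect converter: handles only two common MySQL-isms."""
    # Backtick-quoted identifiers become standard double quotes.
    sql = re.sub(r"`([^`]*)`", r'"\1"', sql)
    # MySQL's IFNULL(a, b) has the standard equivalent COALESCE(a, b).
    sql = re.sub(r"\bIFNULL\s*\(", "COALESCE(", sql, flags=re.IGNORECASE)
    return sql

print(mysql_to_postgres("SELECT `id`, IFNULL(`name`, 'n/a') FROM `users`"))
# SELECT "id", COALESCE("name", 'n/a') FROM "users"
```

Regex-based rewriting like this breaks down quickly on nested expressions and procedural code, which is exactly where the large‑SQL conversion scores in this report separate the contenders.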

3. New Models and Applications Added This Issue

The following models and applications were added to ensure timeliness and frontier relevance:

gpt‑5‑mini, gpt‑5‑nano, and gpt‑5‑chat (OpenAI’s GPT‑5 family)

DeepSeek‑V3.1

SQLShift (a dedicated dialect‑conversion tool)

4. Focus Analysis

Topic 1: GPT‑5 Series First Evaluation

GPT‑5 mini: Balanced Leader

Overall Evaluation: gpt‑5‑mini leads with balanced performance across dimensions, making it the top choice for enterprise‑grade stable output.

Strengths: High accuracy in SQL understanding (score 82.0) and strong dialect‑conversion error detection (score 92.9).

Weaknesses: Sub‑optimal optimization (score 66.2) and limited handling of long, complex queries (score 58.1); lacks deep understanding of programming paradigms.

GPT‑5 nano: High‑Precision Code Generator

Overall Evaluation: gpt‑5‑nano excels as an “SQL code generator,” suitable for automated workflows and standard text‑to‑SQL tasks.

Strengths: Perfect syntax‑error detection (100 pts) and solid logical conversion.

Weaknesses: Limited grasp of execution plans (score 35.7) and struggles with complex large‑SQL migrations (score 58.1); in particular, it misreads execution‑plan fields such as type, key, and filtered.
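For context on the plan fields named above: type, key, and filtered are columns of MySQL’s EXPLAIN output. A portable way to see what an execution plan looks like is SQLite’s EXPLAIN QUERY PLAN via Python’s built‑in sqlite3 module (an illustrative stand‑in, not the benchmark’s actual environment):

```python
import sqlite3

# In-memory database with an indexed column, to show what an
# execution plan reports (SQLite's EXPLAIN QUERY PLAN; MySQL's
# EXPLAIN exposes analogous fields such as type, key, and filtered).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("CREATE INDEX idx_customer ON orders(customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7"
).fetchall()
for row in plan:
    # The last column describes the step, e.g. a search USING INDEX idx_customer.
    print(row[-1])
```

Reading such output correctly (index used or not, estimated selectivity) is exactly the skill the execution‑plan sub‑score probes.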

GPT‑5 chat: Divergent Capabilities

Overall Evaluation: gpt‑5‑chat underperforms overall, showing strong theoretical knowledge but poor basic execution accuracy.

Strengths: High score in SQL optimization error detection (94.7).

Weaknesses: Low execution accuracy (57.1) and inability to correctly distinguish result types (select vs. table_state); poor handling of large‑SQL migrations (score 51.6) and of syntax incompatibilities such as ON CONFLICT support.
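ON CONFLICT is the upsert clause used by PostgreSQL and SQLite; MySQL expresses the same intent as INSERT ... ON DUPLICATE KEY UPDATE, which is why dialect conversions often stumble here. A minimal demonstration using Python’s built‑in sqlite3 (SQLite 3.24+ supports this syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v INTEGER)")

# PostgreSQL/SQLite-style upsert; MySQL would need this rewritten as
# INSERT ... ON DUPLICATE KEY UPDATE v = VALUES(v).
upsert = "INSERT INTO kv (k, v) VALUES (?, ?) ON CONFLICT(k) DO UPDATE SET v = excluded.v"
conn.execute(upsert, ("a", 1))
conn.execute(upsert, ("a", 2))  # conflicts on k='a', so v is updated to 2

print(conn.execute("SELECT v FROM kv WHERE k = 'a'").fetchone()[0])  # 2
```

A converter that merely copies the clause across dialects produces SQL that fails to parse on the target database, which is the kind of incompatibility the benchmark penalizes.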

Topic 2: DeepSeek‑V3.1 Evaluation

Overall Evaluation: DeepSeek‑V3.1 shows balanced overall strength with no obvious weak points, though there is still room for improvement.

Strengths: Robust syntax‑error detection and adherence to best practices; accurate in domestic database conversion scenarios.

Weaknesses: Struggles with ultra‑long, complex large‑SQL conversions (score 25.8) and lacks deep insight into execution plans and optimization.

Topic 3: Dialect‑Conversion Application – SQLShift

The specialized tool SQLShift focuses on dialect conversion, delivering high‑precision results for teams with demanding database migration needs.

Evaluation Dimension: Core “dialect conversion” capability.

Performance: Achieves a perfect 100 pts in domestic database conversion, demonstrating deep adaptation to the Chinese database ecosystem.

Large‑SQL Conversion: Scores 67.7, surpassing second‑place o4‑mini (61.3) and becoming the first entrant in this benchmark to maintain high logical consistency in this scenario.

5. Monthly Leaderboard Review

SQL Understanding

The top spot is held by Google’s Gemini series, with Gemini 2.5 Flash scoring 82.3. gpt‑5‑mini follows closely at 82.0, demonstrating strong execution accuracy.

SQL Optimization

Specialized optimizer SQLFlash leads with 88.5, while gpt‑5‑nano enters the top five thanks to its high syntax reliability.

Dialect Conversion

First‑time participant SQLShift tops the chart with 83.4, especially excelling in large‑SQL conversion; gpt‑5‑mini ranks second with 72.4.

6. Summary and Outlook

This month’s evaluation introduced the GPT‑5 family, DeepSeek‑V3.1, and the professional tool SQLShift, enriching the depth and breadth of the SCALE leaderboard. Results reaffirm that while general LLMs continue to improve, specialized tools retain decisive advantages in niche database‑dialect tasks, making them the optimal choice for performance‑critical enterprise applications.

Future plans include ongoing tracking of frontier models and tools, refining evaluation scenarios, and incorporating more real‑world enterprise use cases.

7. Expert Commentary

Yin Haiwen, database expert (Oracle ACE, PostgreSQL ACE) and author of the blog “胖头鱼的鱼缸” (“Fathead Fish’s Fish Tank”).

Compared with the first SCALE list, the biggest change this issue is the introduction of GPT‑5 and DeepSeek‑V3.1. Although they show progress in general capabilities, they have not overtaken the previous leaders in SQL understanding and optimization. Notably, GPT‑5 mini surpasses the prior benchmark in dialect conversion, and SQLShift demonstrates dominant performance as a dedicated tool, confirming that specialized solutions can excel alongside powerful LLMs.

“AI for DB” is entering a practical stage, requiring deep integration of AI with database knowledge to build intelligent management foundations.

Further reading: “My Ten Years in the Database Industry”

Thank you for following! We aim to provide assessments of core LLM SQL capabilities. Suggestions and requests for models to evaluate are welcome.

References

SCALE 202508: https://sql-llm-leaderboard.com/ranking/2025-08

SQLShift: https://sqlshift.cn/

SCALE: Choose professional AI models for professional SQL tasks.

Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains these projects.
