Databases · 15 min read

What the Latest SCALE Benchmark Shows About SQL Optimization in GLM‑4.7 and Seed‑OSS‑36B

The January 2026 SCALE benchmark adds an index‑suggestion metric and evaluates two new LLMs—Zhipu GLM‑4.7 and ByteDance Seed‑OSS‑36B—revealing strengths in dialect conversion, moderate SQL understanding, and notable gaps in complex execution‑plan analysis and practical index recommendations.

Aikesheng Open Source Community

Framework Update

The SCALE benchmark evaluates large language models (LLMs) on three core dimensions for professional SQL tasks: SQL Understanding, SQL Optimization, and Dialect Conversion. In the 2026‑01 release, a new Index‑Suggestion metric was added to the SQL Optimization dimension. This metric measures whether a model can propose concrete, cost‑effective index recommendations that improve query execution plans, rather than merely rewriting syntax.

SCALE Core Dimensions

SQL Understanding : depth of analysis of SQL intent, logical flow, and execution‑plan inference.

SQL Optimization : ability to transform inefficient queries into semantically equivalent, higher‑performance versions.

Dialect Conversion : accuracy of migrating SQL between different database dialects (e.g., Oracle → PostgreSQL, SQL Server → GaussDB).

Model Evaluation – Zhipu GLM‑4.7

GLM‑4.7 was released by Zhipu AI on 2025‑12‑23 and quickly topped the Hugging Face trending chart. The model is fine‑tuned for code‑generation scenarios and has 130 B parameters.

GLM‑4.7 benchmark chart

SQL Understanding

Score: 79.8. Execution accuracy and syntax‑error detection both reached 82.9, indicating strong logical consistency. However, the model struggled with complex execution‑plan cues. In queries combining LEFT JOIN, GROUP BY, and ORDER BY, GLM‑4.7 frequently mis‑identified join buffers (e.g., outputting Using join buffer (Block Nested Loop)) and omitted critical hints such as Using temporary; Using filesort. Row‑count estimation errors revealed a limited grasp of aggregation and of driving‑table selection in multi‑table joins.
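The execution‑plan cues in question can be reproduced on any engine. A minimal sketch using Python's built‑in sqlite3 as a portable stand‑in for MySQL's EXPLAIN (the schema and table names are hypothetical): where MySQL would report Using temporary; Using filesort, SQLite's EXPLAIN QUERY PLAN surfaces the analogous USE TEMP B‑TREE steps.

```python
import sqlite3

# Hypothetical schema: a users/orders pair joined, grouped, and sorted,
# the query shape the benchmark says GLM-4.7 mis-analyses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT u.name, SUM(o.amount) AS total
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY total DESC
""").fetchall()

# Each plan row's last field is the human-readable detail; with no index
# on users.name and an ORDER BY on an aggregate, SQLite materialises
# temp B-trees, the counterpart of MySQL's Using temporary/Using filesort.
for row in plan:
    print(row[3])
```

Reading the plan before and after a schema change is exactly the check a reviewer should apply to any model‑generated analysis of such queries.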

SQL Optimization & Index Suggestion

Overall optimization score: 59.6; index‑suggestion sub‑score: 58.1. The model can generate basic index advice that follows primary design principles (e.g., adding an index on columns used in WHERE clauses). In more nuanced scenarios (redundant‑index detection, low‑selectivity columns, or balancing maintenance cost), the suggestions become unstable or sub‑optimal.
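The basic principle credited here, indexing the columns referenced in a WHERE clause, can be verified mechanically. A sketch with Python's sqlite3 and a made‑up orders table; the same before/after EXPLAIN comparison applies to MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)")

query = "SELECT id, amount FROM orders WHERE status = 'paid'"

def plan_detail(sql):
    # Join all detail fields of the query plan into one string for inspection.
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan_detail(query)   # full table scan: "SCAN orders"
conn.execute("CREATE INDEX idx_orders_status ON orders(status)")
after = plan_detail(query)    # index lookup via idx_orders_status

print(before)
print(after)
```

The scan‑to‑search transition is the cheap, objective signal a DBA can demand from any model‑proposed index before accepting it.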

Dialect Conversion

Score for Chinese‑origin database migration: 89.5. This reflects GLM‑4.7’s traditional strength in handling domestic dialects. Conversely, fine‑grained syntax‑error detection scored only 50, exposing gaps in version‑specific syntax knowledge (e.g., OceanBase 4.2.5, GaussDB‑v2.0).

Model Evaluation – ByteDance Seed‑OSS‑36B

Seed‑OSS‑36B was open‑sourced by ByteDance on 2025‑08‑21. It uses causal language modeling, grouped‑query attention, and RoPE positional encoding, with 36 B parameters, a 512 k‑token context window, and a 150 k‑token vocabulary.

Seed‑OSS‑36B benchmark chart

SQL Understanding

Score: 55.2. Syntax‑error detection is high (88.6), showing the model can catch subtle dialect conflicts and misspellings. Execution accuracy, however, is low (48.6). For example, the model mis‑classifies the result type of a SELECT statement as table_state instead of select, and confuses SELECT with DML operations (INSERT/UPDATE/DELETE), indicating weak semantic categorisation of SQL statement types.
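For the simple cases, the statement‑type categorisation Seed‑OSS‑36B gets wrong is trivially rule‑based. An illustrative, deliberately naive baseline (it only looks at the leading keyword, so harder shapes such as WITH ... SELECT fall through to "other"):

```python
def classify_statement(sql: str) -> str:
    """Classify a SQL statement by its leading keyword (naive baseline)."""
    keyword = sql.lstrip().split(None, 1)[0].upper().rstrip(";")
    if keyword == "SELECT":
        return "select"   # read-only result set
    if keyword in ("INSERT", "UPDATE", "DELETE"):
        return "dml"      # data modification
    if keyword in ("CREATE", "ALTER", "DROP"):
        return "ddl"      # schema change
    return "other"        # CTEs, SET, EXPLAIN, etc. are not handled here

print(classify_statement("SELECT * FROM t"))
print(classify_statement("UPDATE t SET a = 1"))
```

That even this baseline separates SELECT from DML cleanly underlines how basic the reported confusion is.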

SQL Optimization & Index Suggestion

Overall optimization score: 55.3; index‑suggestion sub‑score: 53.8. Recommendations are overly conservative and often include unnecessary ID columns, fail to recognise redundant existing indexes, and miss implicit type‑conversion issues that render an index ineffective. The model lacks a global optimisation perspective and does not adequately analyse the context of existing indexes.
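The type‑conversion pitfall is easy to demonstrate. A sketch in Python's sqlite3 with a hypothetical users table: once the indexed column is wrapped in a conversion (an explicit CAST here; MySQL does the equivalent implicitly when a string column is compared to a numeric literal), the optimizer abandons the index and falls back to a full scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, phone TEXT)")
conn.execute("CREATE INDEX idx_users_phone ON users(phone)")

def plan(sql):
    # Flatten the query-plan detail fields into one string.
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Matching types: the index on phone is usable.
good = plan("SELECT id FROM users WHERE phone = '13800000000'")
# Conversion applied to the indexed column: index is unusable, full scan.
bad = plan("SELECT id FROM users WHERE CAST(phone AS INTEGER) = 13800000000")

print(good)
print(bad)
```

This is precisely the class of issue the benchmark reports Seed‑OSS‑36B failing to flag in its index recommendations.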

Dialect Conversion

Score: 55.0. The model performs well on short, standard SQL but degrades sharply on large‑SQL transformations (e.g., 19.4 on the “large SQL conversion” test). Notable failures include:

Oracle → PostgreSQL: incorrectly maps GET DIAGNOSTICS and PL/SQL constructs (TYPE ... IS RECORD, %NOTFOUND, SQL%ROWCOUNT) to PL/pgSQL.

SQL Server → GaussDB: retains the + string‑concatenation operator instead of the standard ||, and misplaces the target table in UPDATE ... FROM statements, leading to missing join conditions and Cartesian products.

These errors demonstrate insufficient handling of procedural language differences, operator mapping rules, and complex DML semantics.
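The concatenation failure above is silent rather than loud, which is what makes it dangerous. A small sketch using sqlite3, whose operator semantics match PostgreSQL/GaussDB on this point: carrying SQL Server's + into the target dialect raises no error, it just performs arithmetic on strings coerced to numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Standard SQL concatenation, as PostgreSQL/GaussDB expect.
concat = conn.execute("SELECT 'ab' || 'cd'").fetchone()[0]

# SQL Server's '+' carried over verbatim: '+' is arithmetic here, so both
# strings coerce to 0 and the "concatenation" quietly evaluates to 0.
plus = conn.execute("SELECT 'ab' + 'cd'").fetchone()[0]

print(concat, plus)
```

A migration that leaves + in place therefore passes syntax checks and corrupts results, which is why operator mapping deserves an explicit test pass rather than trust in the converter.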

Practical Guidance for Using LLMs in SQL Workflows

Always validate model‑generated index suggestions with an EXPLAIN plan and adjust based on actual cost estimates.

For complex stored procedures or multi‑step queries, split the logic into smaller, well‑defined chunks before feeding them to the model ("logical chunking").

Treat model output as a first‑draft reference; a DBA should review and fine‑tune before production deployment.

Combine syntax‑checking capabilities (high in Seed‑OSS‑36B) with execution‑cost optimisation (stronger in GLM‑4.7) to cover both correctness and performance.

Observed Trends

The benchmark shift from pure syntactic correctness to combined syntax + execution‑cost optimisation highlights the growing importance of physical‑structure advice (e.g., index design).

Domestic Chinese database migration capabilities are maturing rapidly; models now achieve near‑human scores on Chinese‑origin dialect conversion.

References

2026‑01 ranking: https://sql-llm-leaderboard.com/ranking/2026-01

GLM‑4.7 release notes: https://z.ai/blog/glm-4.7

GLM‑4.7 Hugging Face repository: https://huggingface.co/zai-org/GLM-4.7

Seed‑OSS‑36B Hugging Face repository: https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base

GitHub benchmark code: https://github.com/actiontech/sql-llm-benchmark

Tags: SQL · database optimization · AI benchmarking · LLM evaluation · dialect conversion · index suggestion
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year on 1024 (Programmers’ Day), and continuously operates and maintains them.
