Which LLM Generates the Best SQL? A 19‑Model Benchmark on a 200M‑Row GitHub Dataset
This article presents a comprehensive benchmark of 19 large language models (plus a human baseline) on generating analytical SQL queries over a 200 million‑row GitHub events dataset, detailing the methodology, metrics, results, and practical guidance for using LLMs in data analysis.
Why the Benchmark?
Tinybird relies on LLMs to power its analytics UI, so the quality of generated SQL directly impacts user experience. The benchmark was created to continuously evaluate model performance on real‑world, noisy data; catch regressions; understand trade‑offs among accuracy, latency, and cost; and guide prompting and model selection.
Dataset: GitHub Archive
The source data is the public GitHub Archive – a collection of every GitHub event since 2011, amounting to billions of rows. For the benchmark a random 200 million‑row subset was materialized as a single ClickHouse MergeTree table in Tinybird, containing mixed types such as enums, low‑cardinality strings, timestamps, and arrays.
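As a rough sketch, the table could resemble the DDL below. The column names, types, and sorting key are assumptions for illustration only, not the actual Tinybird schema:

```typescript
// Hypothetical DDL for the benchmark table, held as a string the way a
// benchmark harness might. Column names, types, and the sorting key are
// illustrative assumptions, not the real schema.
const createTableSql = `
  CREATE TABLE github_events (
      event_type  LowCardinality(String),  -- e.g. 'WatchEvent', 'PushEvent'
      repo_name   String,
      actor_login String,
      created_at  DateTime,
      labels      Array(String)
  )
  ENGINE = MergeTree
  ORDER BY (event_type, repo_name, created_at)
`;
```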
SQL Questions
Fifty natural‑language prompts were derived from a list published by the ClickHouse maintainers, e.g., “Top 10 repositories by stars in the past 7 days”. Each model had to parse the request and produce syntactically correct SQL, which was then executed against the table and checked for the correct result set.
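For illustration, one plausible answer to that prompt is sketched below, written against the hypothetical schema above (in GitHub Archive data, a star is recorded as a WatchEvent):

```typescript
// One plausible model answer for "Top 10 repositories by stars in the past
// 7 days". Table and column names follow the hypothetical schema above.
const expectedSql = `
  SELECT repo_name, count() AS stars
  FROM github_events
  WHERE event_type = 'WatchEvent'            -- stars arrive as WatchEvents
    AND created_at >= now() - INTERVAL 7 DAY -- rolling 7-day window
  GROUP BY repo_name
  ORDER BY stars DESC
  LIMIT 10
`;
```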
System Prompt
Do not fabricate fields; queries referencing non‑existent columns are invalid.
Avoid SELECT * unless explicitly requested.
Do not add ORDER BY or LIMIT clauses unless the question calls for them.
Return only the SQL statement, no explanations.
Optimize for minimal rows_read.
The full table schema is provided in the prompt.
A list of available fields and compatible functions is also included.
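A minimal sketch of how such a system prompt might be assembled, assuming hypothetical helper and parameter names; the rule wording mirrors the list above, not the benchmark's exact prompt:

```typescript
// Illustrative assembly of a benchmark-style system prompt. The function and
// parameter names are assumptions; only the rules come from the list above.
function buildSystemPrompt(schemaDdl: string, functionList: string[]): string {
  const rules = [
    'Do not fabricate fields; queries referencing non-existent columns are invalid.',
    'Avoid SELECT * unless explicitly requested.',
    'Do not add ORDER BY or LIMIT unless the question calls for them.',
    'Return only the SQL statement, no explanations.',
    'Optimize for minimal rows_read.',
  ];
  return [
    'You generate ClickHouse SQL for the table described below.',
    `Schema:\n${schemaDdl}`,
    `Available functions: ${functionList.join(', ')}`,
    `Rules:\n- ${rules.join('\n- ')}`,
  ].join('\n\n');
}
```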
Models Tested
Claude 3.5 & 3.7 Sonnet (Anthropic)
Gemini 1.5 Pro (Google)
GPT‑4 Turbo (2024‑03) (OpenAI)
LLaMA 3 70B / 8B (Meta)
Mistral 7B & Mixtral‑1 (Mistral)
DeepSeek Chat v3 (paid and free)
Additional models will be added as new releases appear.
Performance Metrics
Effective Query Rate: % of prompts that yielded a valid, executable SQL query.
First‑Try Success Rate: % of prompts that succeeded on the first generation attempt.
Execution Time: query latency in milliseconds.
LLM Generation Time: time the model spent producing the response.
Attempts: number of retries needed to obtain a valid query.
Rows Read: rows scanned per execution.
Data Read: bytes scanned per execution.
Query Length: token count of the generated SQL.
Derived scores combine correctness and efficiency:
Efficiency Score: accounts for generation speed, retry count, and query execution efficiency (rows read, latency).
Accuracy Score: measures semantic similarity to the human‑written reference using Jaccard distance, numeric RMSE, and F‑score.
Overall Score: the average of the efficiency and accuracy scores.
The raw efficiency calculation is implemented in the benchmark repository (see result-validator.ts on GitHub).
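The exact weights live in that file; the sketch below only illustrates the idea, with field names and weightings that are assumptions rather than the repository's actual logic:

```typescript
// Raw per-run measurements collected by the benchmark (field names assumed).
interface RunMetrics {
  generationTimeMs: number; // time the model spent producing the SQL
  executionTimeMs: number;  // query latency against ClickHouse
  attempts: number;         // retries needed to obtain a valid query
  rowsRead: number;         // rows scanned by the query
}

// Illustrative efficiency score: penalize slow generation, retries, and
// heavy scans relative to a human-written reference run. The real weighting
// in result-validator.ts may differ.
function efficiencyScore(run: RunMetrics, human: RunMetrics): number {
  const genPenalty = Math.min(run.generationTimeMs / 10_000, 1); // cap at 10 s
  const retryPenalty = Math.min((run.attempts - 1) * 0.2, 1);    // 20% per retry
  const scanPenalty = Math.min(run.rowsRead / (2 * human.rowsRead), 1);
  return 100 * (1 - (genPenalty + retryPenalty + scanPenalty) / 3);
}

// Overall score, as described above: the average of efficiency and accuracy.
function overallScore(efficiency: number, accuracy: number): number {
  return (efficiency + accuracy) / 2;
}
```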
Key Findings
Claude leads overall (but is slower)
Claude 3.7 ranks first overall; Claude 3.5 is third.
Both achieve a 100 % effective query rate and > 90 % first‑try success.
Average accuracy scores around 52‑56.
Generation latency ≈ 3.2 s, rows read ≈ 37‑40 million per query.
OpenAI o3/o4 models are strong overall
o3‑mini ranks second; o4‑mini is sixth.
100 % effective queries, ~88‑92 % first‑try success.
Accuracy ≈ 51‑55, latency < 700 ms.
Rows read ≈ 49‑52 million per query.
LLaMA 3 performs poorly, LLaMA 4 recovers
LLaMA 3 70B is last (19th) with 66 % effective queries and 35.6 accuracy.
LLaMA 4 Maverick and Scout improve to 96‑100 % effectiveness; Scout reads the least data (≈ 129 MB/query) with accuracy 44‑48.
Gemini falls behind due to latency
Gemini 2.5 Pro reaches an accuracy score of 91.8 but needs ~40 s per query.
Flash versions are faster but have lower accuracy scores (~40‑42).
Human‑written queries remain the most efficient
Humans read ≈ 31 million rows and 760 MB per query.
No model matches this cost‑effectiveness; most models read 1.5‑2× more rows.
Guidance for Practitioners
Inject schema awareness: supply table structures, relationships, and data types in the prompt.
Implement query validation: check generated SQL for syntax errors, performance pitfalls, and security issues before execution (a sketch follows this list).
Balance penalties: weight efficiency, accuracy, and generation time according to your use‑case.
Close the feedback loop: capture results and error messages to iteratively improve prompts.
Use specialized models per stage: separate models for query generation, visualization suggestion, and result explanation.
Enforce structured output: require the model to return responses that conform to a predefined JSON schema (also sketched below).
Self‑evaluation checks: have the model assess its own output against defined criteria before returning it.
Build context‑aware prompt templates: include user history, prior queries, and domain constraints.
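As a concrete sketch of the query‑validation and structured‑output items above; the response shape, checks, and names are illustrative assumptions, not the benchmark's code or a production validator:

```typescript
// Hedged sketch of pre-execution validation plus structured-output parsing.
interface ModelResponse {
  sql: string;
  explanation?: string; // optional, for a separate result-explanation stage
}

// Enforce structured output: the model is instructed to answer with JSON
// matching this shape; anything else is rejected before execution.
function parseStructuredResponse(raw: string): ModelResponse {
  const parsed = JSON.parse(raw);
  if (typeof parsed.sql !== 'string' || parsed.sql.trim() === '') {
    throw new Error('Response does not match the expected JSON schema');
  }
  return parsed as ModelResponse;
}

// Cheap pre-execution checks for the rules in the system prompt. A real
// validator would parse the SQL into an AST and resolve every column
// against the schema instead of relying on regexes.
function validateSql(sql: string): string[] {
  const problems: string[] = [];
  if (/select\s+\*/i.test(sql)) problems.push('uses SELECT *');
  if (/;\s*\S/.test(sql)) problems.push('contains multiple statements');
  if (/\b(insert|update|delete|drop|alter|truncate)\b/i.test(sql)) {
    problems.push('is not read-only');
  }
  return problems;
}
```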
Even with strong LLMs, SQL remains a necessary skill; understanding model limitations and maintaining human oversight are essential for reliable analytics.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, and each year on 1024 (October 24, Programmer's Day) it releases a premium open‑source component that it then continuously operates and maintains.