7 Reusable DuckDB SQL Patterns for Fast Local Data Analysis
This article presents seven practical DuckDB SQL patterns—querying files directly, treating partition folders as tables, deduplicating with QUALIFY, computing rolling metrics with window functions, pivot/unpivot, handling JSON arrays, and exporting results to Parquet—plus tips and a mini case study that show how to turn a notebook into a lightweight OLAP engine without leaving the Python environment.
DuckDB can act as a small OLAP engine on a laptop, letting you run powerful SQL queries without leaving the Python environment. The author shares seven reusable SQL patterns that are concise, efficient, and require little configuration.
Core workflow
Think of the analysis pipeline as File → DuckDB SQL → Small result set → Python processing. By querying Parquet/CSV/JSON files directly, you filter and aggregate early, then import only the trimmed result into a DataFrame, keeping the notebook responsive.
Pattern 1 – Query files like tables (push‑down)
Directly query local files so DuckDB performs column projection and row filtering before data reaches Python.
SELECT user_id, SUM(amount) AS total_spend
FROM read_parquet('data/transactions/*.parquet')
WHERE tx_date BETWEEN DATE '2025-01-01' AND DATE '2025-03-31'
AND country = 'CN'
GROUP BY user_id
ORDER BY total_spend DESC
LIMIT 20;
Core value: read_parquet + WHERE let DuckDB read only the necessary row groups and columns—warehouse‑level behavior on a local disk. The same technique works with read_csv_auto() and read_json_auto().
Python bridge (returning a compact result):
import duckdb, pandas as pd
q = """
SELECT user_id, SUM(amount) AS total_spend
FROM read_parquet('data/transactions/*.parquet')
WHERE tx_date >= DATE '2025-01-01' AND tx_date < DATE '2025-04-01'
GROUP BY user_id ORDER BY total_spend DESC LIMIT 20
"""
df = duckdb.query(q).to_df()  # compact, tidy data ready for plotting
Pattern 2 – Treat partition folders as tables (Hive partitioning)
Automatically turn directory names like country=CN/yyyymm=202501/ into columns.
SELECT country, yyyymm, COUNT(*) AS n, SUM(amount) AS total_amount
FROM read_parquet('data/country=*/yyyymm=*/part-*.parquet', hive_partitioning=1)
WHERE yyyymm BETWEEN '202501' AND '202503'
GROUP BY country, yyyymm
ORDER BY yyyymm, country;
Core value: No metadata store is needed; fast, tidy analysis of event logs or upstream tool exports.
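The same folder layout can also be produced locally with DuckDB's partitioned COPY. A minimal sketch, assuming the transaction files from Pattern 1 contain country and tx_date columns (paths here are hypothetical):
import duckdb

# Write one country=<...>/yyyymm=<...>/ directory per combination of partition values.
# DuckDB names the output files data_*.parquet, so adjust the read glob accordingly.
duckdb.execute("""
    COPY (
        SELECT *, strftime(tx_date, '%Y%m') AS yyyymm
        FROM read_parquet('data/transactions/*.parquet')
    ) TO 'data/partitioned' (FORMAT PARQUET, PARTITION_BY (country, yyyymm))
""")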
Pattern 3 – Keep the latest record per primary key (QUALIFY)
Get the newest record for each entity without a nested sub‑query.
WITH profiles AS (
SELECT * FROM read_parquet('data/user_profiles/*.parquet')
)
SELECT *
FROM profiles
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) = 1;
Core value: QUALIFY lets you filter on window‑function results directly, which simplifies working with CDC files, incremental dumps, and messy exports.
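The same pattern works on data that is already in memory, since DuckDB can query pandas DataFrames by variable name. A minimal sketch with toy data (column names are illustrative):
import duckdb
import pandas as pd

# Toy profile table with two versions of user 1; QUALIFY keeps only the newest row per user.
profiles = pd.DataFrame({
    "user_id":    [1, 1, 2],
    "updated_at": pd.to_datetime(["2025-01-01", "2025-02-01", "2025-01-15"]),
    "plan":       ["free", "pro", "free"],
})

latest = duckdb.sql("""
    SELECT *
    FROM profiles
    QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) = 1
""").to_df()
print(latest)  # one row per user_id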
Pattern 4 – In‑memory rolling metric calculation
Perform time‑series calculations in SQL instead of Python loops.
WITH s AS (
SELECT tx_date::DATE AS d, SUM(amount) AS daily_rev
FROM read_parquet('data/transactions/*.parquet')
GROUP BY 1
)
SELECT d,
daily_rev,
SUM(daily_rev) OVER (ORDER BY d RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW) AS rev_7d,
(daily_rev - LAG(daily_rev, 7) OVER (ORDER BY d)) AS week_delta
FROM s
ORDER BY d;
Core value: Keeps Python focused on visualization; window functions process the data in a streaming fashion with minimal memory.
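On the Python side, only the small daily series crosses into pandas, so plotting stays cheap. A minimal sketch of the bridge (the file path is hypothetical and matplotlib is assumed to be installed):
import duckdb
import pandas as pd
import matplotlib.pyplot as plt

# Pull just the date and rolling-revenue columns, then hand them to matplotlib.
daily = duckdb.sql("""
    WITH s AS (
        SELECT tx_date::DATE AS d, SUM(amount) AS daily_rev
        FROM read_parquet('data/transactions/*.parquet')
        GROUP BY 1
    )
    SELECT d,
           SUM(daily_rev) OVER (
               ORDER BY d RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
           ) AS rev_7d
    FROM s ORDER BY d
""").to_df()

daily["d"] = pd.to_datetime(daily["d"])
daily.plot(x="d", y="rev_7d")
plt.show()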
Pattern 5 – Easy pivot / unpivot
Reshape data for dashboards with a single statement.
WITH daily AS (
SELECT DATE_TRUNC('day', ts) AS d, category, COUNT(*) AS events
FROM read_parquet('data/events/*.parquet')
GROUP BY 1,2
)
PIVOT daily ON category USING SUM(events) GROUP BY d ORDER BY d;
UNPIVOT read_parquet('data/agg/daily_by_category.parquet')
ON COLUMNS(* EXCLUDE d) INTO NAME category VALUE events;
Core value: No manual joins or fragile Pandas reshaping; one SQL statement produces the desired shape.
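To see the shape PIVOT produces without touching real files, here is a minimal sketch on a toy in-memory table (the table and column values are made up for illustration):
import duckdb
import pandas as pd

# Long-format toy data: one row per (day, category).
daily = pd.DataFrame({
    "d":        ["2025-01-01", "2025-01-01", "2025-01-02"],
    "category": ["search", "click", "search"],
    "events":   [10, 4, 7],
})

# Materialize it as a DuckDB table, then pivot: one row per d, one column per category.
duckdb.sql("CREATE OR REPLACE TABLE daily_t AS SELECT * FROM daily")
wide = duckdb.sql("PIVOT daily_t ON category USING SUM(events) GROUP BY d").to_df()
print(wide)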
Pattern 6 – JSON and list handling: explode, tidy, rebuild
Turn semi‑structured log fields into tidy rows.
WITH orders AS (
SELECT * FROM read_json_auto('data/orders_2025.json')
),
items AS (
SELECT o.order_id,
i->>'sku' AS sku,
CAST(i->>'qty' AS INTEGER) AS qty,
CAST(i->>'price' AS DOUBLE) AS price
FROM orders o, UNNEST(o.items) AS t(i)
)
SELECT sku, SUM(qty) AS units, SUM(qty*price) AS revenue
FROM items
GROUP BY sku
ORDER BY revenue DESC;
Then rebuild clean JSON for downstream tools:
SELECT order_id,
to_json(struct_pack(
total_items := SUM(qty),
total_price := SUM(qty*price)
)) AS order_summary_json
FROM items
GROUP BY order_id;
Core value: UNNEST converts nested arrays to aggregatable rows; struct_pack / to_json produce lightweight API payloads.
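If the UNNEST step feels abstract, this minimal sketch with inline toy data shows the same move: a list of structs becomes one tidy row per item (values are invented for illustration):
import duckdb

# recursive := true unnests the list and then flattens each struct into columns.
items = duckdb.sql("""
    WITH orders AS (
        SELECT 1 AS order_id,
               [{'sku': 'A', 'qty': 2, 'price': 9.5},
                {'sku': 'B', 'qty': 1, 'price': 4.0}] AS items
    )
    SELECT order_id, UNNEST(items, recursive := true)
    FROM orders
""").to_df()
print(items)  # columns: order_id, sku, qty, price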
Pattern 7 – Export a clean slice to Parquet for hand‑off
Persist a filtered, aggregated result as a single compressed Parquet file.
COPY (
SELECT user_id,
SUM(amount) AS total_spend,
COUNT(*) AS tx_count
FROM read_parquet('data/transactions/*.parquet')
WHERE tx_date >= DATE '2025-01-01' AND tx_date < DATE '2025-04-01'
GROUP BY user_id
) TO 'out/spend_2025_q1.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 128000);
Core value: A single compressed Parquet file is easy to share or reload, avoiding repeated global calculations.
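Reloading the hand-off file later is a one-liner; a minimal sketch (the path matches the COPY above, and pandas' read_parquet works just as well):
import duckdb

# Query the exported slice directly; nothing is recomputed from the raw transactions.
top = duckdb.sql("""
    SELECT * FROM 'out/spend_2025_q1.parquet'
    ORDER BY total_spend DESC
    LIMIT 10
""").to_df()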
Additional tips (often overlooked)
Prefer efficient formats: Parquet > CSV for repeated reads; use COPY (SELECT …) TO 'x.parquet' for one‑shot conversion.
Filter early, extract late: Return only the small result set needed for plotting.
Stabilize schemas: When reading messy JSON, explicitly CAST(... AS DOUBLE) before persisting to Parquet.
Ensure deterministic ordering: Always pair LIMIT with ORDER BY for reproducible top‑N lists.
Build repeatable notebooks: Wrap SQL in small Python functions so a single click re‑runs the whole workflow (see the sketch after this list).
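A minimal sketch of that last tip, reusing the Pattern 1 query (paths and dates are placeholders):
import duckdb

def top_spenders(path, start, end, limit=20):
    # Parameterize the file glob and date window so the whole pull is one repeatable call.
    query = f"""
        SELECT user_id, SUM(amount) AS total_spend
        FROM read_parquet('{path}')
        WHERE tx_date >= DATE '{start}' AND tx_date < DATE '{end}'
        GROUP BY user_id
        ORDER BY total_spend DESC
        LIMIT {limit}
    """
    return duckdb.sql(query).to_df()

df = top_spenders('data/transactions/*.parquet', '2025-01-01', '2025-04-01')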
Mini case study (real‑world scenario)
A growth team analyzing a purchase funnel had large CSV dumps that made Pandas sluggish. They switched to Pattern 1 and Pattern 3:
Directly query the Parquet files (they had already converted CSV to Parquet).
Use QUALIFY to keep only the latest customer state.
Result: On a MacBook the end‑to‑end funnel table was generated in about 5 seconds instead of minutes in the cloud, enabling rapid chart updates and doubling iteration speed. No data‑warehouse tickets, Airflow jobs, or waiting were required.
Conclusion
The fastest analysis is the one you can run immediately. DuckDB’s strengths lie in predicate push‑down, tidy reshaping, easy window functions, and seamless integration with Python for visualization or modeling.
Reuse these patterns, adapt them to your data, and let DuckDB turn your notebook into a surprisingly powerful analytical engine.