
7 Reusable DuckDB SQL Patterns for Fast Local Data Analysis

This article presents seven practical DuckDB SQL patterns—querying files directly, treating partition folders as tables, deduplicating with QUALIFY, computing rolling metrics with window functions, pivot/unpivot, handling JSON arrays, and exporting results to Parquet—plus tips and a mini case study that show how to turn a notebook into a lightweight OLAP engine without leaving the Python environment.


DuckDB can act as a small OLAP engine on a laptop, letting you run powerful SQL queries without leaving the Python environment. The author shares seven reusable SQL patterns that are concise, efficient, and require little configuration.

Core workflow

Think of the analysis pipeline as File → DuckDB SQL → Small result set → Python processing. By querying Parquet/CSV/JSON files directly, you filter and aggregate early, then import only the trimmed result into a DataFrame, keeping the notebook responsive.

Pattern 1 – Query files like tables (push‑down)

Directly query local files so DuckDB performs column projection and row filtering before data reaches Python.
SELECT user_id, SUM(amount) AS total_spend
FROM read_parquet('data/transactions/*.parquet')
WHERE tx_date BETWEEN DATE '2025-01-01' AND DATE '2025-03-31'
  AND country = 'CN'
GROUP BY user_id
ORDER BY total_spend DESC
LIMIT 20;

Core value: read_parquet + WHERE let DuckDB read only the necessary row groups and columns—warehouse‑level behavior on a local disk. The same technique works with read_csv_auto() and read_json_auto().

Python bridge (returning a compact result):

import duckdb, pandas as pd
q = """
SELECT user_id, SUM(amount) AS total_spend
FROM read_parquet('data/transactions/*.parquet')
WHERE tx_date >= DATE '2025-01-01' AND tx_date < DATE '2025-04-01'
GROUP BY user_id ORDER BY total_spend DESC LIMIT 20
"""
df = duckdb.query(q).to_df()  # compact, tidy data ready for plotting

Pattern 2 – Treat partition folders as tables (Hive partitioning)

Automatically turn directory names like country=CN/yyyymm=202501/ into columns.
SELECT country, yyyymm, COUNT(*) AS n, SUM(amount) AS total_amount
FROM read_parquet('data/country=*/yyyymm=*/part-*.parquet', hive_partitioning=1)
WHERE yyyymm BETWEEN '202501' AND '202503'
GROUP BY country, yyyymm
ORDER BY yyyymm, country;

Core value: No metadata store is needed; fast, tidy analysis of event logs or upstream tool exports.

Pattern 3 – Keep the latest record per primary key (QUALIFY)

Get the newest record for each entity without a nested sub‑query.
WITH profiles AS (
  SELECT * FROM read_parquet('data/user_profiles/*.parquet')
)
SELECT *
FROM profiles
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) = 1;

Core value: QUALIFY lets you filter on window‑function results directly, simplifying CDC files, incremental dumps, and messy exports.

Pattern 4 – In‑memory rolling metric calculation

Perform time‑series calculations in SQL instead of Python loops.
WITH s AS (
  SELECT tx_date::DATE AS d, SUM(amount) AS daily_rev
  FROM read_parquet('data/transactions/*.parquet')
  GROUP BY 1
)
SELECT d,
       daily_rev,
       SUM(daily_rev) OVER (ORDER BY d RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW) AS rev_7d,
       (daily_rev - LAG(daily_rev, 7) OVER (ORDER BY d)) AS week_delta
FROM s
ORDER BY d;

Core value: Keeps Python focused on visualization; window functions stream‑process with minimal memory.

Pattern 5 – Easy pivot / unpivot

Reshape data for dashboards with a single statement.
WITH daily AS (
  SELECT DATE_TRUNC('day', ts) AS d, category, COUNT(*) AS events
  FROM read_parquet('data/events/*.parquet')
  GROUP BY 1,2
)
PIVOT daily ON category USING SUM(events) GROUP BY d ORDER BY d;

UNPIVOT read_parquet('data/agg/daily_by_category.parquet')
ON COLUMNS(* EXCLUDE d) INTO NAME category VALUE events;

Core value: No manual joins or fragile Pandas reshaping; one SQL line produces the desired shape.

Pattern 6 – JSON and list handling: explode, tidy, rebuild

Turn semi‑structured log fields into tidy rows.
WITH orders AS (
  SELECT * FROM read_json_auto('data/orders_2025.json')
),
items AS (
  SELECT o.order_id,
         i->>'sku' AS sku,
         CAST(i->>'qty' AS INTEGER) AS qty,
         CAST(i->>'price' AS DOUBLE) AS price
  FROM orders o, UNNEST(o.items) AS t(i)
)
SELECT sku, SUM(qty) AS units, SUM(qty*price) AS revenue
FROM items
GROUP BY sku
ORDER BY revenue DESC;

Then rebuild clean JSON for downstream tools:

SELECT order_id,
       to_json(struct_pack(
         total_items := SUM(qty),
         total_price := SUM(qty*price)
       )) AS order_summary_json
FROM items
GROUP BY order_id;

Core value: UNNEST converts nested arrays to aggregatable rows; struct_pack / to_json produce lightweight API payloads.

Pattern 7 – Export a clean slice to Parquet for hand‑off

Persist a filtered, aggregated result as a single compressed Parquet file.
COPY (
  SELECT user_id,
         SUM(amount) AS total_spend,
         COUNT(*) AS tx_count
  FROM read_parquet('data/transactions/*.parquet')
  WHERE tx_date >= DATE '2025-01-01'
  GROUP BY user_id
) TO 'out/spend_2025_q1.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 128000);

Core value: A single compressed Parquet file is easy to share or reload, avoiding repeated global calculations.

Additional tips (often overlooked)

Prefer efficient formats: Parquet > CSV for repeated reads; use COPY (SELECT …) TO 'x.parquet' for one‑shot conversion.

Filter early, extract late: Return only the small result set needed for plotting.

Stabilize schemas: When reading messy JSON, explicitly CAST(... AS DOUBLE) before persisting to Parquet.

Ensure deterministic ordering: Always pair LIMIT with ORDER BY for reproducible top‑N lists.

Build repeatable notebooks: Wrap SQL in small Python functions so a single click re‑runs the whole workflow.

Mini case study (real‑world scenario)

A growth team analyzing a purchase funnel had large CSV dumps that made Pandas sluggish. They switched to Pattern 1 and Pattern 3:

Directly query the Parquet files (they had already converted CSV to Parquet).

Use QUALIFY to keep only the latest customer state.

Result: On a MacBook the end‑to‑end funnel table was generated in about 5 seconds instead of minutes in the cloud, enabling rapid chart updates and doubling iteration speed. No data‑warehouse tickets, Airflow jobs, or waiting were required.

Conclusion

The fastest analysis is the one you can run immediately. DuckDB’s strengths lie in predicate push‑down, tidy reshaping, easy window functions, and seamless integration with Python for visualization or modeling.

Reuse these patterns, adapt them to your data, and let DuckDB turn your notebook into a surprisingly powerful analytical engine.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, SQL, data analysis, JSON, Window Functions, Parquet, Pivot, DuckDB
Written by

Data STUDIO

Data STUDIO focuses on original data science articles centered on Python, covering machine learning, data analysis, visualization, MySQL, and other practical topics and project case studies.
