How to Write Elegant, High‑Performance SQL for Big Data Pipelines
This article shares practical techniques for writing clean, efficient SQL in large‑scale data environments, covering predicate pushdown, sub‑queries, deduplication strategies, bucket optimization, and automation with Python‑Spark integration to improve readability and execution speed.
Background
During four years of data development the author observed that some SQL scripts are concise and performant, while others span hundreds of lines, contain repetitive sub‑queries and scattered filter conditions, making them hard to read and maintain.
Methods and Techniques
1. Sub‑queries and predicate push‑down
When filter predicates are placed in the WHERE clause after a full join, the engine cannot push them down to the source tables, leading to full scans and possible data skew. Moving the predicates into sub‑queries allows early row pruning.
// Bad code example
SELECT f1.pin,
c1.site_id,
c2.site_name
FROM fdm.fdm1 AS f1
LEFT JOIN cdm.cdm1 AS c1 ON f1.erp = LOWER(c1.account_number)
LEFT JOIN cdm.cdm2 AS c2 ON c1.site_id = c2.site_code
WHERE f1.start_date <= '2024-01-01'
AND f1.end_date > '2024-01-01'
AND f1.status = 1
AND c1.dt = '2024-01-01'
AND c2.yn = 1
GROUP BY f1.pin, c1.site_id, c2.site_name;Issues:
Filters are applied after the join, preventing push‑down.
NULL‑supplement tables cause full scans and possible skew.
// Good code example
SELECT f1.pin,
c1.site_id,
c2.site_name
FROM (SELECT erp, pin FROM fdm.fdm1 WHERE dp = 'ACTIVE' AND status = 1) f1
LEFT JOIN (SELECT site_id, LOWER(account_number) AS account_number FROM cdm.cdm1 WHERE dt = '2024-01-01') c1
ON f1.erp = c1.account_number
LEFT JOIN (SELECT site_code, site_name FROM cdm.cdm2 WHERE yn = 1) c2
ON c1.site_id = c2.site_code
GROUP BY f1.pin, c1.site_id, c2.site_name;Moving predicates into each sub‑query filters rows early, improves readability and reduces data processed.
2. Deduplication techniques
Common approaches are GROUP BY, DISTINCT, window function ROW_NUMBER() and COLLECT_SET. On a 2‑billion‑row table the following test shows the performance difference:
// Using DISTINCT (faster)
SELECT COUNT(DISTINCT ulp_base_age)
FROM app.app1
WHERE dt = SYSDATE(-1);
// Using GROUP BY (slower)
SELECT COUNT(ulp_base_age)
FROM (SELECT ulp_base_age FROM app.app1 WHERE dt = SYSDATE(-1) GROUP BY ulp_base_age) t;Result: DISTINCT finishes in ~3 minutes, while the GROUP BY version takes ~5 minutes because the latter incurs an extra shuffle and disk I/O.
3. Maximizing bucket utilization
Aggregating a large click‑stream table with COUNT(DISTINCT pin) per day sends all data to a single bucket, causing imbalance and possible OOM.
// Naïve approach (single bucket)
SELECT click_dt,
COUNT(DISTINCT pin) AS uv
FROM log_table
GROUP BY click_dt;Pre‑aggregate with COLLECT_SET to create a bucket per day:
// Better – pre‑aggregate with collect_set
SELECT click_dt,
SIZE(COLLECT_SET(pin)) AS uv
FROM (SELECT click_dt, pin FROM log_table GROUP BY click_dt, pin) tmp
GROUP BY click_dt;Further improvement adds a length‑based bucket, spreading data across multiple reducers:
// Even better – bucket by pin length
SELECT click_dt,
SUM(uv_tmp) AS uv
FROM (SELECT LEN(pin) AS len_pin,
click_dt,
SIZE(COLLECT_SET(pin)) AS uv_tmp
FROM (SELECT click_dt, pin, LENGTH(pin) AS len_pin FROM log_table) sub
GROUP BY len_pin, click_dt) agg
GROUP BY click_dt;Adding len_pin creates multiple buckets, enabling parallel processing and avoiding OOM.
4. Leveraging platform tools for parameter handling
In Python‑Spark scripts the date range can be supplied via sys.argv. The script determines the appropriate start and end dates and passes them to Hive/Spark SQL.
if len(sys.argv) == 1:
# No external parameters – use yesterday and today
curday = ht.oneday(0)
today = datetime.datetime.strptime(curday, '%Y-%m-%d')
start_date = (today + datetime.timedelta(days=-1)).strftime('%Y-%m-%d')
end_date = today.strftime('%Y-%m-%d')
elif len(sys.argv) == 2:
# Single date supplied – treat as end_date, compute start of month
end_date = sys.argv[1]
start_date = (datetime.datetime.strptime(end_date, '%Y-%m-%d')
.replace(day=1)).strftime('%Y-%m-%d')
elif len(sys.argv) == 3:
# Explicit start and end dates for back‑fill
start_date = sys.argv[1]
end_date = sys.argv[2]
else:
print('parameter error')
sys.exit(1)The parameters are then used in an ht.exec_sql call with Spark configuration flags that enable adaptive execution, skew handling and resource tuning:
ht.exec_sql(
schema_name='app',
sql=showsql.format(htYDay_B=start_date, htYDay=end_date),
table_name='app1',
exec_engine='spark',
spark_resource_level='high',
retry_with_hive=False,
spark_args=[
'--conf spark.sql.hive.mergeFiles=true',
'--conf spark.sql.adaptive.enabled=true',
'--conf spark.sql.adaptive.repartition.enabled=true',
'--conf spark.sql.adaptive.join.enabled=true',
'--conf spark.sql.adaptive.skewedJoin.enabled=true',
'--conf spark.hadoop.hive.exec.orc.split.strategy=ETL',
'--conf spark.sql.shuffle.partitions=1200',
'--conf spark.driver.maxResultSize=8g',
'--conf spark.executor.memory=32g'
]
)This automation reduces manual date handling, prevents errors during back‑fill, and ensures consistent resource configuration.
Conclusion
Applying predicate push‑down via sub‑queries, selecting the appropriate deduplication method (prefer DISTINCT when the distinct column cardinality is low), designing bucket keys that distribute data evenly, and automating date parameters with platform‑provided tools together produce SQL that is both readable and performant, saving development effort and cluster resources.
Reference
https://www.jhelp.net/p/foafP0Vuwt7Qaa12
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
