Big Data 14 min read

How to Write Elegant, High‑Performance SQL for Big Data Pipelines

This article shares practical techniques for writing clean, efficient SQL in large‑scale data environments, covering predicate pushdown, sub‑queries, deduplication strategies, bucket optimization, and automation with Python‑Spark integration to improve readability and execution speed.

dbaplus Community

Jan 19, 2025

How to Write Elegant, High‑Performance SQL for Big Data Pipelines

Background

During four years of data development the author observed that some SQL scripts are concise and performant, while others span hundreds of lines, contain repetitive sub‑queries and scattered filter conditions, making them hard to read and maintain.

Methods and Techniques

1. Sub‑queries and predicate push‑down

When filter predicates are placed in the WHERE clause after a full join, the engine cannot push them down to the source tables, leading to full scans and possible data skew. Moving the predicates into sub‑queries allows early row pruning.

// Bad code example
SELECT f1.pin,
       c1.site_id,
       c2.site_name
FROM   fdm.fdm1 AS f1
LEFT   JOIN cdm.cdm1 AS c1 ON f1.erp = LOWER(c1.account_number)
LEFT   JOIN cdm.cdm2 AS c2 ON c1.site_id = c2.site_code
WHERE  f1.start_date <= '2024-01-01'
  AND  f1.end_date   >  '2024-01-01'
  AND  f1.status = 1
  AND  c1.dt = '2024-01-01'
  AND  c2.yn = 1
GROUP  BY f1.pin, c1.site_id, c2.site_name;

Issues:

Filters are applied after the join, preventing push‑down.

NULL‑supplement tables cause full scans and possible skew.

// Good code example
SELECT f1.pin,
       c1.site_id,
       c2.site_name
FROM   (SELECT erp, pin FROM fdm.fdm1 WHERE dp = 'ACTIVE' AND status = 1) f1
LEFT   JOIN (SELECT site_id, LOWER(account_number) AS account_number FROM cdm.cdm1 WHERE dt = '2024-01-01') c1
       ON f1.erp = c1.account_number
LEFT   JOIN (SELECT site_code, site_name FROM cdm.cdm2 WHERE yn = 1) c2
       ON c1.site_id = c2.site_code
GROUP  BY f1.pin, c1.site_id, c2.site_name;

Moving predicates into each sub‑query filters rows early, improves readability and reduces data processed.

2. Deduplication techniques

Common approaches are GROUP BY, DISTINCT, window function ROW_NUMBER() and COLLECT_SET. On a 2‑billion‑row table the following test shows the performance difference:

// Using DISTINCT (faster)
SELECT COUNT(DISTINCT ulp_base_age)
FROM   app.app1
WHERE  dt = SYSDATE(-1);

// Using GROUP BY (slower)
SELECT COUNT(ulp_base_age)
FROM   (SELECT ulp_base_age FROM app.app1 WHERE dt = SYSDATE(-1) GROUP BY ulp_base_age) t;

Result: DISTINCT finishes in ~3 minutes, while the GROUP BY version takes ~5 minutes because the latter incurs an extra shuffle and disk I/O.

3. Maximizing bucket utilization

Aggregating a large click‑stream table with COUNT(DISTINCT pin) per day sends all data to a single bucket, causing imbalance and possible OOM.

// Naïve approach (single bucket)
SELECT click_dt,
       COUNT(DISTINCT pin) AS uv
FROM   log_table
GROUP  BY click_dt;

Pre‑aggregate with COLLECT_SET to create a bucket per day:

// Better – pre‑aggregate with collect_set
SELECT click_dt,
       SIZE(COLLECT_SET(pin)) AS uv
FROM   (SELECT click_dt, pin FROM log_table GROUP BY click_dt, pin) tmp
GROUP  BY click_dt;

Further improvement adds a length‑based bucket, spreading data across multiple reducers:

// Even better – bucket by pin length
SELECT click_dt,
       SUM(uv_tmp) AS uv
FROM   (SELECT LEN(pin) AS len_pin,
               click_dt,
               SIZE(COLLECT_SET(pin)) AS uv_tmp
        FROM   (SELECT click_dt, pin, LENGTH(pin) AS len_pin FROM log_table) sub
        GROUP  BY len_pin, click_dt) agg
GROUP  BY click_dt;

Adding len_pin creates multiple buckets, enabling parallel processing and avoiding OOM.

4. Leveraging platform tools for parameter handling

In Python‑Spark scripts the date range can be supplied via sys.argv. The script determines the appropriate start and end dates and passes them to Hive/Spark SQL.

if len(sys.argv) == 1:
    # No external parameters – use yesterday and today
    curday = ht.oneday(0)
    today = datetime.datetime.strptime(curday, '%Y-%m-%d')
    start_date = (today + datetime.timedelta(days=-1)).strftime('%Y-%m-%d')
    end_date = today.strftime('%Y-%m-%d')
elif len(sys.argv) == 2:
    # Single date supplied – treat as end_date, compute start of month
    end_date = sys.argv[1]
    start_date = (datetime.datetime.strptime(end_date, '%Y-%m-%d')
                  .replace(day=1)).strftime('%Y-%m-%d')
elif len(sys.argv) == 3:
    # Explicit start and end dates for back‑fill
    start_date = sys.argv[1]
    end_date = sys.argv[2]
else:
    print('parameter error')
    sys.exit(1)

The parameters are then used in an ht.exec_sql call with Spark configuration flags that enable adaptive execution, skew handling and resource tuning:

ht.exec_sql(
    schema_name='app',
    sql=showsql.format(htYDay_B=start_date, htYDay=end_date),
    table_name='app1',
    exec_engine='spark',
    spark_resource_level='high',
    retry_with_hive=False,
    spark_args=[
        '--conf spark.sql.hive.mergeFiles=true',
        '--conf spark.sql.adaptive.enabled=true',
        '--conf spark.sql.adaptive.repartition.enabled=true',
        '--conf spark.sql.adaptive.join.enabled=true',
        '--conf spark.sql.adaptive.skewedJoin.enabled=true',
        '--conf spark.hadoop.hive.exec.orc.split.strategy=ETL',
        '--conf spark.sql.shuffle.partitions=1200',
        '--conf spark.driver.maxResultSize=8g',
        '--conf spark.executor.memory=32g'
    ]
)

This automation reduces manual date handling, prevents errors during back‑fill, and ensures consistent resource configuration.

Conclusion

Applying predicate push‑down via sub‑queries, selecting the appropriate deduplication method (prefer DISTINCT when the distinct column cardinality is low), designing bucket keys that distribute data evenly, and automating date parameters with platform‑provided tools together produce SQL that is both readable and performant, saving development effort and cluster resources.

Reference

https://www.jhelp.net/p/foafP0Vuwt7Qaa12

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Optimization Hive Spark

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.