Big Data · 14 min read

Techniques for Writing Elegant and Efficient SQL in Big Data Environments

The article shares practical methods and code examples for making SQL both readable and high‑performing in large‑scale data platforms, covering predicate push‑down with subqueries, deduplication strategies, bucket utilization, and Python‑driven job parameter handling.

JD Tech

After four years of data‑development work, the author observed that some SQL statements are clean and easy to understand while others become tangled and hard to maintain, prompting a collection of techniques that improve both readability and execution performance.

1. Subqueries and Predicate Push‑Down

Many developers write joins first and place all filter conditions in a large WHERE clause, which prevents predicate push‑down and can cause full‑table scans and data skew. The article shows a bad example and then a good version that moves filters into subqueries, allowing early row reduction and clearer logic.

-- -------------------- Bad: all filters in one WHERE clause after the joins --------------------
SELECT
    f1.pin,
    c1.site_id,
    c2.site_name
FROM
    fdm.fdm1 AS f1
LEFT JOIN cdm.cdm1 AS c1 ON f1.erp = lower(c1.account_number)
LEFT JOIN cdm.cdm2 AS c2 ON c1.site_id = c2.site_code
WHERE
    f1.start_date <= '""" + start_date + """'
    AND f1.end_date > '""" + start_date + """'
    AND f1.status = 1
    AND c1.dt = '""" + start_date + """'
    AND c2.yn = 1
GROUP BY
    f1.pin,
    c1.site_id,
    c2.site_name;
-- -------------------- Good: filters pushed down into subqueries --------------------
SELECT
  f1.pin,
  c1.site_id,
  c2.site_name
FROM
  (
    SELECT erp, pin FROM fdm.fdm1 WHERE dp = 'ACTIVE' AND status = 1
  ) f1
LEFT JOIN (
    SELECT site_id, lower(account_number) AS account_number FROM cdm.cdm1 WHERE dt = '""" + start_date + """'
) c1 ON f1.erp = c1.account_number
LEFT JOIN (
    SELECT site_code, site_name FROM cdm.cdm2 WHERE yn = 1
) c2 ON c1.site_id = c2.site_code
GROUP BY
  f1.pin,
  c1.site_id,
  c2.site_name;

Moving filters into subqueries enables early pruning of rows, improves join efficiency, and makes the data flow easier to follow.
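To see the equivalence concretely, here is a minimal sqlite3 sketch (table names and data are illustrative, and the partition-date condition is kept in the ON clause so both forms return identical rows): pushing the filter into a subquery produces the same result while the join itself processes fewer rows.

```python
import sqlite3

# Illustrative in-memory tables loosely mirroring fdm1/cdm1 from the article.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fdm1 (erp TEXT, pin TEXT, status INTEGER);
CREATE TABLE cdm1 (site_id INTEGER, account_number TEXT, dt TEXT);
INSERT INTO fdm1 VALUES ('a', 'p1', 1), ('b', 'p2', 0), ('c', 'p3', 1);
INSERT INTO cdm1 VALUES (10, 'A', '2024-01-01'), (20, 'C', '2024-01-01'),
                        (30, 'A', '2023-12-31');
""")

# Late filtering: the status filter is applied after the join.
late = con.execute("""
    SELECT f.pin, c.site_id
    FROM fdm1 f
    LEFT JOIN cdm1 c ON f.erp = lower(c.account_number) AND c.dt = '2024-01-01'
    WHERE f.status = 1
    ORDER BY f.pin
""").fetchall()

# Early filtering: each side is reduced inside a subquery before joining.
early = con.execute("""
    SELECT f.pin, c.site_id
    FROM (SELECT erp, pin FROM fdm1 WHERE status = 1) f
    LEFT JOIN (SELECT site_id, lower(account_number) AS account_number
               FROM cdm1 WHERE dt = '2024-01-01') c
      ON f.erp = c.account_number
    ORDER BY f.pin
""").fetchall()

assert late == early  # same result, but the join sees fewer input rows
print(early)          # [('p1', 10), ('p3', 20)]
```

On sqlite the optimizer largely hides the difference; on a distributed engine, reducing each side before the shuffle is what saves the full-table scan.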

2. Deduplication Challenges

The article compares GROUP BY and DISTINCT for deduplication, showing a benchmark on a 2‑billion‑row table where the DISTINCT version runs faster because GROUP BY incurs extra shuffle and aggregation overhead.

-- Good (DISTINCT)
select count(distinct ulp_base_age) from app.app1 where dt = sysdate(-1);

-- Bad (GROUP BY)
select count(ulp_base_age) from (
  select ulp_base_age from app.app1 where dt = sysdate(-1) group by ulp_base_age
) t;
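Both forms return the same count; the difference is purely in the execution plan, where the inner GROUP BY adds an extra shuffle-and-aggregate stage on a distributed engine. A small sqlite3 sketch (illustrative data) confirms the result equivalence:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE app1 (ulp_base_age INTEGER)")
con.executemany("INSERT INTO app1 VALUES (?)",
                [(18,), (18,), (25,), (30,), (25,)])

# COUNT(DISTINCT ...): one pass, one aggregation.
distinct_cnt = con.execute(
    "SELECT count(DISTINCT ulp_base_age) FROM app1").fetchone()[0]

# GROUP BY then COUNT: same answer, but on Hive/Spark the inner GROUP BY
# is a separate shuffle stage before the outer count.
groupby_cnt = con.execute("""
    SELECT count(ulp_base_age)
    FROM (SELECT ulp_base_age FROM app1 GROUP BY ulp_base_age) t
""").fetchone()[0]

assert distinct_cnt == groupby_cnt == 3
```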

For daily UV counting, the author demonstrates a naïve approach using COUNT(DISTINCT), which funnels every pin for a given day through a single reducer and causes skew, then a better approach that first groups by click_dt, pin and uses size(collect_set(pin)) so the distinct work is spread across multiple buckets.

-- Bad Codes
select click_dt, count(distinct pin) as uv from log_table group by click_dt;

-- Good Codes
SELECT click_dt, size(collect_set(pin)) AS uv
FROM (
  SELECT click_dt, pin FROM log_table GROUP BY click_dt, pin
) tmp
GROUP BY click_dt;

An even more optimized version adds a length‑based pre‑aggregation dimension to balance bucket loads and avoid OOM.

-- Even Better Codes
SELECT click_dt, SUM(uv_tmp) AS uv
FROM (
  SELECT len_pin, click_dt, size(collect_set(pin)) AS uv_tmp
  FROM (
    SELECT click_dt, pin, LENGTH(pin) AS len_pin FROM log_table
  ) log_table_tmp
  GROUP BY len_pin, click_dt
) tmp
GROUP BY click_dt;
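The length-based salt works because a given pin always lands in the same bucket (its own length), so the per-bucket distinct sets for a day are disjoint and their counts can simply be summed. This pure-Python sketch (illustrative data) mirrors the two plans and checks that they agree:

```python
from collections import defaultdict

# Illustrative click log: (click_dt, pin) pairs with repeats.
log = [
    ("2024-01-01", "p1"), ("2024-01-01", "p1"), ("2024-01-01", "p2"),
    ("2024-01-01", "long_pin_9"), ("2024-01-02", "p1"), ("2024-01-02", "p3"),
]

# Direct UV: count(distinct pin) per day -- one distinct set per day,
# the shape that overloads a single reducer in SQL.
direct = {}
for dt, pin in log:
    direct.setdefault(dt, set()).add(pin)
direct_uv = {dt: len(pins) for dt, pins in direct.items()}

# Salted UV: distinct pins per (len(pin), click_dt) bucket, then sum the
# partial counts per day -- mirrors the length-keyed GROUP BY above.
buckets = defaultdict(set)
for dt, pin in log:
    buckets[(len(pin), dt)].add(pin)
salted_uv = defaultdict(int)
for (_, dt), pins in buckets.items():
    salted_uv[dt] += len(pins)

assert dict(salted_uv) == direct_uv  # identical per-day UV
```

Note that this trick is valid only for distinct counts keyed by a function of the value itself; it would not work for a salt chosen at random per row.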

3. Fully Utilizing Platform Tools

The author shares a Python wrapper that parses command‑line arguments to set start_date and end_date for Spark jobs, then calls ht.exec_sql with appropriate engine settings, making daily back‑fill and batch scheduling more reliable.

import datetime
import sys

# `ht` is the platform's job-helper module (provides oneday() and exec_sql()).
if len(sys.argv) == 1:
    curday = ht.oneday(0)
    today = datetime.datetime.strptime(curday, '%Y-%m-%d')
    start_date = str((today + datetime.timedelta(days=-1)).strftime("%Y-%m-%d"))[:10]
    end_date = str(today)[:10]
elif len(sys.argv) == 2:
    end_date = str(datetime.datetime.strptime(sys.argv[1], "%Y-%m-%d"))[:10]
    start_date = str((datetime.datetime.strptime(end_date, "%Y-%m-%d")).replace(day=1))[:10]
elif len(sys.argv) == 3:
    start_date = str(datetime.datetime.strptime(sys.argv[1], "%Y-%m-%d"))[:10]
    end_date = str(datetime.datetime.strptime(sys.argv[2], "%Y-%m-%d"))[:10]
else:
    print('parameter error: expected at most two date arguments')
    sys.exit(1)

# Execute SQL with Spark settings
ht.exec_sql(
    schema_name='app',
    sql=showsql.format(htYDay_B=start_date, htYDay=end_date),
    table_name='app1',
    exec_engine='spark',
    spark_resource_level='high',
    retry_with_hive=False,
    spark_args=[
        '--conf spark.sql.hive.mergeFiles=true',
        '--conf spark.sql.adaptive.enabled=true',
        '--conf spark.sql.adaptive.repartition.enabled=true',
        '--conf spark.sql.adaptive.join.enabled=true',
        '--conf spark.sql.adaptive.skewedJoin.enabled=true',
        '--conf spark.hadoop.hive.exec.orc.split.strategy=ETL',
        '--conf spark.sql.shuffle.partitions=1200',
        '--conf spark.driver.maxResultSize=8g',
        '--conf spark.executor.memory=32g'
    ]
)
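The date-defaulting branches above can be isolated into a pure function (a hypothetical refactor, not part of the platform API) so the logic is unit-testable without the `ht` helper:

```python
import datetime

def resolve_dates(argv, today=None):
    """Mirror the job's date handling: no args -> yesterday..today,
    one arg -> first of that month..that day, two args -> explicit range."""
    today = today or datetime.date.today()
    if len(argv) == 1:
        return str(today - datetime.timedelta(days=1)), str(today)
    if len(argv) == 2:
        end = datetime.datetime.strptime(argv[1], "%Y-%m-%d").date()
        return str(end.replace(day=1)), str(end)
    if len(argv) == 3:
        start = datetime.datetime.strptime(argv[1], "%Y-%m-%d").date()
        end = datetime.datetime.strptime(argv[2], "%Y-%m-%d").date()
        return str(start), str(end)
    raise ValueError("parameter error: expected at most two date arguments")

print(resolve_dates(["job.py", "2024-03-15"]))  # ('2024-03-01', '2024-03-15')
print(resolve_dates(["job.py", "2024-03-01", "2024-03-10"]))
```

Raising instead of calling sys.exit keeps the function reusable from tests and from other schedulers.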

By automating date handling and configuring Spark parameters, developers reduce manual errors and improve job stability.

Overall, understanding SQL optimization principles, applying predicate push‑down, choosing appropriate deduplication methods, leveraging bucket strategies, and integrating platform tooling enable data engineers to write cleaner, faster, and more maintainable code.

Data Engineering · performance · optimization · big data · SQL · Hive · Spark
Written by

JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
