Big Data 15 min read

Fundamental Data Skills and Complex Query Techniques in MaxCompute

The article teaches developers essential MaxCompute data‑processing skills—from creating and naming tables, handling strings and dates, and writing basic SELECTs, joins, and aggregations, to employing advanced techniques such as temporary tables, CTEs, partitioning, and map‑join hints for efficient complex queries.

DaTaobao Tech

Oct 11, 2023

Fundamental Data Skills and Complex Query Techniques in MaxCompute

This article introduces two essential data‑processing skills: basic data concepts and complex query techniques, primarily using the MaxCompute (formerly ODPS) platform.

Background – Modern projects demand increasingly large and varied data, from dashboards for A/B experiments to deep analysis. The gap between data demand and production capability motivates a concise guide for developers.

Basic Data Skills

Table Basics – Tables are the start and end of data processing. Understanding how to read and create tables is fundamental.

Creating a temporary table:

-- Temporary tables should be prefixed with "tmp_"
-- Set a reasonable lifecycle to avoid waste
CREATE TABLE tmp_ut_cart_clk LIFECYCLE 7 AS
SELECT user_id
FROM <用户浏览数据表>;

Creating a permanent table (example):

-- analytics_dw is the project name; table name follows
CREATE TABLE IF NOT EXISTS analytics_dw.ads_tb_biz_request_opt_1d (
    bucket_id STRING COMMENT '分桶',
    os STRING COMMENT '系统',
    uv BIGINT COMMENT '分桶用户数',
    pv BIGINT COMMENT '页面访问pv',
    page_stay_time BIGINT COMMENT '页面停留时间（ms）',
    ...
) PARTITIONED BY (
    ds STRING COMMENT '日期'
) LIFECYCLE 30;

Note: PARTITIONED BY defines partition fields, which dramatically improve query speed.

Table Naming Conventions – Four common prefixes:

dim_ : dimension tables for joins.

dwd_ : cleaned detail tables.

dws_ : lightly aggregated tables.

ads_ : application‑layer tables for reporting.

Typical naming pattern for an ads table:

<project>.ads_<business>_<sub‑business>_<function>_<period>

Basic Queries

SET odps.sql.mapper.split.size=2048; -- increase split size if needed
SELECT user_id, page, time_stamp
FROM <App用户使用明细表>
WHERE ds = '${bizdate}'
  AND product = '<App名称>'
  AND event_type = '<浏览|点击>'
  AND page = '<页面标识>';

If the query exceeds the instance limit, increase split.size to reduce the number of instances.

String Processing

SELECT IF(page = 'Page_XXX', 'y', 'n') AS is_page_xxx,
       CASE WHEN hh <= 12 THEN '上午'
            WHEN hh > 12 AND hh <= 18 THEN '下午'
            ELSE '晚上' END AS time_period,
       KEYVALUE(args, ',', '=', 'itemid') AS item_id,
       SPLIT(value, '_')[0] AS first_part,
       COALESCE(a, b, c, '') AS cleaned,
       IF(bi_udf:bi_yt_compare_version(app_version, '10.24.10') >= 0, 'y', 'n') AS target_version,
       GET_JSON_OBJECT(json_str, '$.section.item.name') AS item_name,
       CAST(user_id AS BIGINT) AS user_id,
       TOUPPER(os), TOLOWER(os);

Date Handling

-- Convert string to date
TO_DATE('20230807', 'yyyyMMdd')
-- Add/subtract days
DATEADD(TO_DATE('20230807', 'yyyyMMdd'), 7, 'dd')   -- 20230814
DATEADD(TO_DATE('20230807', 'yyyyMMdd'), -7, 'dd') -- 20230731
-- Difference in days
DATEDIFF(TO_DATE('20230807','yyyyMMdd'), TO_DATE('20230806','yyyyMMdd'), 'dd') -- 1
-- Extract hour
DATEPART(TO_DATE('2023-08-07 12:13:22','yyyy-MM-dd hh:mi:ss'), 'hh');

Join Queries

SELECT a.user_id, a.arg1, a.args, b.bucket_id
FROM (
    SELECT user_id, arg1, args FROM <用户手淘行为表>
) a
LEFT JOIN (
    SELECT user_id, bucket_id FROM <AB实验分流表>
) b ON a.user_id = b.user_id;

Tips: clean and deduplicate tables before joining; for large‑small joins, consider /* + mapjoin(J2) */ to boost performance.

Aggregation

SELECT province, SUM(amount) AS gmv
FROM <每日成交表>
GROUP BY province
UNION ALL
SELECT '整体' AS province, SUM(amount) AS gmv
FROM <每日成交表>;

For multi‑dimensional aggregation, use GROUPING SETS or CUBE:

SELECT IF(GROUPING(province)=0, province, 'all') AS province,
       IF(GROUPING(city)=0, city, 'all') AS city,
       SUM(amount) AS gmv
FROM <每日成交表>
GROUP BY GROUPING SETS((), (province), (province, city));

Complex Query Strategies

Break down large queries into temporary tables, ODPS scripts, or CTEs. Example of a temporary‑table workflow:

DROP TABLE IF EXISTS tmp_step1_${bizdate};
CREATE TABLE tmp_step1_${bizdate} LIFECYCLE 3 AS
SELECT a, b, c FROM <数据表1> WHERE <条件>;

DROP TABLE IF EXISTS tmp_step2_${bizdate};
CREATE TABLE tmp_step2_${bizdate} LIFECYCLE 3 AS
SELECT a, b, c FROM tmp_step1_${bizdate};

SELECT * FROM tmp_stepN_${bizdate} GROUP BY xxx;

CTE (recommended) example:

WITH step1 AS (
  SELECT XX FROM XXXX
),
step2 AS (
  SELECT YY FROM step1
)
INSERT OVERWRITE TABLE <存储表名> PARTITION (ds='${bizdate}')
SELECT * FROM stepN;

Conclusion

Data processing is a learnable skill for developers and product engineers. Mastering the basics—table creation, naming, simple and complex queries—enables efficient data‑driven decision making.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Engineering SQL ETL MaxCompute

Written by

DaTaobao Tech

Official account of DaTaobao Technology

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.