What Happens Behind the Scenes When a SQL Query Runs in a Big Data Platform?

This article walks through the end‑to‑end lifecycle of a SQL task in a big‑data environment, covering creation, scheduling metadata, instance generation, resource allocation, ODPS execution, and final processing on the Fuxi distributed engine.

Alibaba Cloud Developer

Apr 25, 2024

As a newcomer to big data development, the author explores the complete lifecycle of a SQL task, from creation in the DataPhin IDE to execution on the ODPS platform.

1. Overall Process

The article begins with a flow diagram illustrating the end‑to‑end steps of a SQL job, using a simple example that counts prize distribution per activity.

2. Task Development and Deployment

In the IDE, a new offline periodic task is created on the DataPhin development page, and the following SQL is written:

SELECT prize_id,
       COUNT(*) AS prize_send_cnt_1d
FROM   apcdm.dwd_ap_mkt_eqt_send_di
WHERE  dt = '${bizdate}'
  AND  prize_id IN ('PZ169328936', 'PZ169298703')
GROUP BY prize_id;

After the SQL is written, the task's scheduling metadata is configured: basic information such as the task ID, name, node type, and owner, plus parameters such as the business date, cron expression, and dependency information. The categories are summarized below.

Basic Information: task node ID, name, type, owner
Scheduling Parameters: biz date (previous day), schedule time, etc.
Scheduling Attributes: instance generation, type, effective dates, retry policy, period, cron expression
Dependencies: upstream/downstream node relationships
Node Context: input and output parameters
Execution Info: engine and resource group
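
To make the configuration step concrete, here is a minimal Python sketch of how such metadata might be held and how the ${bizdate} placeholder in the SQL above could be filled in. The TaskConfig class, its field names, the node ID, the cron string, and the yyyymmdd date format are all assumptions for illustration; the article does not show DataPhin's actual data model or substitution rules.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from string import Template


@dataclass
class TaskConfig:
    """Hypothetical container mirroring the metadata categories listed above."""
    node_id: str
    name: str
    node_type: str
    owner: str
    cron: str                                      # illustrative; real cron syntax not given in the article
    upstream: list = field(default_factory=list)   # names of upstream nodes
    engine: str = "ODPS"
    resource_group: str = "default"


def render_sql(sql: str, run_date: date) -> str:
    """Fill ${bizdate} with the day before run_date (assumed yyyymmdd format)."""
    bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
    # safe_substitute leaves any other ${...} placeholders untouched.
    return Template(sql).safe_substitute(bizdate=bizdate)


SQL_TEMPLATE = (
    "SELECT prize_id, COUNT(*) AS prize_send_cnt_1d "
    "FROM apcdm.dwd_ap_mkt_eqt_send_di "
    "WHERE dt = '${bizdate}' "
    "AND prize_id IN ('PZ169328936', 'PZ169298703') "
    "GROUP BY prize_id;"
)

config = TaskConfig(
    node_id="n_0001",            # made-up ID
    name="prize_send_cnt_1d",
    node_type="ODPS_SQL",        # made-up type label
    owner="dev_user",
    cron="0 22 * * *",
)
print(render_sql(SQL_TEMPLATE, date(2024, 4, 25)))
```

Keeping the date arithmetic in one function means a rerun for a past day only needs a different run_date, which is the usual reason a business date is a parameter rather than a literal in the SQL.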

The task is then submitted and, optionally, smoke-tested in the development environment before it is published.

3. Instance Generation

At the scheduled time (e.g., 22:00), the Phoenix scheduler compiles task definitions into executable instances, builds a DAG based on lineage and time dependencies, and resolves cron expressions.
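
The instance-generation step can be sketched roughly as follows. This is not Phoenix's implementation: the dependency map, the downstream prize_report task, and the instance fields are hypothetical, and cron resolution is left out. It only shows how task definitions might be turned into per-day instances ordered by their dependencies, using Python's standard graphlib module (Python 3.9+).

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter

# Hypothetical task definitions: task name -> list of upstream task names.
TASK_DEPS = {
    "dwd_ap_mkt_eqt_send_di": [],
    "prize_send_cnt_1d": ["dwd_ap_mkt_eqt_send_di"],
    "prize_report": ["prize_send_cnt_1d"],   # made-up downstream task
}


def generate_instances(task_deps, run_date):
    """Create one instance per task for the business date, upstream first."""
    bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
    # TopologicalSorter expects node -> predecessors, which matches TASK_DEPS.
    order = TopologicalSorter(task_deps).static_order()
    return [
        {
            "instance_id": f"{name}_{bizdate}",
            "task": name,
            "bizdate": bizdate,
            "waits_for": [f"{up}_{bizdate}" for up in task_deps[name]],
            "status": "WAITING",
        }
        for name in order
    ]


for inst in generate_instances(TASK_DEPS, date(2024, 4, 25)):
    print(inst["instance_id"], "waits for", inst["waits_for"])
```

A cyclic dependency makes TopologicalSorter raise graphlib.CycleError, which is the kind of check a scheduler would need before it accepts a DAG.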

4. Resource Allocation

The Alisa execution engine allocates slots from resource groups to the task. Gateways submit jobs to ODPS, and slot management prioritizes business-critical tasks while keeping allocation fair across tenants.
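
One way to picture priority plus multi-tenant fairness is a priority heap combined with a per-tenant slot cap. The sketch below is an illustration under that assumption only. The article does not describe Alisa's actual policy, and the function name, request tuple layout, and quota values are invented here.

```python
import heapq


def allocate_slots(requests, total_slots, tenant_quota):
    """Grant slots in priority order, subject to a per-tenant cap.

    requests: list of (priority, tenant, task_id, slots_needed);
              a lower priority number means a more critical task.
    Returns the task IDs that were granted slots, in grant order.
    """
    # The sequence number breaks ties so equal priorities stay first-come-first-served.
    heap = [
        (prio, seq, tenant, task_id, need)
        for seq, (prio, tenant, task_id, need) in enumerate(requests)
    ]
    heapq.heapify(heap)

    free = total_slots
    used_by_tenant = {}
    granted = []
    while heap:
        _, _, tenant, task_id, need = heapq.heappop(heap)
        used = used_by_tenant.get(tenant, 0)
        fits_cluster = need <= free
        fits_quota = used + need <= tenant_quota.get(tenant, 0)
        if fits_cluster and fits_quota:
            free -= need
            used_by_tenant[tenant] = used + need
            granted.append(task_id)
        # Requests that do not fit are skipped in this sketch; a real
        # system would keep them queued until slots are released.
    return granted


requests = [
    (5, "tenant_a", "t1", 2),
    (1, "tenant_b", "t2", 2),
    (1, "tenant_a", "t3", 2),
    (3, "tenant_a", "t4", 2),
]
print(allocate_slots(requests, total_slots=6, tenant_quota={"tenant_a": 2, "tenant_b": 4}))
```

In this run tenant_a's cap of two slots stops t4 and t1 even though the cluster still has free slots, which is how a quota keeps one tenant from crowding out the others.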

5. ODPS Job Execution

Submitted jobs enter the ODPS control layer, which consists of the Worker, Scheduler, and Executor. The Scheduler creates instances, splits them into tasks, and places the tasks in a priority queue. Executors poll the queue, pick up tasks, and perform SQL parsing followed by logical and physical planning.
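
The hand-off between Scheduler and Executor described here can be mimicked with a standard priority queue. The sketch below is a toy model rather than ODPS code: the function names, the one-task-per-instance split, and the stage labels are assumptions, and the compilation stages are only recorded, not performed.

```python
import queue

# Stage names taken from the description above; the real compiler is far richer.
PIPELINE = ("parse", "logical_plan", "physical_plan")


def split_into_tasks(instance_id, sql, priority):
    """Assumption: a simple SQL instance yields a single task."""
    return [(priority, f"{instance_id}/sql_task_0", sql)]


def run_executor(task_queue):
    """Poll the queue until it is empty, handling the lowest priority number first."""
    completed = []
    while True:
        try:
            _priority, task_id, _sql = task_queue.get_nowait()
        except queue.Empty:
            break
        # Stand-in for real compilation: record which stages would run, in order.
        stages_run = list(PIPELINE)
        completed.append((task_id, stages_run))
    return completed


tasks = queue.PriorityQueue()
for item in split_into_tasks("inst_report", "SELECT 1", priority=2):
    tasks.put(item)
for item in split_into_tasks("inst_prize", "SELECT 1", priority=1):
    tasks.put(item)

for task_id, stages in run_executor(tasks):
    print(task_id, "->", " -> ".join(stages))
```

Although inst_report is enqueued first, inst_prize is picked up first because its priority number is lower.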

6. Physical Execution on Fuxi

The physical plan is transformed into a DAG of Fuxi tasks. The Fuxi Master schedules these tasks on agents, which launch worker processes that read data, perform computation, and write results back.

Once all workers have finished and the results are written, the Application Master reports completion, resources are released, and the task status is updated to SUCCESS, leaving the task ready for the next scheduling cycle.
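
For the example query, the worker-level flow can be pictured as two stages: workers that each scan a slice of the table and count matching rows, followed by a stage that merges the partial counts. The sketch below simulates this in plain Python. The two-stage split, the function names, and the sample rows are assumptions for illustration, since the article does not show the plan ODPS actually generates.

```python
from collections import Counter

TARGET_PRIZES = {"PZ169328936", "PZ169298703"}


def map_worker(rows, bizdate):
    """Scan one data split: apply the WHERE filter and count per prize_id."""
    partial = Counter()
    for dt, prize_id in rows:
        if dt == bizdate and prize_id in TARGET_PRIZES:
            partial[prize_id] += 1
    return partial


def reduce_worker(partials):
    """Merge the partial counts from all map workers (the GROUP BY result)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Made-up sample rows of (dt, prize_id), divided into two splits.
splits = [
    [("20240424", "PZ169328936"), ("20240424", "PZ000000000")],
    [("20240424", "PZ169328936"), ("20240424", "PZ169298703"),
     ("20240423", "PZ169298703")],
]
partials = [map_worker(split, "20240424") for split in splits]
print(reduce_worker(partials))
```

Counting inside each map worker before merging keeps the data passed between stages small, which is why aggregations are commonly split this way in distributed engines.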
