
Mastering ETL: 8 Essential Algorithms for Modern Data Warehouses

This article explains why ETL is a critical step in building a data warehouse, introduces eight core ETL algorithms (including full delete/insert, upsert, append, and several link-table models), describes their ideal use cases, and provides SQL templates for the models it covers.

Huawei Cloud Developer Alliance

Sep 15, 2020

Why ETL Matters in Data Warehousing

Data warehouses support enterprise-wide decision making by consolidating data from across the business. Cloud providers now offer mature tooling, but developers still need to understand the underlying techniques, especially the ETL (Extract, Transform, Load) process that moves data from source systems into the warehouse.

ETL Overview

ETL extracts required data from source systems, cleans and transforms it, and loads it into a predefined warehouse model, turning fragmented, inconsistent data into a unified source for analysis.
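As a minimal sketch of the three stages, the Python snippet below uses the standard-library sqlite3 module as a stand-in warehouse. The sample rows, the fact_payment table, and the cleaning rules are all hypothetical and chosen only for illustration.

```python
import sqlite3

def extract():
    """Stand-in for reading a source system; the sample rows are made up."""
    return [(" 1", "usd ", "10.50"), ("2", "EUR", "7"), ("3", "Usd", None)]

def transform(rows):
    """Trim and upper-case currency codes, cast types, drop rows with no amount."""
    return [
        (int(row_id), currency.strip().upper(), float(amount))
        for row_id, currency, amount in rows
        if amount is not None
    ]

def load(conn, rows):
    """Write the cleaned rows into the (hypothetical) warehouse table."""
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO fact_payment VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_payment (payment_id INTEGER, currency TEXT, amount REAL)"
)
load(conn, transform(extract()))
print(conn.execute("SELECT * FROM fact_payment ORDER BY payment_id").fetchall())
# [(1, 'USD', 10.5), (2, 'EUR', 7.0)]
```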

ETL Algorithm Overview

Eight ETL algorithms are grouped into four major categories. The most common are incremental accumulation and link‑table (chain) algorithms, though full delete/insert and upsert are also widely used.

Full Delete/Insert Model

Use case: Loading dimension tables, parameter tables, or master data where the source provides a complete snapshot and only the latest full data is needed.

Implementation logic:

1. Clear the target table.
2. Insert all rows from the source table.

-- 1. Clear the target table
TRUNCATE TABLE <target_table>;
-- 2. Insert the full source snapshot
INSERT INTO <target_table> (columns...)
SELECT columns...
FROM <source_table>
JOIN <related_data> ON ...
WHERE ...;
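A runnable sketch of the same two steps, assuming SQLite through Python's sqlite3 module so it can be tried locally. The dim_currency table and its rows are hypothetical. SQLite has no TRUNCATE, so an unqualified DELETE takes its place here.

```python
import sqlite3

def full_reload(conn, rows):
    """Replace the whole target table with the latest source snapshot."""
    with conn:  # delete and insert commit together, or not at all
        # SQLite has no TRUNCATE; an unqualified DELETE is used instead.
        conn.execute("DELETE FROM dim_currency")
        conn.executemany(
            "INSERT INTO dim_currency (code, name) VALUES (?, ?)", rows
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_currency (code TEXT PRIMARY KEY, name TEXT)")

full_reload(conn, [("USD", "US Dollar"), ("EUR", "Euro")])
# Next snapshot: EUR was removed upstream, CNY was added.
full_reload(conn, [("USD", "US Dollar"), ("CNY", "Chinese Yuan")])

print(conn.execute("SELECT code FROM dim_currency ORDER BY code").fetchall())
# [('CNY',), ('USD',)]
```

Because the target always mirrors the latest snapshot, rows removed at the source disappear from the target as well, which is why this model suits tables where only the current state matters.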

Upsert (Incremental Full) Model

Use case: Loading parameter or master tables where the source may be incremental or full, and the target must always hold the latest records.

Implementation logic:

1. Compare primary keys.
2. Update matching rows.
3. Insert rows that do not exist.

-- 1. Stage the transformed source rows
CREATE TEMP TABLE <temp_table> AS
SELECT columns...
FROM <source_table>
JOIN <related_data> ON ...
WHERE ...;
-- 2. Merge into target: update matching keys, insert new ones
MERGE INTO <target_table> AS T
USING <temp_table> AS S
ON (T.PK = S.PK)
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (columns...) VALUES (S.columns...);
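The compare, update, and insert steps can be sketched as follows. This assumes SQLite via Python's sqlite3 module, which has no MERGE statement, so the sketch substitutes an UPDATE followed by an INSERT ... WHERE NOT EXISTS. The dim_customer and stg_customer tables are hypothetical.

```python
import sqlite3

def upsert(conn):
    """Merge stg_customer into dim_customer by primary key."""
    with conn:
        # Update rows whose primary key already exists in the target.
        conn.execute("""
            UPDATE dim_customer
            SET name = (SELECT s.name FROM stg_customer s
                        WHERE s.id = dim_customer.id),
                city = (SELECT s.city FROM stg_customer s
                        WHERE s.id = dim_customer.id)
            WHERE id IN (SELECT id FROM stg_customer)
        """)
        # Insert rows whose primary key is not in the target yet.
        conn.execute("""
            INSERT INTO dim_customer (id, name, city)
            SELECT s.id, s.name, s.city
            FROM stg_customer s
            WHERE NOT EXISTS (SELECT 1 FROM dim_customer t WHERE t.id = s.id)
        """)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE stg_customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO dim_customer VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome');
    INSERT INTO stg_customer VALUES (2, 'Bob', 'Milan'), (3, 'Cy', 'Lima');
""")
upsert(conn)
print(conn.execute("SELECT * FROM dim_customer ORDER BY id").fetchall())
# [(1, 'Ann', 'Oslo'), (2, 'Bob', 'Milan'), (3, 'Cy', 'Lima')]
```

Row 1 is absent from the staging table and stays untouched, which is the difference from the full delete/insert model.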

Append (Incremental Accumulation) Model

Use case: Loading transaction or event tables where each day's data is appended to preserve the full history.

-- Append the current batch to the target
INSERT INTO <target_table> (columns...)
SELECT columns...
FROM <source_table>
JOIN <related_data> ON ...
WHERE ...;
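A minimal sketch of a daily append, assuming SQLite via Python's sqlite3 module and hypothetical src_txn and fact_txn tables. Deleting the day's slice before inserting is an added safeguard so that a rerun does not duplicate rows; it is not part of the template above.

```python
import sqlite3

def append_day(conn, load_date):
    """Append one day's source rows to the fact table."""
    with conn:
        # Added safeguard (an assumption, not from the template): clear the
        # day's slice first so that rerunning the load does not duplicate it.
        conn.execute("DELETE FROM fact_txn WHERE txn_date = ?", (load_date,))
        conn.execute(
            """
            INSERT INTO fact_txn (txn_id, txn_date, amount)
            SELECT txn_id, txn_date, amount FROM src_txn WHERE txn_date = ?
            """,
            (load_date,),
        )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_txn  (txn_id INTEGER, txn_date TEXT, amount REAL);
    CREATE TABLE fact_txn (txn_id INTEGER, txn_date TEXT, amount REAL);
    INSERT INTO src_txn VALUES
        (1, '2020-09-14', 20.0), (2, '2020-09-14', 5.0), (3, '2020-09-15', 8.0);
""")
append_day(conn, "2020-09-14")
append_day(conn, "2020-09-15")
append_day(conn, "2020-09-15")  # rerun: still one copy of that day

print(conn.execute("SELECT COUNT(*) FROM fact_txn").fetchone()[0])
# 3
```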

Full History Link (Chain) Model

Concept: A link table contains a primary key, change‑tracking fields, a start date, and an end date to record the validity period of each row.

Benefit: Enables fast retrieval of the data valid on any given date, and uses less storage than keeping a full snapshot per day because a new row is written only when a record changes.

Example query: Retrieve rows valid on 2020‑02‑05.

SELECT *
FROM <target_table>
WHERE start_date <= DATE '2020-02-05'
AND end_date > DATE '2020-02-05';
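To make the validity-period filter concrete, here is a small sketch assuming SQLite via Python's sqlite3 module. The cust_chain table, its rows, and the 9999-12-31 sentinel for "still valid" are assumptions for illustration. Dates are stored as ISO-8601 text, which compares correctly as strings.

```python
import sqlite3

MAX_DATE = "9999-12-31"  # assumed sentinel end date for rows that are still valid

def as_of(conn, day):
    """Return the rows whose validity period [start_date, end_date) covers `day`."""
    return conn.execute(
        """
        SELECT cust_id, city FROM cust_chain
        WHERE start_date <= ? AND end_date > ?
        ORDER BY cust_id
        """,
        (day, day),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cust_chain (cust_id INTEGER, city TEXT, start_date TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO cust_chain VALUES (?, ?, ?, ?)",
    [
        (1, "Oslo", "2020-01-01", "2020-02-03"),  # closed chain
        (1, "Rome", "2020-02-03", MAX_DATE),      # current chain for customer 1
        (2, "Lima", "2020-02-10", MAX_DATE),      # not yet valid on 2020-02-05
    ],
)

print(as_of(conn, "2020-02-05"))  # [(1, 'Rome')]
print(as_of(conn, "2020-02-01"))  # [(1, 'Oslo')]
```

The interval is half-open: a row closed on 2020-02-03 is no longer returned for that date, and the row opened on the same day is, so each key has at most one valid row per date.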

Incremental Link Model

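Use case: The source delivers incremental data and the history of each record must be kept. Whenever the tracked fields of a primary key change, the current chain is closed and a new one is opened.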

Implementation logic:

1. Extract yesterday's open-chain records.
2. Compare PKs with today's source.
3. Close old chains and open new ones for changed rows.
4. Insert new rows for new PKs.

-- 1. Extract current valid (open-chain) records
INSERT INTO <temp_pre>
SELECT ... FROM <target_table>
WHERE end_date = DATE '<max_date>';
-- 2. Extract today's source records into a staging table
-- 3. Identify changed rows and close their old chains (set end_date to the load date)
-- 4. Open new chains (end_date = <max_date>) for changed rows and new PKs
-- (These steps use the same INSERT ... SELECT pattern shown above)
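The four steps can be sketched end to end as follows. This is one possible reading of the logic, not the article's original SQL, and it assumes SQLite via Python's sqlite3 module. The src_cust, cust_chain, and tmp_delta tables, the single tracked column city, and the 9999-12-31 sentinel are all hypothetical.

```python
import sqlite3

MAX_DATE = "9999-12-31"  # assumed sentinel end date marking an open chain

def set_source(conn, rows):
    """Replace the staged source rows (stands in for today's extract)."""
    with conn:
        conn.execute("DELETE FROM src_cust")
        conn.executemany("INSERT INTO src_cust VALUES (?, ?)", rows)

def load_chain(conn, load_date):
    """Apply today's source rows in src_cust to the chain table cust_chain."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS tmp_delta (cust_id INTEGER, city TEXT)"
    )
    with conn:
        conn.execute("DELETE FROM tmp_delta")
        # Delta = source rows with no open chain (new key) or a changed value.
        # IS NOT is SQLite's NULL-safe inequality operator.
        conn.execute("""
            INSERT INTO tmp_delta
            SELECT s.cust_id, s.city
            FROM src_cust s
            LEFT JOIN cust_chain t
              ON t.cust_id = s.cust_id AND t.end_date = ?
            WHERE t.cust_id IS NULL OR t.city IS NOT s.city
        """, (MAX_DATE,))
        # Close the open chain of every changed key.
        conn.execute("""
            UPDATE cust_chain SET end_date = ?
            WHERE end_date = ? AND cust_id IN (SELECT cust_id FROM tmp_delta)
        """, (load_date, MAX_DATE))
        # Open a new chain for changed rows and for new keys.
        conn.execute("""
            INSERT INTO cust_chain (cust_id, city, start_date, end_date)
            SELECT cust_id, city, ?, ? FROM tmp_delta
        """, (load_date, MAX_DATE))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_cust   (cust_id INTEGER, city TEXT);
    CREATE TABLE cust_chain (cust_id INTEGER, city TEXT,
                             start_date TEXT, end_date TEXT);
""")
set_source(conn, [(1, "Oslo"), (2, "Lima")])
load_chain(conn, "2020-02-01")
set_source(conn, [(1, "Rome"), (2, "Lima"), (3, "Kyiv")])
load_chain(conn, "2020-02-03")

for row in conn.execute("SELECT * FROM cust_chain ORDER BY cust_id, start_date"):
    print(row)
# (1, 'Oslo', '2020-02-01', '2020-02-03')
# (1, 'Rome', '2020-02-03', '9999-12-31')
# (2, 'Lima', '2020-02-01', '9999-12-31')
# (3, 'Kyiv', '2020-02-03', '9999-12-31')
```

Keys that are missing from today's source keep their open chain, since an incremental feed says nothing about rows it does not mention.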

Delete‑Aware Link Model

Use case: The source delivers incremental data and marks removed rows through a business field (for example, a delete flag), and the warehouse must record when each row stopped being valid.

Implementation logic:

1. Extract yesterday’s open‑chain data.
2. Identify source rows marked as deleted.
3. Close the corresponding chains in the target.

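-- Template following the three steps above; names in <...> are placeholders.
-- 1. Extract yesterday's open-chain records
INSERT INTO <temp_pre>
SELECT ... FROM <target_table>
WHERE end_date = DATE '<max_date>';
-- 2. Identify source rows marked as deleted
INSERT INTO <temp_del>
SELECT PK FROM <source_table>
WHERE <delete_flag_condition>;
-- 3. Close the corresponding open chains in the target
UPDATE <target_table>
SET end_date = DATE '<load_date>'
WHERE end_date = DATE '<max_date>'
AND PK IN (SELECT PK FROM <temp_del>);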
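A runnable sketch of the implementation logic listed for this model, assuming SQLite via Python's sqlite3 module. The cust_chain and src_cust_inc tables, the op column with 'D' as the delete marker, and the 9999-12-31 sentinel are hypothetical stand-ins for whatever business field the source actually uses.

```python
import sqlite3

MAX_DATE = "9999-12-31"  # assumed sentinel end date marking an open chain

def close_deleted(conn, load_date):
    """Close open chains whose key is flagged as deleted in today's increment."""
    with conn:
        cur = conn.execute("""
            UPDATE cust_chain SET end_date = ?
            WHERE end_date = ?
              AND cust_id IN (SELECT cust_id FROM src_cust_inc WHERE op = 'D')
        """, (load_date, MAX_DATE))
    return cur.rowcount  # number of chains that were closed

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cust_chain (cust_id INTEGER, city TEXT,
                             start_date TEXT, end_date TEXT);
    CREATE TABLE src_cust_inc (cust_id INTEGER, op TEXT);
    INSERT INTO cust_chain VALUES
        (1, 'Oslo', '2020-01-01', '9999-12-31'),
        (2, 'Lima', '2020-01-01', '9999-12-31');
    INSERT INTO src_cust_inc VALUES (2, 'D');
""")
closed = close_deleted(conn, "2020-02-05")
print(closed)
# 1
print(conn.execute(
    "SELECT cust_id, end_date FROM cust_chain ORDER BY cust_id"
).fetchall())
# [(1, '9999-12-31'), (2, '2020-02-05')]
```

The deleted row is not physically removed. Its chain is closed, so queries for dates before the load date still return it.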

Other Considerations

Best practices suggest adding control fields (insert date, update date, source) to all tables to further trace data changes. ETL algorithms can be customized beyond the standard models to meet specific business needs.
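As an illustration of control fields, the sketch below stamps insert_date, update_date, and source on every write. It assumes SQLite via Python's sqlite3 module, and the param_tbl table, its column names, and the source labels are hypothetical.

```python
import sqlite3

def save_param(conn, code, value, source, load_date):
    """Insert or update one parameter row, stamping the control fields."""
    with conn:
        cur = conn.execute(
            "UPDATE param_tbl SET param_value = ?, update_date = ?, source = ?"
            " WHERE param_code = ?",
            (value, load_date, source, code),
        )
        if cur.rowcount == 0:  # key not present yet: this is the first load
            conn.execute(
                "INSERT INTO param_tbl"
                " (param_code, param_value, insert_date, update_date, source)"
                " VALUES (?, ?, ?, ?, ?)",
                (code, value, load_date, load_date, source),
            )

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE param_tbl (
        param_code  TEXT PRIMARY KEY,
        param_value TEXT,
        insert_date TEXT,  -- when the row first arrived
        update_date TEXT,  -- when it last changed
        source      TEXT   -- which system supplied the latest value
    )
""")
save_param(conn, "rate", "0.05", "crm", "2020-09-14")
save_param(conn, "rate", "0.06", "erp", "2020-09-15")

print(conn.execute("SELECT * FROM param_tbl").fetchall())
# [('rate', '0.06', '2020-09-14', '2020-09-15', 'erp')]
```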
