Mastering ETL: 8 Essential Algorithms for Modern Data Warehouses
This article explains why ETL is a critical step in building data warehouses, introduces eight core ETL algorithms (including full delete/insert, upsert, append, and various link‑table models), describes their ideal use cases, and provides SQL templates with placeholders for the models it covers.
Why ETL Matters in Data Warehousing
Data warehouses support enterprise‑wide decision making by consolidating all types of data. While cloud providers now offer mature tools, developers must still master the technical side, especially the ETL (Extract, Transform, Load) process that moves data from source systems into the warehouse.
ETL Overview
ETL extracts required data from source systems, cleans and transforms it, and loads it into a predefined warehouse model, turning fragmented, inconsistent data into a unified source for analysis.
ETL Algorithm Overview
Eight ETL algorithms are grouped into four major categories. The most common are incremental accumulation and link‑table (chain) algorithms, though full delete/insert and upsert are also widely used.
Full Delete/Insert Model
Use case: Loading dimension tables, parameter tables, or master data where the source provides a complete snapshot and only the latest full data is needed.
Implementation logic:
1. Clear the target table.
2. Insert all rows from the source table.
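The two steps above can be sketched as a small runnable example. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, and the table `dim_branch` and the helper `full_reload` are hypothetical names that do not come from the article.

```python
import sqlite3

def full_reload(conn, rows):
    """Replace the whole target table with the latest source snapshot."""
    with conn:  # the delete and the insert commit (or roll back) together
        # SQLite has no TRUNCATE; an unqualified DELETE plays that role here.
        conn.execute("DELETE FROM dim_branch")
        conn.executemany(
            "INSERT INTO dim_branch (branch_id, branch_name) VALUES (?, ?)", rows
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_branch (branch_id INTEGER PRIMARY KEY, branch_name TEXT)")

full_reload(conn, [(1, "North"), (2, "South")])
full_reload(conn, [(1, "North"), (3, "East")])  # branch 2 is gone from the new snapshot

result = conn.execute(
    "SELECT branch_id, branch_name FROM dim_branch ORDER BY branch_id"
).fetchall()
print(result)
```

After the second load the table holds only what the latest snapshot contained, which is why this model suits small tables where history is not needed.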
-- 1. Clean target table
TRUNCATE TABLE <target_table>;
-- 2. Full insert
INSERT INTO <target_table> (columns...)
SELECT columns...
FROM <source_table>
JOIN <related_data>
WHERE ...;
Upsert (Incremental Full) Model
Use case: Loading parameter or master tables where the source may be incremental or full, and the target must always hold the latest records.
Implementation logic:
1. Compare primary keys.
2. Update matching rows.
3. Insert rows that do not exist.
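A minimal sketch of this update-then-insert logic follows, assuming Python's built-in sqlite3 module. SQLite does not offer a MERGE statement, so the sketch swaps in an UPDATE followed by an INSERT ... WHERE NOT EXISTS. The tables `target` and `staging` and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target  (pk INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE staging (pk INTEGER PRIMARY KEY, val TEXT);
INSERT INTO target  VALUES (1, 'old'), (2, 'keep');
INSERT INTO staging VALUES (1, 'new'), (3, 'added');
""")

with conn:  # run both statements in one transaction
    # Update rows whose primary key already exists in the target.
    conn.execute("""
        UPDATE target
        SET val = (SELECT s.val FROM staging s WHERE s.pk = target.pk)
        WHERE pk IN (SELECT pk FROM staging)
    """)
    # Insert rows whose primary key is not in the target yet.
    conn.execute("""
        INSERT INTO target (pk, val)
        SELECT s.pk, s.val FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.pk = s.pk)
    """)

upserted = conn.execute("SELECT pk, val FROM target ORDER BY pk").fetchall()
print(upserted)
```

Row 2 is untouched because it is absent from the staging data, so the same code works whether the source delivers a full or an incremental extract.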
-- 1. Generate staging table
CREATE TEMP TABLE <temp_table> AS
SELECT columns... FROM <source_table>
JOIN <related_data>
WHERE ...;
-- 2. Merge into target
MERGE INTO <target_table> AS T
USING <temp_table> AS S
ON (T.PK = S.PK)
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (columns...) VALUES (S.columns...);
Append (Incremental Accumulation) Model
Use case: Loading transaction or event tables where each day's data is appended to preserve the full history.
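As an illustration, the sketch below appends two daily batches with Python's built-in sqlite3 module. The table `fact_txn`, the `load_date` column, and the helper `append_batch` are assumptions made for the example, not names from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_txn (txn_id INTEGER, amount REAL, load_date TEXT)")

def append_batch(conn, load_date, rows):
    """Append one day's rows; rows loaded earlier are never touched."""
    with conn:
        conn.executemany(
            "INSERT INTO fact_txn (txn_id, amount, load_date) VALUES (?, ?, ?)",
            [(txn_id, amount, load_date) for txn_id, amount in rows],
        )

append_batch(conn, "2020-02-04", [(1, 10.0), (2, 25.5)])
append_batch(conn, "2020-02-05", [(3, 7.0)])

counts = conn.execute(
    "SELECT load_date, COUNT(*) FROM fact_txn GROUP BY load_date ORDER BY load_date"
).fetchall()
print(counts)
```

Stamping each row with its load date is one possible way to keep the batches distinguishable once they share a table.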
-- 1. Insert into target
INSERT INTO <target_table> (columns...)
SELECT columns...
FROM <source_table>
JOIN <related_data>
WHERE ...;
Full History Link (Chain) Model
Concept: A link table contains a primary key, change‑tracking fields, a start date, and an end date to record the validity period of each row.
Benefit: Enables fast retrieval of the data valid on any given date. For slowly changing dimensions it also needs far less storage than a full snapshot per day, because a row is written only when its values change.
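To make the validity period concrete, here is a small sketch using Python's built-in sqlite3 module. It assumes a half-open interval (start date inclusive, end date exclusive) and a sentinel end date of 9999-12-31 for rows that are still current. The table `customer_chain` and the helper `as_of` are hypothetical.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel for a chain that is still open

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_chain (pk INTEGER, city TEXT, start_date TEXT, end_date TEXT)"
)
conn.executemany("INSERT INTO customer_chain VALUES (?, ?, ?, ?)", [
    (1, "Paris", "2020-01-01", "2020-02-03"),  # closed chain
    (1, "Lyon",  "2020-02-03", OPEN_END),      # current chain
])

def as_of(conn, day):
    """Return the rows whose validity period contains the given ISO date."""
    return conn.execute(
        "SELECT pk, city FROM customer_chain "
        "WHERE start_date <= ? AND end_date > ? ORDER BY pk",
        (day, day),
    ).fetchall()

jan = as_of(conn, "2020-01-15")
feb = as_of(conn, "2020-02-05")
print(jan, feb)
```

Because the end date is exclusive, the day on which a chain closes belongs to the new row, so at most one row per key matches any date.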
Example query: Retrieve rows valid on 2020‑02‑05.
SELECT *
FROM <target_table>
WHERE start_date <= DATE '2020-02-05'
AND end_date > DATE '2020-02-05';
Incremental Link Model
Use case: Track incremental changes by closing the current chain and opening a new one whenever the tracked fields of a primary key change.
Implementation logic:
1. Extract yesterday's open‑chain records.
2. Compare PKs with today's source.
3. Close old chains and open new ones for changed rows.
4. Insert new rows for new PKs.
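The steps above can be sketched end to end with Python's built-in sqlite3 module. This is one possible reading, not the article's own implementation: the tables `chain` and `staging`, the helper `load_day`, and the 9999-12-31 sentinel are assumptions, and a PK missing from the day's source is treated as unchanged because the source is incremental.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel marking an open chain

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chain   (pk INTEGER, val TEXT, start_date TEXT, end_date TEXT);
CREATE TABLE staging (pk INTEGER PRIMARY KEY, val TEXT);
""")

def load_day(conn, load_date, source_rows):
    with conn:
        conn.execute("DELETE FROM staging")
        conn.executemany("INSERT INTO staging VALUES (?, ?)", source_rows)
        # Close open chains whose tracked value differs from today's source.
        # IS NOT is SQLite's NULL-safe inequality.
        conn.execute("""
            UPDATE chain SET end_date = ?
            WHERE end_date = ?
              AND EXISTS (SELECT 1 FROM staging s
                          WHERE s.pk = chain.pk AND s.val IS NOT chain.val)
        """, (load_date, OPEN_END))
        # Open a chain for every source row that now has no open chain:
        # this covers changed rows (just closed) and brand-new PKs.
        conn.execute("""
            INSERT INTO chain (pk, val, start_date, end_date)
            SELECT s.pk, s.val, ?, ? FROM staging s
            WHERE NOT EXISTS (SELECT 1 FROM chain c
                              WHERE c.pk = s.pk AND c.end_date = ?)
        """, (load_date, OPEN_END, OPEN_END))

load_day(conn, "2020-02-04", [(1, "A"), (2, "B")])
load_day(conn, "2020-02-05", [(1, "A2"), (3, "C")])  # 1 changed, 3 is new, 2 absent

history = conn.execute(
    "SELECT pk, val, start_date, end_date FROM chain ORDER BY pk, start_date"
).fetchall()
for row in history:
    print(row)
```

Closing the changed chains first lets a single insert handle both changed rows and new keys, since neither has an open chain at that point.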
-- 1. Extract current valid records
INSERT INTO <temp_pre>
SELECT ... FROM <target_table>
WHERE end_date = DATE '<max_date>';
-- 2. Extract today’s source records
-- 3. Identify changed rows and close old chains
-- 4. Open new chains for changes and new PKs
-- (SQL statements follow the same pattern as shown above)
Delete‑Aware Link Model
Use case: Track deletions in incremental data by using business fields to indicate removed rows.
Implementation logic:
1. Extract yesterday's open‑chain data.
2. Identify source rows marked as deleted.
3. Close the corresponding chains in the target.
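A minimal sketch of the chain-closing step, assuming Python's built-in sqlite3 module. The business field that marks removed rows is modelled here as a hypothetical `is_deleted` flag. The tables `chain` and `source`, the helper `close_deleted`, and the 9999-12-31 sentinel are also assumptions.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel marking an open chain

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chain  (pk INTEGER, val TEXT, start_date TEXT, end_date TEXT);
CREATE TABLE source (pk INTEGER PRIMARY KEY, val TEXT, is_deleted INTEGER);
INSERT INTO chain VALUES (1, 'A', '2020-02-04', '9999-12-31'),
                         (2, 'B', '2020-02-04', '9999-12-31');
INSERT INTO source VALUES (2, 'B', 1);  -- today's feed marks PK 2 as deleted
""")

def close_deleted(conn, load_date):
    """Close the open chain of every PK the source flags as deleted."""
    with conn:
        cur = conn.execute("""
            UPDATE chain SET end_date = ?
            WHERE end_date = ?
              AND pk IN (SELECT pk FROM source WHERE is_deleted = 1)
        """, (load_date, OPEN_END))
    return cur.rowcount  # number of chains that were closed

closed = close_deleted(conn, "2020-02-05")
open_pks = [row[0] for row in conn.execute(
    "SELECT pk FROM chain WHERE end_date = ? ORDER BY pk", (OPEN_END,)
)]
print(closed, open_pks)
```

The deleted row stays in the table with a closed validity period, so queries for earlier dates still return it.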
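-- Baseline load only: this template does not implement the chain-closing steps.
-- TRUNCATE discards all history, so do not rerun it on a populated link table.
-- 1. Clear target table
TRUNCATE TABLE <target_table>;
-- 2. Full insert (as baseline)
INSERT INTO <target_table> (columns...)
SELECT columns... FROM <source_table>
JOIN <related_data>
WHERE ...;
Other Considerations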
Best practices suggest adding control fields (insert date, update date, source) to all tables to further trace data changes. ETL algorithms can be customized beyond the standard models to meet specific business needs.
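As a small illustration of such control fields, the sketch below uses Python's built-in sqlite3 module. The table `dim_product` and the column names `insert_date`, `update_date`, and `source_system` are hypothetical choices, not a prescribed standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_product (
        product_id    INTEGER PRIMARY KEY,
        product_name  TEXT,
        insert_date   TEXT NOT NULL,  -- when the row was first loaded
        update_date   TEXT NOT NULL,  -- when the row was last changed
        source_system TEXT NOT NULL   -- which feed supplied the row
    )
""")
conn.execute(
    "INSERT INTO dim_product VALUES (1, 'Widget', '2020-02-04', '2020-02-04', 'crm')"
)
# A later load changes the name and stamps only update_date.
conn.execute(
    "UPDATE dim_product SET product_name = 'Widget v2', update_date = '2020-02-05' "
    "WHERE product_id = 1"
)

audit = conn.execute(
    "SELECT insert_date, update_date, source_system FROM dim_product"
).fetchone()
print(audit)
```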
Huawei Cloud Developer Alliance
The Huawei Cloud Developer Alliance creates a tech sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.
