How Transaction Table2.0 Cuts Data Deduplication Costs by 98% in MaxCompute
This article explains how Renliji's data warehouse team leveraged MaxCompute's Transaction Table2.0 to dramatically reduce incremental data deduplication costs and execution time, while also introducing efficient small‑file merging, time‑travel queries, and future data‑sync strategies for a high‑growth HR SaaS platform.
Business Overview
Renliji, founded by Alibaba DingTalk and Renliwo, provides HR SaaS services such as personnel management, payroll, social security, and value‑added services, serving e‑commerce and retail customers. As a fast‑growing startup with multiple products, each product’s data is independent, creating a challenge for the data‑warehouse team to deliver stable, accurate, and timely data while optimizing compute costs.
Key Pain Points with MaxCompute
When using Alibaba Cloud MaxCompute, the team observed four main reasons for rising incremental‑data deduplication costs:
Incremental data volume is small (MB level) compared with historical data (GB level).
Historical data is recomputed each day even though only a tiny portion changes.
Window‑based deduplication using row_number requires merging yesterday’s full data with today’s increments, leading to high compute cost (≈4.63 CNY per run).
Full data pull from the business DB each day creates heavy load on the source database.
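The window-based pattern described above can be sketched as follows. This is an illustration only: the table names (ods_user_base_full, ods_user_base_inc), the columns, and the ds partition values are hypothetical and are not taken from Renliji's actual jobs.

```sql
-- Sketch of the "ordinary table" approach (all names are hypothetical).
-- Yesterday's full snapshot (GB level) is re-read and re-sorted every day
-- just to absorb today's small increment (MB level).
insert overwrite table ods_user_base_full partition (ds = '20230627')
select user_id, user_name, gmt_modified
from (
    select user_id, user_name, gmt_modified,
           -- keep only the newest row per business key
           row_number() over (partition by user_id order by gmt_modified desc) as rn
    from (
        select user_id, user_name, gmt_modified
        from ods_user_base_full
        where ds = '20230626'      -- yesterday's full data
        union all
        select user_id, user_name, gmt_modified
        from ods_user_base_inc
        where ds = '20230627'      -- today's increment
    ) merged
) ranked
where rn = 1;
```

The cost is driven by the full-snapshot branch of the union, which is scanned and shuffled on every run regardless of how few rows actually changed.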
Transaction Table2.0 Deduplication Improvement
MaxCompute introduced Transaction Table2.0 (released 2023‑06‑27), which supports near‑real‑time storage and computation over both incremental and full data. The data‑warehouse team adopted its primary‑key model for deduplication in three steps:
Perform daily window‑based deduplication on incremental user base information.
Filter out records with null business primary keys because the primary‑key table requires non‑null keys.
Insert the deduplicated incremental data directly into the primary‑key table; the system automatically deduplicates by business key.
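The three steps above might look like the following. This is a minimal sketch: the table and column names are hypothetical, and the DDL uses the primary-key and "transactional" table property form as it appears in the MaxCompute documentation, so it should be checked against the current docs before use.

```sql
-- Hypothetical primary-key table; the key column must be declared not null.
create table if not exists dwd_user_base_pk (
    user_id      bigint not null primary key,
    user_name    string,
    gmt_modified datetime
)
tblproperties ("transactional" = "true");

-- Steps 1 and 2: deduplicate within today's increment only and drop null keys.
-- Step 3: insert into the primary-key table; rows with an existing user_id
-- replace the older version instead of creating a duplicate.
insert into table dwd_user_base_pk
select user_id, user_name, gmt_modified
from (
    select user_id, user_name, gmt_modified,
           row_number() over (partition by user_id order by gmt_modified desc) as rn
    from ods_user_base_inc
    where ds = '20230627'
      and user_id is not null
) ranked
where rn = 1;
```

Unlike the ordinary-table approach, the historical snapshot never appears in the query, so the scanned volume stays at the size of the daily increment.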
Overall Comparison
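Ordinary table: deduplication SQL execution time 151 s, estimated cost 4.63 CNY.
Transaction Table2.0: deduplication SQL execution time 72 s, estimated cost 0.06 CNY.
This corresponds to about 98.7% lower estimated cost ((4.63 - 0.06) / 4.63) and about 52% less execution time ((151 - 72) / 151) per run.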
Small‑File Merging Strategies
Transaction Table2.0 supports near‑real‑time incremental writes and time‑travel queries, which can generate many small files. Two merging mechanisms are provided:
Clustering : merges DeltaFiles into larger files without changing data content; the system triggers it automatically based on file size and count.
Compaction : merges all data files according to a strategy, keeping only the latest version of each primary‑key row; the resulting BaseFile does not support time‑travel but improves query efficiency. Compaction can be triggered manually or automatically via table properties.
Because the primary‑key table does not immediately merge small files, the team recommends either manual merging after INSERT INTO or configuring automatic Clustering. For a single daily data load, Clustering is preferred.
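A manual merge after the daily load could be issued roughly as below. This sketch assumes the compact major form of ALTER TABLE and the odps.merge.task.mode flag as described in the MaxCompute documentation for Transaction Table2.0; the table name is hypothetical, and the exact statement should be verified against the current docs.

```sql
-- Run after the daily INSERT INTO has finished (hypothetical table name).
set odps.merge.task.mode=service;

-- Major compaction: merges the data files and keeps only the latest
-- version of each primary-key row, which speeds up later queries.
alter table dwd_user_base_pk compact major;
```

Because compaction discards older row versions in the resulting BaseFile, it is worth running it only after any time-travel queries on the affected period are no longer needed.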
Time‑Travel Queries and Data Repair
Transaction Table2.0 supports:
TimeTravel : query data as of a specific timestamp or version.
Incremental : query data changes within a time or version range.
Example queries:
select * from mf_tt2 timestamp as of '2023-06-26 09:33:00' where dd='01' and hh='01';
show history for table mf_tt2 partition(dd='01',hh='01');
select * from mf_tt2 version as of 2 where dd='01' and hh='01';
select * from mf_tt2 timestamp between '2023-06-26 09:31:40' and '2023-06-26 09:32:00' where dd='01' and hh='01';
select * from mf_tt2 version between 2 and 3 where dd='01' and hh='01';
For data repair, the workflow is to query the full data snapshot via TimeTravel, insert it into a temporary table, truncate the current primary‑key table, and then insert the corrected data back.
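The repair workflow can be sketched as three statements. The table names, columns, and timestamp below are hypothetical; the timestamp stands for a point just before the bad data was written and must fall within the table's time-travel retention window.

```sql
-- 1. Copy the last known-good snapshot into a temporary table.
create table if not exists tmp_user_base_repair as
select user_id, user_name, gmt_modified
from dwd_user_base_pk timestamp as of '2023-06-26 09:33:00';

-- 2. Clear the current contents of the primary-key table.
truncate table dwd_user_base_pk;

-- 3. Load the known-good rows back; any corrections can be applied
--    in this select before the data is reinserted.
insert into table dwd_user_base_pk
select user_id, user_name, gmt_modified
from tmp_user_base_repair;
```

Staging the snapshot in a separate table first means the good data exists outside the primary-key table before it is truncated.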
Notes and Future Plans
Dynamic Hard Delete : Hard deletion of historical data is not directly supported; soft delete or periodic full re‑insertion can be used. Flink‑CDC + Flink‑SQL can perform real‑time hard deletes, but per‑table CDC tasks are heavy.
Storage Impact : The primary‑key model occupies slightly more storage than partitioned tables, but the daily storage cost is negligible compared with the saved compute cost.
Flink‑CDC Integration : Enables near‑real‑time data sync, improving data freshness.
Whole‑Database Sync : Anticipated integration of Alibaba Cloud Real‑Time Computing Flink CDAS syntax with MaxCompute for full‑library sync and DDL propagation.
Materialized Views : Combining materialized views with Flink‑CDC can achieve efficient query acceleration.
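The soft-delete option mentioned under Dynamic Hard Delete could be approximated as below. This is only a sketch under stated assumptions: the is_deleted flag, the op_type column on the increment, and all table names are hypothetical and do not come from the source.

```sql
-- Hypothetical primary-key table that carries a soft-delete flag.
create table if not exists dwd_user_base_soft (
    user_id      bigint not null primary key,
    user_name    string,
    gmt_modified datetime,
    is_deleted   bigint
)
tblproperties ("transactional" = "true");

-- Upsert today's increment; a source-side delete becomes is_deleted = 1
-- on the same key instead of physically removing the row.
insert into table dwd_user_base_soft
select user_id, user_name, gmt_modified,
       case when op_type = 'DELETE' then 1 else 0 end as is_deleted
from (
    select user_id, user_name, gmt_modified, op_type,
           row_number() over (partition by user_id order by gmt_modified desc) as rn
    from ods_user_base_inc
    where ds = '20230627'
      and user_id is not null
) ranked
where rn = 1;

-- Downstream queries filter out soft-deleted rows.
select user_id, user_name
from dwd_user_base_soft
where is_deleted = 0;
```

This keeps the daily load incremental, at the price of every consumer having to apply the is_deleted filter.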
