
Migrating a Petabyte-Scale Big Data Platform to Alibaba Cloud: Architecture, Challenges, and Lessons Learned

This article details the end‑to‑end migration of a petabyte‑scale big‑data platform to Alibaba Cloud, describing the DSS synchronization system, its integration with Hive Metastore and Airflow, the gray‑release strategy, data‑consistency validation using Presto, and key takeaways for future cloud migrations.

Liulishuo Tech Team

Background: Moving petabyte‑scale data is a major engineering effort that must remain transparent to upstream services. In October 2020, Liulishuo completed a 21‑day migration of its big‑data platform to Alibaba Cloud, and this article outlines the architecture, issues, and insights from that project.

Current Situation: Unlike online services, which can rely on caches or database replication tools (e.g., cloud DTS), big‑data workloads split their state between metadata (Hive Metastore) and the actual files in object storage (e.g., S3). No existing industry product synchronizes both the metadata and the files atomically.

Engineering Architecture: The migration required a custom Data Synchronization System (DSS) built on Juicedata’s open‑source juicesync tool, which supports cross‑cloud file sync and provides high performance. DSS updates the Location field of Hive Metastore objects via its API and writes the changes to the target cloud Metastore.
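At its core, DSS rewrites each object's Location from an S3 URI to the corresponding OSS URI while keeping the object key unchanged. A minimal sketch of that rewrite, assuming a hypothetical bucket mapping (rewrite_location and BUCKET_MAP are illustrative, not part of DSS):

```python
from urllib.parse import urlparse

# Hypothetical mapping from source (S3) buckets to target (OSS) buckets.
BUCKET_MAP = {"prod-warehouse": "prod-warehouse-oss"}

def rewrite_location(location: str) -> str:
    """Rewrite an S3 table/partition Location to its OSS equivalent,
    keeping the object key (path) unchanged."""
    parsed = urlparse(location)
    if parsed.scheme != "s3":
        return location  # already migrated, or not an S3 path
    target_bucket = BUCKET_MAP.get(parsed.netloc, parsed.netloc)
    return f"oss://{target_bucket}{parsed.path}"
```

The rewritten Location is then written back to the target cloud's Metastore through its API, alongside the file copy performed by juicesync.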

Key DSS Features: compute the set of partitions changed each day per table and synchronize them (for non‑partitioned tables, synchronization is driven by the table schema); repair, i.e. re‑synchronize, a specific table, specific partitions, or all partitions of a partitioned table; and pause or resume synchronization on demand, enabling a controlled gray release.

Integration with Airflow: A callback in an Airflow DAG triggers DSS after a compute task finishes, automatically syncing output tables. The workflow includes a sub‑system called Janus that extracts lineage information from Hive or Spark SQL to identify output tables, then queries the Metastore for partition changes.
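A callback of this kind might be wired up as follows; the DSS endpoint URL and payload shape are assumptions for illustration, not the actual DSS API:

```python
import json
import urllib.request

DSS_SYNC_URL = "http://dss.internal/api/v1/sync"  # hypothetical endpoint

def build_sync_payload(dag_id: str, task_id: str, ds: str) -> dict:
    """Identify which task's output tables DSS should sync."""
    return {"dag_id": dag_id, "task_id": task_id, "execution_date": ds}

def trigger_dss_sync(context):
    """Airflow on_success_callback: notify DSS after the compute task finishes."""
    ti = context["task_instance"]
    payload = build_sync_payload(ti.dag_id, ti.task_id, context["ds"])
    req = urllib.request.Request(
        DSS_SYNC_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=30)

# Attached to a task, e.g.:
# run_etl = HiveOperator(..., on_success_callback=trigger_dss_sync)
```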

Example SQL used by Janus:

INSERT OVERWRITE TABLE dw.dw_zh_user_tag PARTITION (dt='${dt}')
SELECT a.user_id,
       'c1' AS tag,
       'c2' AS tag_comment
FROM dw.temp_dw_zh_user_tag_inc a
LEFT JOIN dw.temp_dw_zh_user_tag_inc_07 b
ON a.user_id = b.user_id
WHERE ...
GROUP BY 1,2,3;

Janus outputs lineage information such as:

inserted_tables:['dw.dw_zh_user_tag'], input_tables:['dw.temp_dw_zh_user_tag_inc','dw.temp_dw_zh_user_tag_inc_07']
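A simplified sketch of this extraction step (Janus parses Hive/Spark SQL properly; the regexes below only cover the common INSERT OVERWRITE … FROM/JOIN shape and are illustrative):

```python
import re

# Output tables: "INSERT OVERWRITE TABLE <name>".
INSERT_RE = re.compile(r"INSERT\s+OVERWRITE\s+TABLE\s+([\w.]+)", re.IGNORECASE)
# Input tables: anything following FROM or JOIN (no subquery handling here).
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def extract_lineage(sql: str) -> dict:
    inserted = INSERT_RE.findall(sql)
    # Deduplicate while preserving order; drop tables we also write to.
    inputs = list(dict.fromkeys(
        t for t in SOURCE_RE.findall(sql) if t not in inserted))
    return {"inserted_tables": inserted, "input_tables": inputs}
```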

Using the identified output tables, DSS queries the Metastore to retrieve partition change metadata:

SELECT p.PART_NAME, s.LOCATION
FROM DBS a
JOIN TBLS b ON a.DB_ID = b.DB_ID
JOIN PARTITIONS p ON p.TBL_ID = b.TBL_ID
JOIN SDS s ON p.SD_ID = s.SD_ID
JOIN (
  SELECT PART_ID, PARAM_VALUE
  FROM PARTITION_PARAMS
  WHERE PARAM_KEY = 'transient_lastDdlTime'
    AND PARAM_VALUE >= %d AND PARAM_VALUE <= %d
) pp ON p.PART_ID = pp.PART_ID
WHERE a.NAME = %r AND b.TBL_NAME = %r
ORDER BY p.PART_NAME DESC;
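The %d bounds in the query are epoch‑second limits on transient_lastDdlTime for the day being synchronized. A sketch of how that window might be computed (the UTC+8 default is an assumption about the platform's timezone):

```python
from datetime import datetime, timedelta, timezone

def ddl_time_window(day: str, tz_offset_hours: int = 8):
    """Inclusive epoch-second bounds for partitions changed on the given
    calendar day; transient_lastDdlTime is stored as epoch seconds."""
    tz = timezone(timedelta(hours=tz_offset_hours))
    start = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=tz)
    end = start + timedelta(days=1)
    return int(start.timestamp()), int(end.timestamp()) - 1
```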

Gray Release Strategy: ETL jobs run nightly (00:00–05:00). Results produced in this window must not reach OSS until they have been validated, so DSS provides a switch to pause synchronization. During the day, developers' changes are synced in near real time; once the nightly output is validated, the switch is turned on and the accumulated data propagates, analogous to filling a reservoir and then opening the gate.
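The reservoir‑and‑gate behavior can be sketched as a small gate object that buffers sync requests while paused and flushes them on resume (illustrative, not the actual DSS code):

```python
from collections import deque

class SyncGate:
    """Buffers sync requests while paused; flushes them when resumed."""

    def __init__(self, sync_fn):
        self.sync_fn = sync_fn   # performs the actual file + metadata sync
        self.paused = False
        self.pending = deque()

    def request_sync(self, partition):
        if self.paused:
            self.pending.append(partition)   # fill the reservoir
        else:
            self.sync_fn(partition)          # near-real-time path

    def pause(self):
        self.paused = True

    def resume(self):
        """Open the gate: flush everything buffered during the ETL window."""
        self.paused = False
        while self.pending:
            self.sync_fn(self.pending.popleft())
```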

Data‑Consistency Validation: After both clouds finish their ETL runs, a coarse per‑table row‑count comparison filters out mismatched tables; a CRC32 checksum plus Presto federated queries then pinpoint row‑level differences across tens of thousands of partitions. This uncovered issues such as non‑deterministic functions (e.g., row_number, collect_list) and double‑precision aggregation inconsistencies.

Example Presto comparison query:

SELECT plan_id,
       trigger_cnt,
       trigger_dcnt,
       related_cnt,
       related_dcnt,
       success_cnt,
       success_dcnt,
       fail_cnt,
       fail_dcnt,
       ttl_cnt,
       count(1)
FROM (
  SELECT * FROM aliyun.dw_adl.public_msg_plan_report WHERE dt = '${dt}'
  UNION ALL
  SELECT * FROM aws.dw_adl.public_msg_plan_report WHERE dt = '${dt}'
) t
GROUP BY plan_id, trigger_cnt, trigger_dcnt, related_cnt, related_dcnt, success_cnt, success_dcnt, fail_cnt, fail_dcnt, ttl_cnt
HAVING count(1) = 1;
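The CRC32 checksum mentioned above can be sketched as follows: serialize each row with a field separator, sort so the result is insensitive to row order, and accumulate one CRC per partition (the serialization scheme is an assumption):

```python
import zlib

def partition_checksum(rows) -> int:
    """CRC32 over a canonical serialization of a partition's rows.
    Rows are sorted first, so the checksum is order-independent."""
    crc = 0
    # \x01 as a field separator avoids collisions like ("ab","c") vs ("a","bc").
    for row in sorted("\x01".join(map(str, r)) for r in rows):
        crc = zlib.crc32(row.encode("utf-8"), crc)
    return crc
```

Comparing the checksum of a partition on AWS against the same partition on Alibaba Cloud narrows the diff to the handful of partitions worth inspecting with the Presto query above.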

Conclusion: The migration demonstrated that event‑driven hooks on Hive Metastore combined with a custom DSS can achieve reliable metadata and file synchronization at PB scale. Data inconsistency proved to be the biggest challenge, mitigated by Presto‑based diff queries and lineage‑driven debugging. Currently, no single product fully addresses these migration needs, but cloud providers are expected to bundle such capabilities as more enterprises move to the cloud.

Author: Dong Yajun, Tech Lead, Data Engineering Team, Liulishuo.
