
How We Migrated 100k BigQuery SQL Scripts to MaxCompute Using AST and LLM Automation

This article details a real-world migration of a Southeast Asian tech group's data warehouse from Google BigQuery to Alibaba Cloud MaxCompute: the challenges of converting 100,000 SQL scripts, the AST-driven and LLM-assisted automation pipeline, rule-engine iteration, quality control, and the measurable performance and cost benefits achieved.

Alibaba Cloud Big Data AI Platform

Project Background

A Southeast Asian technology conglomerate (referred to as GoTerra) needed to migrate a multi‑petabyte data warehouse from Google Cloud BigQuery to Alibaba Cloud MaxCompute for cost, compliance and business integration reasons. The warehouse contained over 100,000 SQL scripts ranging from simple analytics to deeply nested ETL logic. Key challenges were:

High-quality, rapid conversion of 100k SQL scripts, including functions such as UNNEST, SAFE_CAST, and FORMAT_DATE, and complex types such as ARRAY, STRUCT, and BIGNUMERIC.

Extensive use of BigQuery‑specific syntax with no one‑to‑one equivalents in MaxCompute.

Strong coupling between SQL, business rules and data models, making manual rewrites risky.

Manual rewriting was infeasible: at five scripts per engineer per day, the effort would require ~20,000 person-days (≈1 year for a 100-person team), far exceeding the six-month launch target.
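As a quick sanity check, the estimate above is simple arithmetic (the ~250 working days per year figure is an assumption for illustration):

```python
# Sanity check for the manual-rewrite effort estimate quoted above.
scripts = 100_000
scripts_per_engineer_per_day = 5
team_size = 100
working_days_per_year = 250  # assumption: ~250 working days/year

person_days = scripts / scripts_per_engineer_per_day   # 20,000 person-days
calendar_days = person_days / team_size                # 200 working days
years = calendar_days / working_days_per_year          # ~0.8 years of working time

print(person_days, calendar_days, round(years, 2))
```

With weekends, ramp-up, and review overhead, 200 working days of a 100-person team lands at roughly a year of calendar time, consistent with the figure above.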

Solution Overview

The team adopted a “tool‑driven + human‑in‑the‑loop” approach built on Alibaba Cloud Cloud Migration Hub (CMH) / Lake Migration Hub (LMH). The solution combined automated migration, a custom rule engine, AST‑based transformation, large language model (LLM) assistance and manual verification to form an end‑to‑end pipeline.

Migration was split into two phases:

Data Migration: Transfer BigQuery tables (full and incremental) to MaxCompute using online migration tools and MaxCompute Migration Serverless (MMS).

SQL Conversion: Batch-convert BigQuery SQL to MaxCompute SQL via CMH/LMH, iteratively refine rules, augment with LLM, and apply manual fixes for edge cases.

Technical Implementation

Automated Transformation Engine

AST Parsing: BigQuery SQL is parsed into an abstract syntax tree (AST) using ANTLR or a proprietary parser. Example conversion:

SELECT APPROX_COUNT_DISTINCT(column) → SELECT COUNT(DISTINCT column)
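This particular conversion can be illustrated with a deliberately simplified string-level stand-in; the production pipeline operates on AST nodes, which a regex cannot replace for nested expressions. Note that the converted form computes an exact count rather than an approximate one:

```python
import re

# Simplified stand-in for the AST-based rewrite described above: a single
# rule mapping APPROX_COUNT_DISTINCT(x) -> COUNT(DISTINCT x). A regex is
# only safe here because the matched argument contains no nested parentheses.
APPROX_RE = re.compile(r"APPROX_COUNT_DISTINCT\(\s*([^()]+?)\s*\)", re.IGNORECASE)

def rewrite_approx_count_distinct(sql: str) -> str:
    return APPROX_RE.sub(r"COUNT(DISTINCT \1)", sql)

print(rewrite_approx_count_distinct("SELECT APPROX_COUNT_DISTINCT(user_id) FROM t"))
# -> SELECT COUNT(DISTINCT user_id) FROM t
```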

Rule Engine: Over 200 built-in migration rules cover data-type mapping (e.g., TIMESTAMP → DATETIME), function replacement (e.g., SAFE_CAST → CAST), and window-function adjustments. Rules can be extended dynamically based on feedback.
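A minimal sketch of what a table-driven rule engine can look like, assuming rules reduce to pattern/replacement pairs; the real CMH/LMH rules also carry AST node types and context conditions:

```python
import re

# Illustrative rule table using two mappings named in this article.
# (Production rules are AST-aware; plain word-boundary substitution is
# only a sketch and would mis-fire on identifiers or string literals.)
RULES = [
    (re.compile(r"\bSAFE_CAST\b", re.IGNORECASE), "CAST"),
    (re.compile(r"\bTIMESTAMP\b", re.IGNORECASE), "DATETIME"),
]

def apply_rules(sql: str) -> str:
    for pattern, replacement in RULES:
        sql = pattern.sub(replacement, sql)
    return sql

print(apply_rules("SELECT SAFE_CAST(x AS TIMESTAMP) FROM t"))
# -> SELECT CAST(x AS DATETIME) FROM t
```

Keeping the rules as data rather than code is what allows them to be "extended dynamically based on feedback", as described above.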

LLM Assistance: For complex or ambiguous statements (custom UDFs, proprietary syntax), a large language model (e.g., Tongyi Qianwen) generates semantic equivalents, such as converting STRUCT to LATERAL VIEW expressions.
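The rules-first, LLM-fallback dispatch might be structured as below. Both `demo_rules` and `demo_llm` are hypothetical stand-ins for illustration, not the actual CMH/LMH interfaces or a real model call:

```python
# Sketch of rules-first, LLM-fallback dispatch. The rule engine reports
# whether it fully converted the statement; only leftovers go to the LLM.
def translate(sql, rule_engine, llm_translate):
    result, fully_converted = rule_engine(sql)
    if fully_converted:
        return result, "rules"
    # Fall back to the LLM for constructs the rules cannot handle.
    return llm_translate(sql), "llm"

def demo_rules(sql):
    # Pretend the rule engine cannot handle STRUCT literals.
    if "STRUCT" in sql:
        return sql, False
    return sql.replace("SAFE_CAST", "CAST"), True

def demo_llm(sql):
    # Stand-in for a call to an LLM such as Tongyi Qianwen.
    return "-- LLM-suggested rewrite of: " + sql

out, source = translate("SELECT STRUCT(1 AS a)", demo_rules, demo_llm)
print(source)  # -> llm
```

Routing only the unconvertible minority to the LLM keeps the bulk of the 100k scripts on the cheap, deterministic rule path.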

Manual Review & Quality Control

Automated Defect Detection: Static analysis and runtime testing on a test cluster flag conversion failures (e.g., type mismatches) and suggest fixes.
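One way to sketch such a static check is to scan converted output for BigQuery-only constructs that survived conversion. The watch list below contains only the functions named in this article, not an exhaustive inventory:

```python
import re

# BigQuery-specific functions mentioned in this article; any occurrence
# in "converted" SQL indicates an incomplete conversion.
BIGQUERY_ONLY = ["SAFE_CAST", "FORMAT_DATE", "UNNEST", "APPROX_COUNT_DISTINCT"]

def find_defects(sql: str):
    return [fn for fn in BIGQUERY_ONLY
            if re.search(r"\b" + fn + r"\b", sql, re.IGNORECASE)]

print(find_defects("SELECT FORMAT_DATE('%Y', d) FROM t"))  # -> ['FORMAT_DATE']
```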

Expert Intervention Workflow: Issues are prioritized (P0–P3) and resolved in an IDE-based environment with Git version control for traceability.

Knowledge-Base Feedback Loop: Each manual correction is captured, transformed into a new rule or case, and stored in a reusable migration knowledge base.
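A toy illustration of that feedback loop: a manual before/after fix is recorded once, then replayed automatically on future scripts. The field names are illustrative, not the actual knowledge-base schema, and the BIGNUMERIC mapping shown is an example (and lossy, since BIGNUMERIC precision exceeds a standard DECIMAL):

```python
# Record manual corrections as (before, after) pairs and replay them on
# later scripts, so each fix is paid for only once.
knowledge_base = []

def record_fix(before: str, after: str, note: str = ""):
    knowledge_base.append({"before": before, "after": after, "note": note})

def replay(sql: str) -> str:
    for rule in knowledge_base:
        sql = sql.replace(rule["before"], rule["after"])
    return sql

# Example entry: an engineer's manual type-mapping decision, now reusable.
record_fix("BIGNUMERIC", "DECIMAL(38,18)", "reviewer-approved type mapping")
print(replay("CAST(x AS BIGNUMERIC)"))  # -> CAST(x AS DECIMAL(38,18))
```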

Post‑Conversion Validation & Optimization

Functional Verification: Execute converted SQL on MaxCompute and compare results with BigQuery to verify business-logic consistency.
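A minimal sketch of such a comparison, assuming results are fetched as row tuples: comparing row multisets rather than ordered lists avoids false alarms when the two engines return rows in different orders:

```python
from collections import Counter

# Order-insensitive result comparison: two result sets match if they
# contain the same rows with the same multiplicities.
def results_match(bigquery_rows, maxcompute_rows) -> bool:
    return Counter(map(tuple, bigquery_rows)) == Counter(map(tuple, maxcompute_rows))

print(results_match([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # -> True
```

A production comparison would also need tolerances for floating-point columns and normalization of timestamp time zones, which this sketch omits.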

Performance Tuning: Use MaxCompute execution-plan analysis to identify bottlenecks (full scans, join order) and apply indexing, partition pruning or predicate push-down.

Continuous Knowledge-Base Enrichment: Capture performance-related insights (function-level optimizations) for future migrations.

Core Transformation Process

Parse: Convert BigQuery SQL to an AST, preserving node types.

Transform: Apply rule-engine replacements and LLM-generated suggestions. Example mappings:

FORMAT_DATE('%Y-%m-%d', col) → TO_CHAR(col, 'yyyy-MM-dd')
UNNEST(array_col) → LATERAL VIEW EXPLODE(array_col)

Generate: Emit MaxCompute-compatible SQL.
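The three steps can be collapsed into a string-level sketch for the two mappings above. This is a simplification: the real pipeline applies these as AST transformations, and the UNNEST rewrite in particular is structural (it moves the expression into a LATERAL VIEW clause with an alias), which a plain substitution cannot fully express:

```python
import re

# Toy "transform" stage covering the two example mappings above.
def transform(sql: str) -> str:
    # FORMAT_DATE('%Y-%m-%d', col) -> TO_CHAR(col, 'yyyy-MM-dd')
    sql = re.sub(r"FORMAT_DATE\('%Y-%m-%d',\s*(\w+)\)",
                 r"TO_CHAR(\1, 'yyyy-MM-dd')", sql)
    # UNNEST(array_col) -> LATERAL VIEW EXPLODE(array_col)
    sql = re.sub(r"UNNEST\((\w+)\)", r"LATERAL VIEW EXPLODE(\1)", sql)
    return sql

print(transform("SELECT FORMAT_DATE('%Y-%m-%d', dt) FROM t"))
# -> SELECT TO_CHAR(dt, 'yyyy-MM-dd') FROM t
```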

Results & Impact

SQL conversion efficiency increased by more than 80%, reducing the migration timeline from months to weeks.

Conversion accuracy exceeded 90% for most scenarios; only ~10% of scripts required manual intervention.

Tool-driven conversion rate rose from 5% to 80%, cutting labor costs by over 70%.

More than 1,000 reusable rules and cases were added to the knowledge base, improving ROI for subsequent projects.

Future Outlook

Extend CMH/LMH to fully automated, multi‑dialect migration (e.g., Snowflake, Databricks).

Expose an online real‑time SQL conversion API for hybrid‑cloud analytics.

Integrate an execution-plan- and cost-model-aware automatic SQL optimizer.

Leverage the knowledge base with Retrieval‑Augmented Generation (RAG) to boost LLM consistency and explainability.


Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
