
Gaotu Data Platform Cloud Migration and Cost Management Case Study

This article presents a detailed case study of how Gaotu migrated its self‑built big‑data platform from Alibaba Cloud to Tencent Cloud, covering background challenges, a phased migration plan, implementation lessons, cost‑control measures, and future optimization strategies, with practical insights for similar enterprises.

DataFunSummit

In response to China's "Double Reduction" policy and rising operational costs, Gaotu decided to move its self‑built big‑data platform onto managed cloud services, migrating the underlying cluster from Alibaba Cloud to Tencent Cloud EMR over a three‑month period starting in September 2021.

The case study is organized into four parts: (1) background and pain points of the self‑built platform, (2) difficulties and challenges faced, (3) concrete implementation steps and lessons learned, and (4) current cost‑control status and future plans.

Key issues of the original platform included high cost, technology sprawl (Storm, Spark Streaming, Flink, HBase, ES, ClickHouse, Impala, Kudu, Presto, Kylin), lack of middle‑platform tools, and accumulated legacy problems.

The migration adopted a three‑stage approach: short‑term cluster lift‑and‑shift, mid‑term adoption of cloud PaaS products, and long‑term innovation with new technologies.

During implementation, Gaotu evaluated cloud vendors using a five‑dimensional model (cost & service, migration complexity, frontier technology, team stability, and service reliability) and ultimately selected Tencent Cloud, replacing the real‑time framework with Oceanus and using EMR for offline workloads.
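A five‑dimensional comparison like this is often reduced to a weighted score per vendor. The sketch below is only an illustration of that idea: the dimension keys are taken from the list above, but the weights, the 1 to 5 scoring scale, and the function name are assumptions, not Gaotu's actual model or numbers.

```python
from typing import Dict

# Dimension names follow the article; everything else here is hypothetical.
DIMENSIONS = (
    "cost_service",
    "migration_complexity",   # assumed: a higher score means an easier migration
    "frontier_technology",
    "team_stability",
    "service_reliability",
)


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of one vendor's scores across all dimensions."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight
```

Normalizing by the total weight keeps the result on the same scale as the input scores, so vendors can be compared directly even if the weights do not sum to one.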

Migration principles emphasized zero impact on C‑end services (night‑time execution), limited tolerance for B‑end downtime, handling of unclaimed tasks, dual‑run for critical offline pipelines, and careful rollback procedures.
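For the dual‑run principle, one simple check is to run the same pipeline on both clusters and compare per‑table row counts before cutting over. The following is a minimal sketch under that assumption; the function name, the input shape (table name mapped to row count), and the tolerance parameter are hypothetical, and the source does not say which metrics Gaotu actually compared.

```python
from typing import Dict, List, Tuple


def compare_runs(
    old_counts: Dict[str, int],
    new_counts: Dict[str, int],
    tolerance: int = 0,
) -> List[Tuple[str, str]]:
    """Compare row counts from the old and new clusters and list discrepancies."""
    issues: List[Tuple[str, str]] = []
    for table in sorted(set(old_counts) | set(new_counts)):
        if table not in new_counts:
            issues.append((table, "missing on new cluster"))
        elif table not in old_counts:
            issues.append((table, "missing on old cluster"))
        else:
            diff = abs(old_counts[table] - new_counts[table])
            if diff > tolerance:
                issues.append((table, f"row count differs by {diff}"))
    return issues
```

An empty result is a necessary but not sufficient signal; a real check would likely add column‑level checksums or sampled value comparisons.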

Data migration moved metadata and data together: prune unused tables, take a full snapshot of the metastore, copy the data in full, then apply incremental snapshots and run Hive's MSCK REPAIR TABLE to register the newly copied partitions. Along the way the team had to deal with silent tables, partition explosion, and small‑file proliferation.
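The incremental step can be pictured as a diff between partition listings on the two clusters, followed by a metastore repair for each table that received new partitions. This is a sketch under assumptions: the listing format (table name mapped to a set of partition specs) and the function name are hypothetical, and only the MSCK REPAIR TABLE statement itself is standard Hive syntax.

```python
from typing import Dict, List, Set, Tuple


def plan_incremental_sync(
    source: Dict[str, Set[str]],
    target: Dict[str, Set[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return partitions still to copy, plus the Hive repair statements to run after."""
    to_copy: Dict[str, List[str]] = {}
    for table, source_partitions in source.items():
        # Partitions present on the source but not yet on the target.
        missing = sorted(source_partitions - target.get(table, set()))
        if missing:
            to_copy[table] = missing
    # After the files are copied, MSCK REPAIR TABLE registers the new
    # partition directories in the target metastore.
    statements = [f"MSCK REPAIR TABLE {table};" for table in sorted(to_copy)]
    return to_copy, statements
```

Repairing only the tables that actually changed keeps the repair pass short, which matters when partition counts are large.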

Task migration followed a four‑quadrant risk assessment, focusing on critical paths, automated verification, and resource monitoring via YARN REST APIs.
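Resource monitoring through the YARN REST API can start from the ResourceManager's cluster metrics endpoint, /ws/v1/cluster/metrics. The sketch below assumes the ResourceManager is reachable over plain HTTP at a caller‑supplied host and port; the function names are hypothetical, and the source does not specify which endpoints or thresholds Gaotu used.

```python
import json
from typing import Dict
from urllib.request import urlopen


def metrics_url(rm_address: str) -> str:
    """Build the cluster metrics URL for a ResourceManager at host:port."""
    return f"http://{rm_address}/ws/v1/cluster/metrics"


def fetch_cluster_metrics(rm_address: str, timeout: float = 10.0) -> dict:
    """Fetch the 'clusterMetrics' object from the ResourceManager (network call)."""
    with urlopen(metrics_url(rm_address), timeout=timeout) as response:
        return json.load(response)["clusterMetrics"]


def utilization(metrics: dict) -> Dict[str, float]:
    """Compute memory and vcore utilization as fractions between 0 and 1."""
    total_mb = metrics["totalMB"]
    total_vcores = metrics["totalVirtualCores"]
    return {
        "memory": metrics["allocatedMB"] / total_mb if total_mb else 0.0,
        "vcores": metrics["allocatedVirtualCores"] / total_vcores if total_vcores else 0.0,
    }
```

Keeping the fetch separate from the calculation makes the utilization logic easy to test against a saved response without touching a live cluster.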

Cloud spend fell by 40% after the migration, but storage and compute waste crept back as business demand grew. Gaotu tackled storage waste by monitoring table access, cleaning up silent tables, reducing partition counts, and consolidating small files; the silent‑table rate dropped from 80% to 40% and storage usage fell by 5%.
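A silent‑table rate can be derived from last‑access timestamps once table access is being monitored. The sketch below assumes a table counts as silent if it has never been read or was last read before a cutoff; the 90‑day window, the input shape, and the function names are illustrative assumptions, not the definition Gaotu used.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def silent_tables(
    last_access: Dict[str, Optional[date]],
    today: date,
    window_days: int = 90,
) -> List[str]:
    """Return tables never accessed, or last accessed before the cutoff date."""
    cutoff = today - timedelta(days=window_days)
    return sorted(
        table
        for table, accessed in last_access.items()
        if accessed is None or accessed < cutoff
    )


def silent_rate(
    last_access: Dict[str, Optional[date]],
    today: date,
    window_days: int = 90,
) -> float:
    """Return the fraction of tables that are silent (0.0 for an empty catalog)."""
    if not last_access:
        return 0.0
    return len(silent_tables(last_access, today, window_days)) / len(last_access)
```

Tracking this rate over time gives a single number to watch while cleanup proceeds.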

Compute‑resource governance introduced tidal‑style scaling (expanding resources at night, shrinking during the day) and task‑level resource alerts, aiming for a further 20% cost cut by Q4.
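Tidal scaling amounts to choosing a target cluster size from the time of day. The following sketch assumes a fixed night window and two node counts; the window boundaries, parameter names, and function name are hypothetical, and a real setup would hand the result to the cloud provider's scaling mechanism rather than compute it in isolation.

```python
def target_nodes(
    hour: int,
    base_nodes: int,
    peak_nodes: int,
    night_start: int = 22,
    night_end: int = 6,
) -> int:
    """Return the desired node count for a given hour of the day (0 to 23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if night_start <= night_end:
        # Window does not cross midnight, e.g. 01:00 to 05:00.
        in_night_window = night_start <= hour < night_end
    else:
        # Window crosses midnight, e.g. 22:00 to 06:00.
        in_night_window = hour >= night_start or hour < night_end
    return peak_nodes if in_night_window else base_nodes
```

With the defaults, the cluster runs at peak size from 22:00 until just before 06:00, when offline batch jobs typically run, and at the base size for the rest of the day.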

Future plans include adopting more PaaS tools, integrating new technologies, migrating legacy engines (Kudu, Impala), and continuing to refine both storage and compute cost‑optimization practices.

Q&A sections address migration sequencing, vendor selection criteria (cost, operational capability, responsibility), and handling massive incremental data during migration.

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
