From a Decade-Long Big Data Journey to a Cloud‑Native Lakehouse
This article chronicles the ten‑year evolution of a self‑built big data platform: early Hadoop clusters, successive migrations to Spark, Hive, Hudi, and StarRocks, and the operational challenges encountered along the way. It then covers the comprehensive shift to Alibaba Cloud EMR Serverless, which delivered significant cost, performance, and stability gains, and closes with future intelligent‑ecosystem plans.
Background and Evolution of the Data Platform
In 2018 the company built a ten‑node Hadoop cluster to break data silos and support ~20 million users with ~300 GB of daily new data. From 2018 to 2021 the MapReduce framework proved too slow, so the offline processing engine was migrated to Apache Spark 2.x and Hive 3.x, handling roughly 3,000 batch jobs per day.
2022 saw the introduction of Apache Hudi for incremental updates, reducing data freshness latency to the hour level. In 2023 StarRocks replaced Spark Thrift Server as the ad‑hoc query engine, serving >8,000 daily SQL queries with a 95th‑percentile response time under 60 seconds. By 2024 the on‑premises cluster had been migrated to Alibaba Cloud EMR Serverless, scaling YARN nodes beyond 1,000 and upgrading Spark to 3.x.
The upgraded cluster now stores 10 PB (single replica) with ~30 TB of daily ingest, supports >3,000 reports, runs >15,000 workflows, and operates >30 independent StarRocks instances. Daily YARN job count exceeds 40,000.
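Hudi's hour‑level freshness comes from upsert‑style incremental writes rather than full partition rewrites. A minimal sketch of the writer options involved follows; the table, key, and precombine fields here are illustrative, not the platform's actual schema:

```python
# Illustrative Hudi upsert configuration; table and field names are hypothetical.
hudi_options = {
    "hoodie.table.name": "user_events",
    "hoodie.datasource.write.recordkey.field": "event_id",    # dedup key per record
    "hoodie.datasource.write.precombine.field": "event_ts",   # latest version wins
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "upsert",            # incremental merge, not overwrite
}

# In a Spark job these options would be applied roughly as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Because updates merge by record key instead of reprocessing whole partitions, freshness is bounded by the micro‑batch cadence rather than the nightly batch window.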
Key Pain Points of the Self‑Built Stack
Stability: NodeManager bandwidth limits, large shuffle volumes, and Hive 3.x bugs caused SLA breaches under high concurrency.
Elastic Resource Management: Manual hardware provisioning (2–3 days) and YARN capacity scheduling led to uneven queue utilization and an inability to scale quickly during traffic spikes.
Operational Efficiency: Extensive tuning, troubleshooting, and custom scripts for scaling and upgrades increased labor costs.
Spark Engine Issues
Peak jobs consumed >90% of cluster resources, starving small jobs; off‑peak utilization dropped to ~30%.
Open‑source Shuffle Service became a bottleneck, causing node failures under heavy load.
Migration from Spark 2.x to 3.x required extensive compatibility testing.
Mixed offline‑warehousing and ad‑hoc workloads made cost allocation difficult.
StarRocks Limitations
Data Ingestion: Stream Load was slow and memory‑intensive; Broker Load's soft resource isolation degraded read performance, forcing reliance on Spark Load for large volumes.
Resource Isolation: Soft isolation caused competition between ingestion and query workloads.
Operations: The open‑source version lacked a built‑in management console, requiring custom scripts for scaling, upgrades, and health checks.
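For context on the ingestion path, Stream Load is an HTTP PUT against the StarRocks frontend (FE). A minimal request‑building sketch follows; the host, database, table, and credentials are placeholders, and the default FE HTTP port 8030 is assumed:

```python
import base64
from urllib.request import Request

def build_stream_load(fe_host, db, table, label, payload: bytes, user, password):
    """Build (but do not send) a StarRocks Stream Load request."""
    # Stream Load targets /api/{db}/{table}/_stream_load on the FE's HTTP port.
    url = f"http://{fe_host}:8030/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "label": label,                 # unique label makes retries idempotent
        "column_separator": ",",        # CSV payload in this sketch
        "Expect": "100-continue",
    }
    return Request(url, data=payload, headers=headers, method="PUT")

req = build_stream_load("fe.example.internal", "dw", "events",
                        "load_001", b"1,a\n2,b\n", "etl_user", "secret")
```

Each such request is buffered and sorted in BE memory before flushing, which is why large, frequent loads were memory‑intensive on the self‑built cluster and pushed bulk ingestion toward Spark Load.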
Full‑Stack Migration to Alibaba Cloud EMR Serverless
The migration focused on two engine upgrades and a storage redesign:
All Hive SQL was rewritten to Spark SQL, leveraging EMR Serverless Spark's higher execution efficiency and compatibility with ecosystems such as Kyuubi and Livy.
StarRocks was switched from an integrated compute‑storage deployment to a compute‑separated architecture, aligning with serverless best practices.
Higher‑level data services—including a one‑stop development platform, tagging system, real‑time development platform, data‑quality monitoring, and ad‑hoc query service—were built on top of the new stack.
Storage was migrated from HDFS to OSS‑HDFS, eliminating single‑point failures and reducing infrastructure cost to roughly one‑tenth of the self‑built solution.
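In rewrites of this kind, most statements typically carry over unchanged and the work centers on session settings and engine‑specific behavior. One common (illustrative, not taken from the platform's codebase) case is dynamic partition overwrite; table and column names below are hypothetical:

```sql
-- Hive: dynamic partitions required relaxing the partition mode.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE dw.orders PARTITION (dt)
SELECT order_id, amount, dt FROM ods.orders_raw;

-- Spark SQL: the statement itself is identical; overwrite granularity is
-- instead controlled by a Spark setting so only touched partitions are replaced.
SET spark.sql.sources.partitionOverwriteMode=dynamic;
INSERT OVERWRITE TABLE dw.orders PARTITION (dt)
SELECT order_id, amount, dt FROM ods.orders_raw;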
EMR Serverless Spark Features
One‑stop data platform: Integrated development, debugging, scheduling, and operations with a built‑in SQL editor, notebook, version control, workflow scheduler, and diagnostic tools.
Configuration flexibility: Spark default parameters can be tuned, and jobs are submitted via standard spark-submit commands, e.g.

```shell
spark-submit \
  --master yarn \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.driver.memory=4g \
  --conf spark.executor.memory=8g \
  my_job.py
```

Monitoring & diagnosis: Detailed resource metrics per workspace, queue, and task; automated optimization suggestions after each job.
Elasticity: Driver/Executor can scale down to a single core; container launch time <20 s; abundant IaaS and internal resource pools prevent shortages.
Performance: Proprietary Fusion engine with vectorized computation and Remote Shuffle Service (RSS) delivers >30% faster execution compared with open‑source Spark.
EMR Serverless StarRocks Advantages
Rich management UI for instance creation, scaling, configuration, network, whitelist, task, and gateway management.
Health reports, slow‑SQL diagnostics, visual cache management, rolling upgrades, and full‑link audit.
True compute‑storage separation with physical isolation; independent workload groups improve internal‑table query performance by ~100% and lake‑query performance by ~50%.
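The workload isolation above maps onto StarRocks resource groups, which bound CPU and memory per class of workload. A hedged sketch of the kind of DDL involved follows; the group name, user, and limits are illustrative, not the platform's actual settings:

```sql
-- Hypothetical group reserving cores and memory for ingestion so that
-- ad-hoc queries running under other groups are not starved.
CREATE RESOURCE GROUP ingest_rg
TO (user = 'etl_user')
WITH ("cpu_core_limit" = "16", "mem_limit" = "30%");
```

Queries are matched to a group by classifiers such as the session user, so ingestion and ad‑hoc traffic stop competing for the same resources.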
Quantitative Benefits of the Upgrade
Cost optimization: Elastic resource usage reduced total cost by ~25.4% compared with the traditional self‑built solution.
Stability: EMR Serverless Spark's high‑performance shuffle service and StarRocks performance enhancements lowered SLA‑breach incidents and cut 95th‑percentile query latency to under 60 seconds.
Business agility: Rapid provisioning of compute resources shortened time‑to‑market for new scenarios.
Operational efficiency: Serverless management decreased daily maintenance workload by ~40% and provided 24/7 technical support.
Performance gains: EMR Serverless Spark jobs ran >30% faster; StarRocks queries (e.g., tagging and user‑segmentation) were 30% quicker.
Future Intelligent‑Ecosystem Vision
Post‑migration, the roadmap focuses on four pillars:
Data processing : Leverage Alibaba Cloud EMR and PAI for collaborative data pipelines.
Business process optimization : Integrate large‑scale AI models for predictive risk control, automated operations, and intelligent monitoring.
Application layer : Build a data‑driven closed‑loop that supports intelligent decision‑making.
Algorithm innovation : Develop industry‑specific AI model libraries on PAI.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.