Case Study: Migrating Baicaowei's On‑Premise Hadoop Data Platform to Alibaba Cloud Native Data Lake
This article details Baicaowei's migration from an IDC‑hosted Hadoop cluster to a cloud‑native data lake on Alibaba Cloud, outlining the business drivers, pain points of the legacy platform, architectural goals, design principles, solution selection, implementation steps, and future outlook for the new big‑data ecosystem.
1. Baicaowei Company Overview
Baicaowei is an omni-channel snack brand that integrates product R&D, production, trade, warehousing, and logistics, serving over 100 million users with more than 1,000 SKUs across nuts, dried fruit, meat snacks, pastries, and candy.
2. Pain Points of IDC Self‑Built Big Data Platform
The on‑premise CDH‑based Hadoop cluster met basic data‑warehouse needs but suffered from long upgrade cycles, high operational cost, complex component management, limited real‑time capabilities, insufficient security controls, and non‑compliance with corporate disaster‑recovery requirements.
3. Migration Goals
Eliminate infrastructure and compliance risks by moving to the cloud.
Simplify the architecture, shorten data pipelines, and improve development and operations (O&M) efficiency.
Improve elasticity, making resource scaling and component upgrades easier.
Introduce a robust permission system to strengthen data security.
4. Architecture Design Principles
Use open‑source core modules for vendor independence.
Adopt fully managed services to offload heavy O&M work.
Employ unified batch-stream processing to reduce pipeline complexity.
Implement storage‑compute separation for better security and scalability.
5. Solution Selection
After comparing AWS and Alibaba Cloud data-lake offerings, the team chose Alibaba Cloud because it provides an open-source-friendly stack, fully managed services, and better alignment with the company's existing Alibaba Cloud workloads.
6. Cloud‑Native Data Lake Architecture
Unified Storage: Object Storage Service (OSS) replaces HDFS for scalable, highly reliable storage.
Data Lake Management: Data Lake Formation (DLF) offers unified metadata, fine‑grained column‑level permissions, and data‑ingestion templates.
Data Format: Delta Lake is used for incremental updates and unified batch-stream processing (a PySpark sketch follows this list).
Compute Engines: Spark (offline and streaming) and Presto (interactive federated queries) run on Alibaba Cloud EMR (a Presto query sketch follows this list).
Development & Scheduling: Zeppelin notebooks, Airflow, and EMR Studio provide end-to-end data development and workflow orchestration (an example Airflow DAG sketch follows this list).
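To make the Delta Lake layer concrete, below is a minimal PySpark sketch of the pattern described above: an initial batch load written to a Delta table on OSS, followed by an incremental upsert via MERGE. The bucket name, table path, and key column are illustrative assumptions, not Baicaowei's actual schema.

```python
# Minimal sketch: batch load to Delta on OSS, then an incremental MERGE upsert.
# Bucket, path, and columns are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .appName("delta-on-oss-demo")
    # On EMR these extensions are typically preconfigured; shown for clarity.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "oss://bcw-datalake/warehouse/orders"  # hypothetical bucket/path

# Initial batch load, written in Delta format directly onto OSS.
orders = spark.createDataFrame(
    [(1, "nuts", 35.0), (2, "dried_fruit", 18.5)],
    ["order_id", "category", "amount"],
)
orders.write.format("delta").mode("overwrite").save(table_path)

# Incremental update: MERGE is the operation that makes Delta Lake suitable
# for CDC-style incremental loads.
updates = spark.createDataFrame(
    [(2, "dried_fruit", 20.0), (3, "pastries", 9.9)],
    ["order_id", "category", "amount"],
)
target = DeltaTable.forPath(spark, table_path)
(
    target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```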
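For the interactive query side, here is a hedged sketch of querying a lake table through Presto from Python. PyHive is one of several clients that speak the Presto protocol; the host name, port, schema, and table are assumptions for illustration.

```python
# Hypothetical interactive query against the EMR Presto endpoint via PyHive.
from pyhive import presto

conn = presto.connect(host="emr-master-1", port=8889, catalog="hive", schema="dw")
cur = conn.cursor()
cur.execute(
    "SELECT category, sum(amount) AS gmv "
    "FROM orders GROUP BY category ORDER BY gmv DESC LIMIT 10"
)
for row in cur.fetchall():
    print(row)  # top categories by GMV
```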
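And for the scheduling layer, here is a hypothetical Airflow DAG that chains two daily Spark jobs. The script paths, schedule, and task names are assumptions; the sketch only illustrates the orchestration style, not the team's actual workflows.

```python
# Hypothetical daily workflow: build the detail layer, then the aggregate layer.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_dw_build",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # run at 02:00 every day
    catchup=False,
) as dag:
    ods_to_dwd = BashOperator(
        task_id="ods_to_dwd",
        bash_command="spark-submit --deploy-mode cluster /jobs/ods_to_dwd.py",
    )
    dwd_to_ads = BashOperator(
        task_id="dwd_to_ads",
        bash_command="spark-submit --deploy-mode cluster /jobs/dwd_to_ads.py",
    )
    ods_to_dwd >> dwd_to_ads  # aggregates run only after the detail layer
```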
7. Migration Implementation
Data migration: JindoFS DistCp copies HDFS files to OSS (a hedged copy sketch follows this list).
Metadata migration: DLF syncs Hive Metastore to cloud‑based metadata service.
Task migration: Existing Zeppelin notebooks are transferred unchanged.
Scheduling migration: EMR Workflow is used initially; Airflow will replace it when available.
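As a sketch of the bulk file copy, the snippet below wraps a DistCp-style invocation in Python. JindoFS DistCp's exact jar name and flags vary by EMR version, so standard Hadoop DistCp arguments are used here as a stand-in; the source and destination paths are hypothetical.

```python
# Sketch of the HDFS-to-OSS bulk copy using standard Hadoop DistCp arguments
# as a stand-in for JindoFS DistCp. Paths and bucket are hypothetical; OSS
# credentials are assumed to be configured in core-site.xml.
import subprocess

SRC = "hdfs://nameservice1/user/hive/warehouse"
DST = "oss://bcw-datalake/warehouse"

result = subprocess.run(
    ["hadoop", "distcp", "-update", "-m", "50", SRC, DST],
    capture_output=True,
    text=True,
)
print(result.stdout)
result.check_returncode()  # fail loudly if the copy did not complete
```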
8. Business Data Ingestion
DLF handles batch and incremental loads from RDS, converting the data to Delta Lake format; Spark Streaming applies binlog changes for near-real-time visibility (a streaming sketch follows), while DLF monitors task health and sends alerts via SMS, email, or DingTalk.
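Below is a hedged Structured Streaming sketch of that near-real-time path, assuming the binlog change events arrive via a Kafka topic fed by a CDC channel. The topic, brokers, event schema, and table path are illustrative; each micro-batch is merged into the Delta table with foreachBatch.

```python
# Sketch: binlog change events (assumed to land in Kafka) merged into a
# Delta table per micro-batch. Topic, brokers, schema, and path are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("binlog-to-delta").getOrCreate()

event_schema = StructType([
    StructField("order_id", LongType()),
    StructField("category", StringType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:9092")
    .option("subscribe", "rds-binlog-orders")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

def upsert_batch(batch_df, batch_id):
    # Merge each micro-batch so both updates and inserts land in the lake table.
    target = DeltaTable.forPath(spark, "oss://bcw-datalake/warehouse/orders")
    (target.alias("t")
     .merge(batch_df.alias("u"), "t.order_id = u.order_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(events.writeStream
 .foreachBatch(upsert_batch)
 .option("checkpointLocation", "oss://bcw-datalake/checkpoints/orders")
 .start()
 .awaitTermination())
```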
9. Future Outlook
The team plans to add multi-engine support (Presto), enhance permission management, and optimize storage costs with OSS lifecycle policies, multi-version control, and hot-cold tiering (a lifecycle-rule sketch follows).
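As a sketch of the planned hot-cold tiering, the snippet below sets an OSS lifecycle rule with the oss2 Python SDK so that objects under a historical-data prefix transition to the Infrequent Access storage class after 90 days. The bucket, endpoint, credentials, prefix, and the 90-day threshold are all assumptions.

```python
# Sketch: OSS lifecycle rule moving aged warehouse data to the IA storage
# class. Bucket, endpoint, credentials, prefix, and threshold are assumptions.
import oss2
from oss2.models import BucketLifecycle, LifecycleRule, StorageTransition

auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "bcw-datalake")

rule = LifecycleRule(
    "cold-tier-warehouse",      # rule id
    "warehouse/history/",       # applies only to objects under this prefix
    status=LifecycleRule.ENABLED,
    storage_transitions=[
        StorageTransition(days=90, storage_class=oss2.BUCKET_STORAGE_CLASS_IA)
    ],
)
bucket.put_bucket_lifecycle(BucketLifecycle([rule]))
```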
10. Conclusion
The case study demonstrates Baicaowei's end‑to‑end migration to a cloud‑native big‑data platform, highlighting technical challenges, architectural decisions, and ongoing optimization efforts.