Case Study: Migrating Baicaowei's On‑Premise Hadoop Data Platform to Alibaba Cloud Native Data Lake
This article details Baicaowei's migration from an IDC‑hosted Hadoop cluster to a cloud‑native data lake on Alibaba Cloud, outlining the business drivers, pain points of the legacy platform, architectural goals, design principles, solution selection, implementation steps, and future outlook for the new big‑data ecosystem.
1. Baicaowei Company Overview
Baicaowei is an omni-channel snack brand that integrates product R&D, production, trade, warehousing, and logistics, serving over 100 million users with more than 1,000 SKUs across nuts, dried fruit, meat snacks, pastries, and candy.
2. Pain Points of IDC Self‑Built Big Data Platform
The on‑premise CDH‑based Hadoop cluster met basic data‑warehouse needs but suffered from long upgrade cycles, high operational cost, complex component management, limited real‑time capabilities, insufficient security controls, and non‑compliance with corporate disaster‑recovery requirements.
3. Migration Goals
Eliminate infrastructure and compliance risks by moving to the cloud.
Simplify the architecture, shorten data pipelines, and improve development and operations (O&M) efficiency.
Improve elasticity, making resource scaling and component upgrades easier.
Introduce a robust permission system to strengthen data security.
4. Architecture Design Principles
Use open‑source core modules for vendor independence.
Adopt fully managed services to offload heavy O&M work.
Employ unified batch-stream processing to reduce pipeline complexity.
Implement storage‑compute separation for better security and scalability.
5. Solution Selection
After comparing AWS and Alibaba Cloud data-lake offerings, the team chose Alibaba Cloud because it provides an open-source-friendly stack, fully managed services, and better alignment with the company's existing Alibaba Cloud workloads.
6. Cloud‑Native Data Lake Architecture
Unified Storage: Object Storage Service (OSS) replaces HDFS for scalable, highly reliable storage.
Data Lake Management: Data Lake Formation (DLF) offers unified metadata, fine‑grained column‑level permissions, and data‑ingestion templates.
Data Format: Delta Lake is used for incremental updates and unified batch-stream processing (a PySpark sketch follows this list).
Compute Engines: Spark (offline and streaming) and Presto (interactive federated queries) run on Alibaba Cloud EMR (a Presto query sketch follows this list).
Development & Scheduling: Zeppelin notebooks, Airflow, and EMR Studio provide end-to-end data development and workflow orchestration (an example Airflow DAG sketch follows this list).
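To make the Delta Lake layer concrete, below is a minimal PySpark sketch of the pattern described above: an initial batch load written to a Delta table on OSS, followed by an incremental upsert via MERGE. The bucket name, table path, and key column are illustrative assumptions, not Baicaowei's actual schema.

```python
# Minimal sketch: batch load to Delta on OSS, then an incremental MERGE upsert.
# Bucket, path, and columns are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .appName("delta-on-oss-demo")
    # On EMR these extensions are typically preconfigured; shown for clarity.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "oss://bcw-datalake/warehouse/orders"  # hypothetical bucket/path

# Initial batch load, written in Delta format directly onto OSS.
orders = spark.createDataFrame(
    [(1, "nuts", 35.0), (2, "dried_fruit", 18.5)],
    ["order_id", "category", "amount"],
)
orders.write.format("delta").mode("overwrite").save(table_path)

# Incremental update: MERGE is the operation that makes Delta Lake suitable
# for CDC-style incremental loads.
updates = spark.createDataFrame(
    [(2, "dried_fruit", 20.0), (3, "pastries", 9.9)],
    ["order_id", "category", "amount"],
)
target = DeltaTable.forPath(spark, table_path)
(
    target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```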
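For the interactive query side, here is a hedged sketch of querying a lake table through Presto from Python. PyHive is one of several clients that speak the Presto protocol; the host name, port, schema, and table are assumptions for illustration.

```python
# Hypothetical interactive query against the EMR Presto endpoint via PyHive.
from pyhive import presto

conn = presto.connect(host="emr-master-1", port=8889, catalog="hive", schema="dw")
cur = conn.cursor()
cur.execute(
    "SELECT category, sum(amount) AS gmv "
    "FROM orders GROUP BY category ORDER BY gmv DESC LIMIT 10"
)
for row in cur.fetchall():
    print(row)  # top categories by GMV
```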
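And for the scheduling layer, here is a hypothetical Airflow DAG that chains two daily Spark jobs. The script paths, schedule, and task names are assumptions; the sketch only illustrates the orchestration style, not the team's actual workflows.

```python
# Hypothetical daily workflow: build the detail layer, then the aggregate layer.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_dw_build",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # run at 02:00 every day
    catchup=False,
) as dag:
    ods_to_dwd = BashOperator(
        task_id="ods_to_dwd",
        bash_command="spark-submit --deploy-mode cluster /jobs/ods_to_dwd.py",
    )
    dwd_to_ads = BashOperator(
        task_id="dwd_to_ads",
        bash_command="spark-submit --deploy-mode cluster /jobs/dwd_to_ads.py",
    )
    ods_to_dwd >> dwd_to_ads  # aggregates run only after the detail layer
```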
7. Migration Implementation
Data migration: JindoFS DistCp copies HDFS files to OSS (a hedged copy sketch follows this list).
Metadata migration: DLF syncs Hive Metastore to cloud‑based metadata service.
Task migration: Existing Zeppelin notebooks are transferred unchanged.
Scheduling migration: EMR Workflow is used initially; Airflow will replace it when available.
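As a sketch of the bulk file copy, the snippet below wraps a DistCp-style invocation in Python. JindoFS DistCp's exact jar name and flags vary by EMR version, so standard Hadoop DistCp arguments are used here as a stand-in; the source and destination paths are hypothetical.

```python
# Sketch of the HDFS-to-OSS bulk copy using standard Hadoop DistCp arguments
# as a stand-in for JindoFS DistCp. Paths and bucket are hypothetical; OSS
# credentials are assumed to be configured in core-site.xml.
import subprocess

SRC = "hdfs://nameservice1/user/hive/warehouse"
DST = "oss://bcw-datalake/warehouse"

result = subprocess.run(
    ["hadoop", "distcp", "-update", "-m", "50", SRC, DST],
    capture_output=True,
    text=True,
)
print(result.stdout)
result.check_returncode()  # fail loudly if the copy did not complete
```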
8. Business Data Ingestion
DLF handles batch and incremental loads from RDS, converting the data to Delta Lake format; Spark Streaming applies binlog changes for near-real-time visibility (a streaming sketch follows), while DLF monitors task health and sends alerts via SMS, email, or DingTalk.
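Below is a hedged Structured Streaming sketch of that near-real-time path, assuming the binlog change events arrive via a Kafka topic fed by a CDC channel. The topic, brokers, event schema, and table path are illustrative; each micro-batch is merged into the Delta table with foreachBatch.

```python
# Sketch: binlog change events (assumed to land in Kafka) merged into a
# Delta table per micro-batch. Topic, brokers, schema, and path are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("binlog-to-delta").getOrCreate()

event_schema = StructType([
    StructField("order_id", LongType()),
    StructField("category", StringType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:9092")
    .option("subscribe", "rds-binlog-orders")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

def upsert_batch(batch_df, batch_id):
    # Merge each micro-batch so both updates and inserts land in the lake table.
    target = DeltaTable.forPath(spark, "oss://bcw-datalake/warehouse/orders")
    (target.alias("t")
     .merge(batch_df.alias("u"), "t.order_id = u.order_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(events.writeStream
 .foreachBatch(upsert_batch)
 .option("checkpointLocation", "oss://bcw-datalake/checkpoints/orders")
 .start()
 .awaitTermination())
```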
9. Future Outlook
The team plans to add multi-engine support (Presto), enhance permission management, and optimize storage costs with OSS lifecycle policies, multi-version control, and hot-cold tiering (a lifecycle-rule sketch follows).
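As a sketch of the planned hot-cold tiering, the snippet below sets an OSS lifecycle rule with the oss2 Python SDK so that objects under a historical-data prefix transition to the Infrequent Access storage class after 90 days. The bucket, endpoint, credentials, prefix, and the 90-day threshold are all assumptions.

```python
# Sketch: OSS lifecycle rule moving aged warehouse data to the IA storage
# class. Bucket, endpoint, credentials, prefix, and threshold are assumptions.
import oss2
from oss2.models import BucketLifecycle, LifecycleRule, StorageTransition

auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "bcw-datalake")

rule = LifecycleRule(
    "cold-tier-warehouse",      # rule id
    "warehouse/history/",       # applies only to objects under this prefix
    status=LifecycleRule.ENABLED,
    storage_transitions=[
        StorageTransition(days=90, storage_class=oss2.BUCKET_STORAGE_CLASS_IA)
    ],
)
bucket.put_bucket_lifecycle(BucketLifecycle([rule]))
```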
10. Conclusion
The case study demonstrates Baicaowei's end‑to‑end migration to a cloud‑native big‑data platform, highlighting technical challenges, architectural decisions, and ongoing optimization efforts.