Unlocking Big Data with Alibaba Cloud’s Native Data Lake Solution
Alibaba Cloud’s cloud‑native data lake analysis solution combines fully managed storage (OSS‑HDFS), a one‑stop lake management platform (Data Lake Formation), and multimodal compute capabilities, delivering high performance, massive scalability, and low cost for big‑data and AI workloads across offline, real‑time, and lake‑house scenarios.
This article compiles insights from Alibaba Cloud senior experts Zheng Kai and Fan Zhen, presented at the Alibaba Cloud EMR 2.0 online launch.
Three Core Elements of Alibaba Cloud’s Cloud‑Native Data Lake Solution
1. Fully Managed Lake Storage (OSS‑HDFS)
OSS‑HDFS is the third‑generation data lake storage that fully supports HDFS and POSIX protocols, enabling seamless integration with big‑data and AI ecosystems.
Performance : atomic rename and delete operations complete in milliseconds, and directory size queries return almost instantly.
Scale : supports billions of hot, warm, and cold files; OSS bandwidth scales horizontally.
Cost : tiered storage (for example, a mix of 30 % standard, 30 % infrequent‑access, and 40 % archive) reduces overall storage expense.
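To make the cost point concrete, here is a minimal sketch of how a blended storage price could be estimated for the 30/30/40 mix quoted above. The per-tier prices are hypothetical relative units chosen only for illustration; they are not Alibaba Cloud list prices, and real bills also depend on retrieval and request fees that this sketch ignores.

```python
# Hypothetical relative prices per GB-month (standard = 1.00).
# These are illustrative assumptions, NOT actual OSS pricing.
PRICE_PER_GB = {"standard": 1.00, "infrequent": 0.60, "archive": 0.25}

# Storage mix quoted in the article.
MIX = {"standard": 0.30, "infrequent": 0.30, "archive": 0.40}


def blended_price(mix, prices):
    """Return the weighted average price per GB for a tier mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(share * prices[tier] for tier, share in mix.items())


blended = blended_price(MIX, PRICE_PER_GB)
# Saving relative to keeping everything in the standard tier.
savings = 1 - blended / PRICE_PER_GB["standard"]

print(f"blended price: {blended:.2f} (relative units)")
print(f"saving vs. all-standard: {savings:.0%}")
```

Under these assumed prices the blended cost works out to 0.58 of the all-standard cost; with different price ratios the saving changes accordingly.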
The service offers a “three‑no” migration path—no code changes, no path changes, no downtime—and fast import from existing OSS data.
2. One‑Stop Lake Management (Data Lake Formation)
Data Lake Formation (DLF) provides a fully managed, high‑availability, and scalable lake‑management platform.
Unified Metadata Service : compatible with open‑source HMS, supports multi‑catalog and version control.
Fine‑Grained Permissions : column‑level access control, Apache Ranger integration, audit logging.
Cost Optimization : intelligent hot/warm/cold data identification, automated tiered storage policies, lifecycle management.
Storage Access Acceleration : read/write/meta‑data caching, full‑scene acceleration for big‑data analytics and AI training.
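The hot/warm/cold identification described under Cost Optimization can be pictured as a rule that maps each object's last access time to a storage tier. The sketch below is a simplified illustration, not DLF's actual algorithm: the 30-day and 180-day thresholds, the function name, and the tier mapping are all assumptions.

```python
from datetime import date

# Hypothetical thresholds; a real lifecycle policy would be configurable.
HOT_MAX_DAYS = 30
WARM_MAX_DAYS = 180

# Assumed mapping from data temperature to storage tier.
TIER_FOR = {"hot": "standard", "warm": "infrequent", "cold": "archive"}


def classify(last_access: date, today: date) -> str:
    """Label data as hot, warm, or cold by days since last access."""
    age_days = (today - last_access).days
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"


today = date(2024, 1, 15)
for last_access in (date(2024, 1, 1), date(2023, 10, 1), date(2023, 1, 1)):
    temperature = classify(last_access, today)
    print(last_access, temperature, "->", TIER_FOR[temperature])
```

A production policy would typically also weigh access frequency and object size, since moving small or frequently re-read files to colder tiers can cost more than it saves.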
3. Multimodal Lake Compute
EMR 2.0 adopts a “one lake, multiple architectures” model, supporting three typical scenarios plus a serverless delivery option:
Offline Lake
Uses OSS‑HDFS for hierarchical storage and DLF for metadata management.
Leverages engines such as Spark (with Alibaba‑optimized performance and Remote Shuffle Service) together with the “three swords” of lake table formats: Delta Lake, Hudi, and Iceberg.
Real‑Time Lake
Combines OSS‑HDFS, DLF, and incremental storage engines (Delta Lake, Hudi, Iceberg) to provide ACID guarantees and versioned commits.
Supports Flink for stream processing and Presto/Trino for low‑latency ad‑hoc queries.
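The idea behind ACID guarantees with versioned commits can be sketched as an append-only list of immutable snapshots, where a writer may commit only if the table has not changed since the version it read. The toy model below is an assumption-laden illustration of that general pattern (optimistic concurrency); it is not how Delta Lake, Hudi, or Iceberg are implemented, and the class and file names are hypothetical.

```python
class CommitConflict(Exception):
    """Raised when a writer commits against a stale table version."""


class VersionedTable:
    """Toy in-memory table: each version is an immutable set of file names."""

    def __init__(self):
        self._snapshots = [frozenset()]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._snapshots) - 1

    def snapshot(self, version=None):
        """Return the file set at a given version (latest by default)."""
        v = self.version if version is None else version
        return self._snapshots[v]

    def commit(self, base_version, added=(), removed=()):
        """Atomically publish a new version, or fail if the base is stale."""
        if base_version != self.version:
            raise CommitConflict(
                f"read version {base_version}, table is at {self.version}"
            )
        current = self._snapshots[-1]
        self._snapshots.append((current - frozenset(removed)) | frozenset(added))
        return self.version


table = VersionedTable()
v1 = table.commit(0, added=["part-a.parquet"])
v2 = table.commit(v1, added=["part-b.parquet"], removed=["part-a.parquet"])
print(sorted(table.snapshot(v1)))  # earlier version stays readable
print(sorted(table.snapshot()))    # latest version
```

Because earlier snapshots are never mutated, readers can keep querying an old version while writers publish new ones, which is what makes time travel and incremental reads possible in this kind of design.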
Lakehouse Analysis
Integrates with OLAP engines like StarRocks, Doris, and ClickHouse for sub‑second queries.
StarRocks offers a unified lakehouse architecture with native support for Delta Lake, Hudi, and Iceberg, enabling both warehouse and lake queries with near‑identical performance.
Serverless
Serverless StarRocks (with Serverless Spark and Presto/Trino to follow) separates compute from storage and scales on demand, which keeps costs low; it is delivered as a fully managed service with a 99.9 % SLA.
Overall, Alibaba Cloud’s cloud‑native data lake analysis solution delivers high performance, massive scalability, and low cost for big‑data and AI workloads, covering offline, real‑time, and lakehouse use cases.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
