
Unlocking Big Data with Alibaba Cloud’s Native Data Lake Solution

Alibaba Cloud’s cloud‑native data lake analysis solution combines fully managed storage (OSS‑HDFS), a one‑stop lake management platform (Data Lake Formation), and multimodal compute capabilities, delivering high performance, massive scalability, and low cost for big‑data and AI workloads across offline, real‑time, and lake‑house scenarios.

Alibaba Cloud Big Data AI Platform

This article compiles insights from Alibaba Cloud senior experts Zheng Kai and Fan Zhen, presented at the Alibaba Cloud EMR 2.0 online launch.

Three Core Elements of Alibaba Cloud’s Cloud‑Native Data Lake Solution

1. Fully Managed Lake Storage (OSS‑HDFS)

OSS‑HDFS is Alibaba Cloud’s third‑generation data lake storage. It fully supports the HDFS and POSIX interfaces, so it integrates with big‑data and AI ecosystems without adaptation work.

OSS‑HDFS ecosystem

Performance: rename and delete operations are atomic and complete in milliseconds; directory size queries return immediately.

Scale: supports billions of hot, warm, and cold files; OSS bandwidth scales horizontally.

Cost: tiered storage (30% standard + 30% infrequent access + 40% archive) reduces overall expense.
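The effect of the 30/30/40 split on the storage bill can be estimated as a weighted average. The sketch below is only an illustration: the relative prices per tier are hypothetical placeholders (standard = 1.0), not Alibaba Cloud list prices, and the function name is made up for this example.

```python
def blended_price(shares, prices):
    """Weighted average price per unit of storage across tiers."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(shares[tier] * prices[tier] for tier in shares)


# Tier split quoted in the article.
SHARES = {"standard": 0.30, "infrequent": 0.30, "archive": 0.40}

# ASSUMPTION: hypothetical relative prices, with standard storage as 1.0.
RELATIVE_PRICES = {"standard": 1.0, "infrequent": 0.6, "archive": 0.25}

blended = blended_price(SHARES, RELATIVE_PRICES)
savings = 1.0 - blended
print(f"blended relative price: {blended:.2f}")
print(f"savings vs. all-standard: {savings:.0%}")
```

With real per‑tier prices substituted in, the same calculation gives the expected saving for any tier mix. Retrieval and request fees are ignored here and would need to be added for a realistic estimate.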

The service offers a “three‑no” migration path (no code changes, no path changes, no downtime) and fast import of existing OSS data.

2. One‑Stop Lake Management (Data Lake Formation)

Data Lake Formation (DLF) provides a fully managed, highly available, and scalable lake‑management platform.

Data Lake Formation UI

Unified Metadata Service: compatible with the open‑source Hive Metastore (HMS); supports multiple catalogs and version control.

Fine‑Grained Permissions: column‑level access control, Apache Ranger integration, and audit logging.

Cost Optimization: intelligent identification of hot, warm, and cold data, automated tiered‑storage policies, and lifecycle management.

Storage Access Acceleration: read, write, and metadata caching that accelerates both big‑data analytics and AI training.
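The hot/warm/cold identification under Cost Optimization can be pictured as a rule over how long data has gone unread. The sketch below is a simplified illustration, not DLF’s actual logic: the thresholds (30 and 180 days), the table names, and the function names are all hypothetical.

```python
from datetime import date

# ASSUMPTION: hypothetical thresholds, in days since last access.
WARM_AFTER_DAYS = 30
COLD_AFTER_DAYS = 180


def classify(last_access: date, today: date) -> str:
    """Map a last-access date to a target storage tier."""
    idle_days = (today - last_access).days
    if idle_days >= COLD_AFTER_DAYS:
        return "archive"
    if idle_days >= WARM_AFTER_DAYS:
        return "infrequent"
    return "standard"


def plan_tiering(last_access_by_table: dict, today: date) -> dict:
    """Return the target tier for every table in the inventory."""
    return {
        name: classify(last_access, today)
        for name, last_access in last_access_by_table.items()
    }


# Hypothetical inventory: table name -> date of last read.
inventory = {
    "orders_2021": date(2023, 1, 1),
    "clickstream": date(2023, 11, 15),
    "sessions": date(2023, 12, 30),
}
plan = plan_tiering(inventory, today=date(2024, 1, 1))
for table, tier in plan.items():
    print(f"{table}: {tier}")
```

A managed service would derive last‑access information from access statistics and apply the resulting plan through lifecycle rules; here the inventory is simply hard‑coded.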

3. Multimodal Lake Compute

EMR 2.0 adopts a “one lake, multiple architectures” model, supporting three typical scenarios:

Offline Lake

Uses OSS‑HDFS for hierarchical storage and DLF for metadata management.

Leverages engines such as Spark (with Alibaba Cloud performance optimizations and Remote Shuffle Service) together with the “three swords” of lake table formats: Delta Lake, Hudi, and Iceberg.

Real‑Time Lake

Combines OSS‑HDFS, DLF, and incremental storage engines (Delta Lake, Hudi, Iceberg) to provide ACID guarantees and versioned commits.

Supports Flink for stream processing and Presto/Trino for low‑latency ad‑hoc queries.
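To make “ACID guarantees and versioned commits” more concrete, the following is a toy in‑memory model of a versioned commit log with optimistic concurrency. It is not how Delta Lake, Hudi, or Iceberg are implemented (each has its own log format and finer‑grained conflict detection); the class, method, and file names are hypothetical.

```python
class CommitConflict(Exception):
    """Raised when a writer commits against a stale table version."""


class TableLog:
    """Toy append-only commit log; each commit adds and removes data files."""

    def __init__(self):
        self._commits = []  # list of (added_files, removed_files)

    @property
    def version(self):
        return len(self._commits)

    def commit(self, read_version, added=(), removed=()):
        # Optimistic concurrency: the commit succeeds only if no other
        # writer has committed since this writer read the table.
        if read_version != self.version:
            raise CommitConflict(
                f"read version {read_version}, table is at {self.version}"
            )
        self._commits.append((frozenset(added), frozenset(removed)))
        return self.version

    def snapshot(self, version=None):
        """Replay the log up to `version` to get that version's file set."""
        version = self.version if version is None else version
        if not 0 <= version <= self.version:
            raise ValueError(f"unknown version {version}")
        files = set()
        for added, removed in self._commits[:version]:
            files -= removed
            files |= added
        return files


log = TableLog()
v1 = log.commit(0, added={"part-0.parquet"})
v2 = log.commit(v1, added={"part-1.parquet"}, removed={"part-0.parquet"})
print(sorted(log.snapshot(v1)))  # files visible at version 1
print(sorted(log.snapshot()))    # files visible at the latest version
```

Readers always see a complete version, never a half‑applied commit, and older versions remain readable by replaying a prefix of the log. A writer that started from version 1 and tries to commit after version 2 exists gets a `CommitConflict` and must retry.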

Lakehouse Analysis

Integrates with OLAP engines like StarRocks, Doris, and ClickHouse for sub‑second queries.

StarRocks offers a unified lakehouse architecture with native support for Delta Lake, Hudi, and Iceberg, enabling both warehouse and lake queries with near‑identical performance.

Serverless

Serverless StarRocks (with Serverless Spark and Presto/Trino to follow) separates compute from storage and scales on demand, delivering a fully managed, cost‑efficient experience with a 99.9% SLA.
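As a rough guide to what a 99.9% SLA permits, the allowed downtime can be computed directly. This sketch assumes a 30‑day measurement window and counts all downtime equally; the actual SLA terms define their own window and exclusions, and the function name is made up for this example.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Maximum downtime, in minutes, that still meets the SLA over `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# ASSUMPTION: 30-day window; check the published SLA terms for the real one.
budget = allowed_downtime_minutes(99.9)
print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
```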

Overall, Alibaba Cloud’s cloud‑native data lake analysis solution delivers high performance, massive scalability, and low cost for big‑data and AI workloads, covering offline, real‑time, and lakehouse use cases.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
