
How Alibaba’s Cloud‑Native Data Lake Solves Big Data Challenges

Alibaba Cloud’s Data Lake Analytics (DLA) tackles the growing complexity of data scenarios by offering cloud‑native, serverless solutions for data lake management, massive metadata construction, and high‑performance Spark and Presto engines, while addressing challenges such as high entry barriers, stability, and multi‑tenant isolation.

Alibaba Cloud Developer

Oct 25, 2020


1 Opportunities and Challenges of Data Lakes

Data lakes help enterprises cope with increasingly diverse data scenarios, complex data structures, and varied processing needs. According to a 2020 Gartner report, 39% of users already use a data lake and 34% plan to adopt one within a year.

Since 2018, Alibaba Cloud has built a cloud‑native Data Lake Analytics (DLA) product that combines OSS storage with elastic, pay‑as‑you‑go services, offering data lake management, Serverless Spark, and Serverless SQL to unlock data value.

2 How to Manage and Build a Data Lake

Key challenges are efficient metadata construction on OSS and fast ingestion of non‑OSS data.

2.1 Massive File Metadata Auto‑Construction

Files stored on OSS pose three problems for metadata construction:

Rich formats: CSV, JSON, Parquet, ORC, Avro, Hudi, and Delta Lake, some with custom delimiters.

Massive file counts: the number of files can reach the millions.

Dynamic uploads: files are uploaded continuously, so metadata must be updated incrementally.

DLA implements an automatic metadata construction technology that identifies tables and partitions at the "ten‑thousand tables, ten‑thousand partitions" scale and keeps their metadata up to date incrementally.
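The idea of discovering tables and partitions from object keys, and merging only what is new, can be sketched as follows. This is a minimal illustration, not DLA's implementation: it assumes a Hive-style layout in which directories named `column=value` are partitions, and the names `split_key`, `update_catalog`, and `Catalog` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical catalog shape: table path -> set of partition specs.
PartitionSpec = Tuple[Tuple[str, str], ...]
Catalog = Dict[str, Set[PartitionSpec]]


def split_key(key: str) -> Tuple[str, PartitionSpec]:
    """Split an object key into (table path, partition spec).

    Assumption: directories named `column=value` are partition
    directories, and everything before the first one is the table path.
    """
    dirs = key.split("/")[:-1]  # drop the file name
    table_dirs = []
    partition = []
    for d in dirs:
        if "=" in d:
            col, _, val = d.partition("=")
            partition.append((col, val))
        elif not partition:
            table_dirs.append(d)
    return "/".join(table_dirs), tuple(partition)


def update_catalog(catalog: Catalog, new_keys: Iterable[str]) -> Catalog:
    """Merge newly uploaded keys into the catalog; return only the additions."""
    added: Catalog = defaultdict(set)
    for key in new_keys:
        table, partition = split_key(key)
        known = catalog.setdefault(table, set())
        if partition not in known:
            known.add(partition)
            added[table].add(partition)
    return dict(added)


catalog: Catalog = {}
update_catalog(catalog, ["logs/orders/dt=2020-10-01/part-0.parquet"])
# Only the new partition is reported on the second, incremental pass.
delta = update_catalog(catalog, ["logs/orders/dt=2020-10-02/part-0.parquet"])
```

Because each pass touches only the newly uploaded keys, the cost of an update scales with the size of the upload rather than with the total number of files already in the bucket.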

2.2 In‑Lake Ingestion Techniques

DLA supports three ingestion modes: Mirror, Partition, and Incremental, enabling one‑click data sync from databases and message logs to OSS, supporting hot‑cold data separation and real‑time analytics.

3 Cloud‑Native Data Lake Platform Infrastructure

DLA uses a multi‑tenant architecture deployed per Region. Virtual clusters isolate tenants from one another, and Serverless Spark and Serverless SQL run on top of this shared infrastructure.

3.1 Efficient Resource Supply

Integrated with ECS, ACK, and ECI, DLA can scale by 300 nodes per minute and guarantee up to 50,000 compute nodes per large Region.

3.2 Security Protection

Per‑job temporary tokens that expire on suspicious activity.

DDoS and injection detection with automatic port blocking.

Isolated security containers for compute nodes.

Network isolation via ENI and VPC configuration.

3.3 High‑Throughput Network

OSS access via high‑throughput bandwidth.

ENI‑based VPC access provides intra‑VPC bandwidth.

4 Serverless Spark Technical Challenges

Traditional Spark on data lakes suffers from poor OSS performance, complex debugging, and high operational cost.

4.1 Spark‑OSS Performance Optimizations

DLA implements a FileOutputFormat based on OSS MultipartUpload that writes directly to the final output directory, eliminating writes to a temporary directory and reducing metadata operations. This yields up to 62% faster execution on a 1 TB TeraSort run and a 124% improvement on dynamic‑partition workloads.
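To see why a multipart-based commit saves requests, consider the toy model below. It is a sketch under stated assumptions, not DLA's committer: `FakeObjectStore` is a hypothetical in-memory stand-in rather than the OSS SDK, the rename path is loosely modeled on a commit that moves output through two temporary directories, and rename is assumed to cost a copy plus a delete, as is typical for object stores.

```python
from typing import Dict, List


class FakeObjectStore:
    """Hypothetical in-memory object store that counts requests."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}        # objects visible to readers
        self.pending: Dict[str, List[bytes]] = {}  # uploaded, uncommitted parts
        self.ops = 0                               # number of store requests

    def put(self, key: str, data: bytes) -> None:
        self.ops += 1
        self.objects[key] = data

    def rename(self, src: str, dst: str) -> None:
        # Assumption: no native rename, so it costs a copy and a delete.
        self.ops += 2
        self.objects[dst] = self.objects.pop(src)

    def upload_part(self, key: str, data: bytes) -> None:
        self.ops += 1
        self.pending.setdefault(key, []).append(data)

    def complete_upload(self, key: str) -> None:
        self.ops += 1
        self.objects[key] = b"".join(self.pending.pop(key))

    def abort_upload(self, key: str) -> None:
        self.ops += 1
        self.pending.pop(key, None)


def commit_by_rename(store: FakeObjectStore, final_key: str, data: bytes) -> None:
    """Write to a task temp path, then rename twice to reach the final path."""
    task_tmp = f"_temporary/task/{final_key}"
    job_tmp = f"_temporary/job/{final_key}"
    store.put(task_tmp, data)
    store.rename(task_tmp, job_tmp)   # task commit
    store.rename(job_tmp, final_key)  # job commit


def commit_by_multipart(store: FakeObjectStore, final_key: str,
                        parts: List[bytes]) -> None:
    """Upload parts against the final key; they stay invisible until completed."""
    for part in parts:
        store.upload_part(final_key, part)
    store.complete_upload(final_key)  # the object becomes visible here


renamed, direct = FakeObjectStore(), FakeObjectStore()
commit_by_rename(renamed, "out/part-0", b"abcd")
commit_by_multipart(direct, "out/part-0", [b"ab", b"cd"])
print(renamed.ops, direct.ops)
```

Both stores end up with the same object, but the multipart path never copies data after writing it, and a failed task can call `abort_upload` without leaving partial output visible. The request counts in this model only illustrate the mechanism and are unrelated to the benchmark figures quoted above.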

4.2 OSS Metadata Cache

DLA caches OSS object metadata locally after the first access, which cuts ResolveRelation time roughly in half and improves overall query performance by about 60%.
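A cache of this kind can be sketched as a thin wrapper around whatever call fetches object metadata. The example below is illustrative only: `ObjectMeta`, `MetadataCache`, and the injected `fetch` callable are hypothetical names, and the `invalidate` method is an assumption about how stale entries might be handled, which the source does not describe.

```python
from typing import Callable, Dict, NamedTuple


class ObjectMeta(NamedTuple):
    size: int
    last_modified: float


class MetadataCache:
    """Remember object metadata after the first remote lookup."""

    def __init__(self, fetch: Callable[[str], ObjectMeta]) -> None:
        self._fetch = fetch                      # remote lookup, injected
        self._entries: Dict[str, ObjectMeta] = {}
        self.remote_calls = 0                    # how often we hit the store

    def get(self, key: str) -> ObjectMeta:
        meta = self._entries.get(key)
        if meta is None:
            self.remote_calls += 1
            meta = self._fetch(key)
            self._entries[key] = meta
        return meta

    def invalidate(self, key: str) -> None:
        # Assumption: callers drop an entry when they know the object changed.
        self._entries.pop(key, None)


def fake_fetch(key: str) -> ObjectMeta:
    # Stand-in for a remote metadata request.
    return ObjectMeta(size=len(key), last_modified=0.0)


cache = MetadataCache(fake_fetch)
cache.get("warehouse/t1/part-0.parquet")
cache.get("warehouse/t1/part-0.parquet")  # served locally, no remote call
```

Query planning tends to look up the same files repeatedly, so serving the second and later lookups from memory removes a remote round trip each time.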

4.3 Multi‑Tenant UI Service

DLA’s Spark UI no longer depends on the Spark Eventlog and stores only lightweight UI metadata in OSS. The UI server is stateless, so it scales horizontally; it uses tokens to isolate tenants from one another and rotates logs automatically for long‑running jobs.

5 Serverless SQL Technical Challenges

DLA Serverless SQL builds on PrestoDB, adding multi‑tenant isolation and a multi‑Coordinator architecture to achieve high availability and seamless upgrades.

5.1 Multi‑Tenant Isolation

A ResourceManager collects per‑tenant resource usage from Coordinators, compares it to thresholds, and instructs Workers to penalize over‑consuming tenants, achieving isolation within ~1.3 seconds.
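The control loop described above reduces to two steps: aggregate the per-tenant usage reported by every Coordinator, then compare the totals against thresholds. The sketch below shows only that decision logic. The function names, the usage unit, and the rule that a tenant without a configured threshold is never penalized are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def aggregate_usage(reports: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Sum the per-tenant usage reported by each Coordinator."""
    total: Dict[str, float] = {}
    for report in reports:
        for tenant, used in report.items():
            total[tenant] = total.get(tenant, 0.0) + used
    return total


def tenants_to_penalize(usage: Dict[str, float],
                        thresholds: Dict[str, float]) -> List[str]:
    """Return tenants whose total usage exceeds their threshold."""
    # Assumption: a tenant with no configured threshold is never penalized.
    return sorted(
        tenant for tenant, used in usage.items()
        if used > thresholds.get(tenant, float("inf"))
    )


# Two Coordinators report usage; tenant "a" exceeds its threshold of 5.0.
reports = [{"a": 3.0, "b": 1.0}, {"a": 4.0}]
usage = aggregate_usage(reports)
penalized = tenants_to_penalize(usage, {"a": 5.0, "b": 2.0})
```

In a real deployment the resulting list would be pushed to Workers, which would then throttle or deprioritize those tenants' tasks. That step is omitted here because the source does not describe the penalty mechanism.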

5.2 Multi‑Coordinator Architecture

A FrontNode layer routes queries to multiple Coordinators using round‑robin. ZooKeeper elects a master Coordinator for global tasks, enabling automatic failover and zero‑downtime upgrades.
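Round-robin routing that skips unavailable Coordinators can be sketched in a few lines. This is a simplified model, not DLA's FrontNode: the class shape and the `mark_down` and `mark_up` methods are hypothetical, and master election through ZooKeeper is deliberately left out.

```python
from typing import List, Set


class FrontNode:
    """Route each query to the next healthy Coordinator in turn."""

    def __init__(self, coordinators: List[str]) -> None:
        if not coordinators:
            raise ValueError("at least one coordinator is required")
        self._coordinators = list(coordinators)
        self._down: Set[str] = set()
        self._next = 0  # index of the next candidate

    def mark_down(self, name: str) -> None:
        self._down.add(name)

    def mark_up(self, name: str) -> None:
        self._down.discard(name)

    def route(self) -> str:
        # Try each Coordinator at most once, starting from the cursor.
        for _ in range(len(self._coordinators)):
            candidate = self._coordinators[self._next]
            self._next = (self._next + 1) % len(self._coordinators)
            if candidate not in self._down:
                return candidate
        raise RuntimeError("no coordinator available")


front = FrontNode(["c1", "c2", "c3"])
first_pass = [front.route() for _ in range(3)]
front.mark_down("c2")  # e.g. c2 is being upgraded
during_upgrade = [front.route() for _ in range(3)]
```

Taking one Coordinator out of rotation at a time, upgrading it, and marking it up again is one way such a layer could support upgrades without dropping incoming queries.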

6 End‑to‑End Best Practices

DLA provides a complete end‑to‑end solution:

Unified metadata service with permission control.

One‑click metadata crawling for OSS files.

Seamless data sync from RDS, PolarDB, and MongoDB.

Hudi streaming with T+10 min latency.

Serverless SQL and Spark with on‑demand scaling (up to 500 nodes per minute).

Up to 10× cost‑performance improvement over traditional stacks.
