How Alibaba’s Cloud‑Native Data Lake Solves Big Data Challenges
Alibaba Cloud’s Data Lake Analytics (DLA) addresses increasingly complex data scenarios with a cloud‑native, serverless offering that covers data lake management, metadata construction over massive file sets, and high‑performance Spark and Presto engines. It also tackles the challenges of high entry barriers, stability, and multi‑tenant isolation.
1 Opportunities and Challenges of Data Lakes
Data lakes help enterprises handle increasingly diverse data scenarios, complex data structures, and varied processing needs. According to a 2020 Gartner report, 39% of users already use data lakes and 34% plan to adopt one within a year.
Since 2018, Alibaba Cloud has built a cloud‑native Data Lake Analytics (DLA) product that combines OSS storage with elastic, pay‑as‑you‑go services, offering data lake management, Serverless Spark, and Serverless SQL to unlock data value.
2 How to Manage and Build a Data Lake
The two key challenges are building metadata efficiently for files on OSS and quickly ingesting non‑OSS data into the lake.
2.1 Massive File Metadata Auto‑Construction
Building metadata automatically for files on OSS has to cope with three difficulties:
Many formats: CSV (including custom delimiters), JSON, Parquet, ORC, Avro, Hudi, and Delta Lake.
Scale: the number of files can reach the millions.
Continuous uploads: new files arrive all the time, so metadata must be updated incrementally.
DLA implements automatic metadata construction that identifies tables and partitions at the "ten‑thousand tables and partitions" scale and keeps them up to date incrementally.
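As an illustration of the idea, here is a minimal Python sketch of incremental table and partition discovery from object keys. It is not DLA's implementation: it assumes a Hive‑style key layout (`<table path>/<column>=<value>/.../<file>`), and `parse_key` and `Catalog` are hypothetical names introduced only for this example.

```python
from collections import defaultdict


def parse_key(key):
    """Split an object key into (table_path, partition_dict, file_name).

    Assumption: Hive-style layout, <table path>/<col>=<value>/.../<file>.
    """
    parts = key.split("/")
    file_name = parts[-1]
    table_parts, partition = [], {}
    for segment in parts[:-1]:
        if "=" in segment:
            col, value = segment.split("=", 1)
            partition[col] = value
        else:
            table_parts.append(segment)
    return "/".join(table_parts), partition, file_name


class Catalog:
    """In-memory stand-in for a metadata catalog (hypothetical)."""

    def __init__(self):
        # table path -> set of partition tuples, e.g. (("dt", "2020-01-01"),)
        self.tables = defaultdict(set)
        self.seen = set()  # object keys that were already processed

    def update(self, keys):
        """Process only unseen keys; return newly found (table, partition) pairs."""
        added = []
        for key in keys:
            if key in self.seen:
                continue  # incremental: skip files handled in an earlier run
            self.seen.add(key)
            table, partition, _ = parse_key(key)
            part = tuple(partition.items())
            if part not in self.tables[table]:
                self.tables[table].add(part)
                added.append((table, part))
        return added
```

Calling `update` again with a listing that contains both old and new keys registers only the partitions that were not known before, which is the property incremental updates need when files are uploaded continuously.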
2.2 In‑Lake Ingestion Techniques
DLA supports three ingestion modes: Mirror, Partition, and Incremental, enabling one‑click data sync from databases and message logs to OSS, supporting hot‑cold data separation and real‑time analytics.
3 Cloud‑Native Data Lake Platform Infrastructure
DLA uses a multi‑tenant architecture that is deployed per Region. Virtual clusters isolate tenants from one another, and the platform supports both Serverless Spark and Serverless SQL.
3.1 Efficient Resource Supply
Integrated with ECS, ACK, and ECI, DLA can scale by 300 nodes per minute and guarantee up to 50,000 compute nodes per large Region.
3.2 Security Protection
Per‑job temporary tokens that expire on suspicious activity.
DDoS and injection detection with automatic port blocking.
Isolated security containers for compute nodes.
Network isolation via ENI and VPC configuration.
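The first item in the list above, per‑job temporary tokens, can be sketched roughly as follows. This is an assumption‑laden illustration, not DLA's mechanism: `JobTokenIssuer`, its TTL, and the `revoke_job` hook for "suspicious activity" are hypothetical names chosen for the example.

```python
import secrets
import time


class JobTokenIssuer:
    """Issues short-lived tokens bound to a single job (hypothetical sketch)."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._tokens = {}  # token -> (job_id, expires_at)

    def issue(self, job_id):
        token = secrets.token_hex(16)
        self._tokens[token] = (job_id, self._clock() + self._ttl)
        return token

    def is_valid(self, token, job_id):
        entry = self._tokens.get(token)
        if entry is None:
            return False
        owner, expires_at = entry
        # A token is only valid for the job it was issued to, and only until it expires.
        return owner == job_id and self._clock() < expires_at

    def revoke_job(self, job_id):
        """Expire every token of a job, e.g. once suspicious activity is flagged."""
        stale = [t for t, (owner, _) in self._tokens.items() if owner == job_id]
        for token in stale:
            del self._tokens[token]
```

Binding each token to one job limits the damage a leaked credential can do: it cannot be used for another tenant's job, and it stops working as soon as it expires or is revoked.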
3.3 High‑Throughput Network
OSS access via high‑throughput bandwidth.
ENI‑based VPC access provides intra‑VPC bandwidth.
4 Serverless Spark Technical Challenges
Traditional Spark on data lakes suffers from poor OSS performance, complex debugging, and high operational cost.
4.1 Spark‑OSS Performance Optimizations
DLA implements a MultipartUpload‑based FileOutputFormat that writes directly to the final directory, which eliminates writes to temporary locations and reduces metadata operations. This yields up to 62% faster execution on a 1 TB TeraSort and a 124% improvement on dynamic partition workloads.
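The idea behind such a committer can be shown with a small Python model. It is a sketch under stated assumptions rather than DLA's code: `FakeObjectStore` is an in‑memory stand‑in for an object store whose multipart uploads stay invisible until they are completed, and `DirectTaskWriter` is a hypothetical name for the per‑task writer.

```python
class FakeObjectStore:
    """In-memory stand-in for an object store with multipart upload."""

    def __init__(self):
        self.objects = {}   # key -> bytes, visible to readers
        self._pending = {}  # upload_id -> (key, {part_number: bytes})
        self._next_id = 0

    def initiate_multipart(self, key):
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self._pending[upload_id] = (key, {})
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        self._pending[upload_id][1][part_number] = data

    def complete_multipart(self, upload_id):
        # Only now does the object become visible under its final key.
        key, parts = self._pending.pop(upload_id)
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))

    def abort_multipart(self, upload_id):
        self._pending.pop(upload_id, None)


class DirectTaskWriter:
    """Writes one task's output straight to its final key (no temp dir, no rename)."""

    def __init__(self, store, final_key):
        self.store = store
        self.upload_id = store.initiate_multipart(final_key)
        self.part_number = 0

    def write(self, data):
        self.part_number += 1
        self.store.upload_part(self.upload_id, self.part_number, data)

    def commit(self):
        self.store.complete_multipart(self.upload_id)

    def abort(self):
        self.store.abort_multipart(self.upload_id)
```

Because parts are invisible until the upload is completed, a failed task can simply abort and leaves nothing behind in the final directory. A rename‑based commit, by contrast, needs extra copy and delete operations on an object store, which is the overhead this approach avoids.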
4.2 OSS Metadata Cache
DLA caches OSS object metadata locally after the first access, which cuts ResolveRelation time roughly in half and improves overall query performance by about 60%.
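A minimal sketch of this kind of cache, assuming a caller‑supplied `fetch` function that stands in for a remote metadata request (the class and method names are hypothetical, not part of DLA or any OSS SDK):

```python
class MetadataCache:
    """Caches object metadata after the first lookup (hypothetical sketch)."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: key -> metadata, e.g. a remote HEAD request
        self._entries = {}
        self.remote_calls = 0    # counts how often the remote store was actually hit

    def head(self, key):
        if key not in self._entries:
            self.remote_calls += 1
            self._entries[key] = self._fetch(key)
        return self._entries[key]

    def invalidate(self, key):
        """Drop an entry, e.g. after the object has been overwritten."""
        self._entries.pop(key, None)
```

Query planning tends to look up the same files repeatedly, so serving the second and later lookups from memory removes most round trips. The sketch leaves out expiry and size limits, which a real cache would need to avoid serving stale metadata.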
4.3 Multi‑Tenant UI Service
DLA’s Spark UI removes the dependence on Spark event logs and stores only lightweight UI metadata in OSS. The stateless UI server scales horizontally, supports token‑based multi‑tenant isolation, and rotates logs automatically for long‑running jobs.
5 Serverless SQL Technical Challenges
DLA Serverless SQL builds on PrestoDB, adding multi‑tenant isolation and a multi‑Coordinator architecture to achieve high availability and seamless upgrades.
5.1 Multi‑Tenant Isolation
A ResourceManager collects per‑tenant resource usage from Coordinators, compares it to thresholds, and instructs Workers to penalize over‑consuming tenants, achieving isolation within ~1.3 seconds.
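The control loop described above can be sketched in a few lines of Python. The data shapes and the penalty formula are assumptions made for illustration; the source does not specify how Workers penalize a tenant.

```python
def find_over_limit(usage_reports, thresholds):
    """Aggregate per-tenant usage from all Coordinators and flag over-limit tenants.

    usage_reports: list of dicts, one per Coordinator, mapping tenant -> usage.
    thresholds:    dict mapping tenant -> allowed usage.
    Returns a dict mapping each over-limit tenant to its usage/threshold ratio.
    """
    totals = {}
    for report in usage_reports:
        for tenant, used in report.items():
            totals[tenant] = totals.get(tenant, 0.0) + used
    return {
        tenant: total / thresholds[tenant]
        for tenant, total in totals.items()
        if tenant in thresholds and total > thresholds[tenant]
    }


def penalty_factor(ratio):
    """Assumed policy: scale a tenant's scheduling share down by its overuse ratio."""
    return 1.0 / ratio
```

For example, a tenant that uses twice its threshold would get a factor of 0.5, which Workers could apply to that tenant's share of scheduling time until its usage falls back under the limit.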
5.2 Multi‑Coordinator Architecture
A FrontNode layer routes queries to multiple Coordinators in round‑robin fashion. ZooKeeper elects a master Coordinator for global tasks, which enables automatic failover and zero‑downtime upgrades.
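A simplified Python model of this routing follows. It is a sketch, not DLA's code: the election is replaced by an assumed rule (the first live Coordinator in registration order becomes master) instead of a real ZooKeeper client, and the class and method names are hypothetical.

```python
class FrontNode:
    """Round-robin router over Coordinators with a stand-in master election."""

    def __init__(self, coordinators):
        self.coordinators = list(coordinators)
        self.alive = set(self.coordinators)
        self._cursor = 0

    def mark_down(self, name):
        self.alive.discard(name)

    def mark_up(self, name):
        if name in self.coordinators:
            self.alive.add(name)

    def route(self):
        """Return the next live Coordinator, skipping any that are down."""
        for _ in range(len(self.coordinators)):
            candidate = self.coordinators[self._cursor % len(self.coordinators)]
            self._cursor += 1
            if candidate in self.alive:
                return candidate
        raise RuntimeError("no coordinator available")

    def master(self):
        """Assumed election rule: first live Coordinator in registration order."""
        for name in self.coordinators:
            if name in self.alive:
                return name
        return None
```

Taking one Coordinator out of rotation with `mark_down`, upgrading it, and returning it with `mark_up` lets queries keep flowing to the others, which is how a rolling upgrade avoids downtime in this model.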
6 End‑to‑End Best Practices
DLA provides a complete solution: unified metadata service with permission control, one‑click metadata crawling for OSS files, seamless data sync from RDS/PolarDB/MongoDB, Hudi streaming with T+10 min latency, serverless SQL and Spark with on‑demand scaling (up to 500 nodes per minute), and up to 10× cost‑performance improvement over traditional stacks.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
