Best Practices for Building Low‑Cost Data Lake Analytics with AnalyticDB MySQL and Serverless Spark
This article presents a comprehensive technical overview of Alibaba Cloud AnalyticDB MySQL and its Serverless Spark integration, detailing architecture, core optimizations, security enhancements, and real‑world case studies that demonstrate how to achieve cost‑effective, high‑performance data lake analytics.
The article introduces AnalyticDB MySQL (ADB) as a lake‑house product that combines self‑developed and open‑source components, covering the five key aspects of data acquisition, storage, computation, management, and application, and emphasizes the integration of Spark for AI and BI workloads.
In the Serverless Spark core optimization section, the architecture is described from the user entry points (SQL/Jar console, DMS, DataWorks, SparkSubmit) through the OpenAPI module, Spark control service, driver/executor cluster, down to the various data sources accessed via OSS, MaxCompute, or VPC‑linked services, highlighting multi‑tenant isolation and elastic resource provisioning.
Key enhancements include an OpenAPI suite of roughly 30 endpoints covering the full Spark job lifecycle, Airflow integration, ENI‑based network bridging for VPC isolation, a custom multi‑tenant Spark UI with efficient event rendering, automatic log rotation, and diagnostic and tuning recommendations.
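The lifecycle flow those endpoints enable (submit a job, poll its state, kill it) can be sketched as follows. This is an illustrative stand‑in, not the real AnalyticDB MySQL API: the class, method, and field names are assumptions, and a real client would sign and send HTTPS requests rather than use an in‑memory store.

```python
class SparkJobClient:
    """Hypothetical stand-in for an OpenAPI client managing Spark jobs."""

    def __init__(self):
        self._jobs = {}
        self._next_id = 0

    def submit_app(self, conf: dict) -> str:
        """Mirrors a submit-style endpoint: accepts a job spec, returns an app id."""
        self._next_id += 1
        app_id = f"app-{self._next_id}"
        self._jobs[app_id] = {"conf": conf, "state": "RUNNING"}
        return app_id

    def get_app_state(self, app_id: str) -> str:
        """Mirrors a state-query endpoint used for polling."""
        return self._jobs[app_id]["state"]

    def kill_app(self, app_id: str) -> None:
        """Mirrors a kill endpoint that terminates a running job."""
        self._jobs[app_id]["state"] = "KILLED"


client = SparkJobClient()
app_id = client.submit_app({
    "name": "etl-demo",
    "file": "oss://bucket/jars/etl.jar",        # illustrative OSS path
    "conf": {"spark.executor.instances": "4"},  # illustrative Spark conf
})
print(app_id, client.get_app_state(app_id))
```

An orchestrator such as Airflow would wrap exactly this submit/poll/kill triple in an operator, which is why a complete lifecycle API is the precondition for scheduler integration.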
The platform also offers a free Notebook supporting SQL, Python, and Scala, built‑in support for the Hudi and Delta Lake table formats, a unified Catalog system (HoodieCatalog, DeltaCatalog, ADBCatalog), and a high‑throughput Lakehouse API based on the Arrow format, achieving up to 6× faster data access than JDBC.
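The value of a unified catalog layer is that callers resolve tables through one entry point while format‑specific catalogs handle the details. A minimal sketch of that dispatch pattern, with method names and return values chosen purely for illustration (only the three catalog class names come from the article):

```python
class Catalog:
    """Common interface every format-specific catalog implements."""
    def load_table(self, name: str) -> str:
        raise NotImplementedError


class HoodieCatalog(Catalog):
    def load_table(self, name: str) -> str:
        return f"hudi:{name}"


class DeltaCatalog(Catalog):
    def load_table(self, name: str) -> str:
        return f"delta:{name}"


class ADBCatalog(Catalog):
    def load_table(self, name: str) -> str:
        return f"adb:{name}"


# Unified entry point: route the lookup by the table's format metadata.
CATALOGS = {"hudi": HoodieCatalog(), "delta": DeltaCatalog(), "adb": ADBCatalog()}

def load_table(fmt: str, name: str) -> str:
    return CATALOGS[fmt].load_table(name)

print(load_table("hudi", "orders"))
```

In Spark itself this routing is typically wired up through catalog plugin configuration rather than a hand-rolled registry, but the dispatch idea is the same.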
Security features comprise a RAM & STS token‑based OSS access control, eliminating AK/SK exposure, and a TEE‑based confidential computing engine certified by the China Academy of Information and Communications Technology.
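The point of the RAM/STS scheme is that a Spark job never holds a long‑lived AK/SK; it receives a short‑lived token scoped to the resources it needs. A hedged sketch of that pattern, where the function names, ARN strings, and token fields are illustrative rather than the real STS schema:

```python
import time

def issue_sts_token(role_arn: str, bucket: str, ttl_seconds: int = 900) -> dict:
    """Simulate issuing a short-lived credential scoped to one OSS bucket.

    Real STS returns signed temporary credentials; this sketch only models
    the two properties that matter here: scoping and expiry.
    """
    return {
        "role": role_arn,
        "resource": f"acs:oss:*:*:{bucket}/*",   # illustrative resource scope
        "expires_at": time.time() + ttl_seconds,  # token dies on its own
    }

def token_valid(token: dict) -> bool:
    """A job must refresh the token before this turns False."""
    return time.time() < token["expires_at"]


token = issue_sts_token("acs:ram::1234:role/spark-reader", "demo-bucket")
print(token_valid(token))
```

Because the token expires on its own and is scoped to a single bucket, a leaked credential has a far smaller blast radius than a static AK/SK.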
Performance optimizations cover OSS multipart upload with a semi‑transaction layer (setupJob, setupTask, commitTask, commitJob), native vectorized engines (Gluten + Velox, Databricks Photon) delivering 1.3–2.8× speedups, and a distributed cache service (LakeCache) providing more than 10× I/O acceleration.
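The four hooks above form a commit protocol: tasks stage output via multipart upload, and nothing becomes visible to readers until commitJob completes the uploads. A minimal sketch of that invariant, with an in‑memory dict standing in for OSS and all class and method internals being assumptions for illustration:

```python
class SemiTransactionCommitter:
    """Models the setupJob/setupTask/commitTask/commitJob protocol."""

    def __init__(self):
        self.staging = {}   # per-task pending parts (simulated multipart uploads)
        self.visible = {}   # committed objects visible to readers

    def setup_job(self):
        self.staging.clear()

    def setup_task(self, task_id: str):
        self.staging[task_id] = {}

    def write(self, task_id: str, key: str, data: bytes):
        # Upload a part: staged, but not yet visible to any reader.
        self.staging[task_id][key] = data

    def commit_task(self, task_id: str):
        # Mark this attempt as the winner for the task (speculative
        # duplicates that never commit are simply discarded).
        self.staging[task_id]["_committed"] = True

    def commit_job(self):
        # Complete the multipart uploads: publish committed tasks only.
        for parts in self.staging.values():
            if parts.pop("_committed", False):
                self.visible.update(parts)
        self.staging.clear()


c = SemiTransactionCommitter()
c.setup_job()
c.setup_task("t0")
c.write("t0", "part-0", b"rows")
c.commit_task("t0")
print("part-0" in c.visible)   # still hidden before commitJob
c.commit_job()
print("part-0" in c.visible)   # published atomically at job commit
```

This is why the layer is called a "semi‑transaction": it gives all‑or‑nothing visibility at the job level without requiring a full transactional store underneath.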
Three customer case studies illustrate practical benefits: (1) high‑throughput lake‑warehouse ingestion achieving 8 GB/s write speed and 20 PB storage, (2) a low‑cost lakehouse built with ADB Spark, Hudi, and OSS reducing compute time by 3× and cost by up to 50%, and (3) migration from CDH to ADB Spark delivering a 20% cost reduction, 80% lower operational overhead, and flexible elastic scaling.
The article concludes with a promotional trial offering 5000 ACU‑hours plus 100 GB storage and a DingTalk community link for further engagement.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.