
AnalyticDB Spark Architecture and Vectorized Engine Performance Overview

This article introduces the AnalyticDB Spark architecture, explains the need for Spark vectorization, surveys industry vectorized solutions, details ADB Spark's own vectorized implementation with Gluten and Velox, and presents performance test results showing a 6.98‑fold speedup over open‑source Spark.

DataFunSummit

Aug 17, 2024


AnalyticDB Spark (ADB Spark) is a Spark engine based on open-source Apache Spark and built on top of Alibaba Cloud's cloud-native data warehouse AnalyticDB MySQL. It provides serverless Spark clusters, multi-tenant resource management, and secure metadata services.

Users can submit jobs through the console, DMS, or spark-submit scripts, while the underlying control plane handles resource allocation, metadata, multi-tenant isolation, and security.

The engine supports elastic serverless Spark clusters that obtain resources through a unified metadata service and control plane. Clusters reach data sources in two ways: OSS and MaxCompute are accessed via AnyTunnel with STS tokens, while VPC-based services (ADB, RDS, HBase) are accessed through ENI networking.

The article explains why Spark vectorization is essential: since Spark 2.4, operator-level optimizations have plateaued, while native columnar engines such as ClickHouse, and engines built on the Arrow columnar format, achieve far better performance through vectorized execution.
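The difference between row-at-a-time and vectorized execution can be illustrated with a toy example. This is a minimal sketch, not Spark or Velox code: the table, column names, and function names are hypothetical, and plain Python lists stand in for columnar batches. A real native engine gains its speed from contiguous memory and SIMD-friendly loops, which this sketch only mimics structurally.

```python
from typing import Dict, List

# Hypothetical toy table: three rows with a price and a quantity column.
rows: List[Dict[str, float]] = [
    {"price": 2.0, "qty": 3.0},
    {"price": 5.0, "qty": 1.0},
    {"price": 1.5, "qty": 4.0},
]


def revenue_row_at_a_time(table_rows: List[Dict[str, float]]) -> float:
    """Row-oriented execution: the whole expression is evaluated once per row."""
    total = 0.0
    for row in table_rows:
        total += row["price"] * row["qty"]
    return total


def to_columns(table_rows: List[Dict[str, float]]) -> Dict[str, List[float]]:
    """Pivot rows into a columnar batch: one list per column."""
    return {name: [r[name] for r in table_rows] for name in table_rows[0]}


def revenue_vectorized(batch: Dict[str, List[float]]) -> float:
    """Column-oriented execution: each operator processes a whole batch at once."""
    products = [p * q for p, q in zip(batch["price"], batch["qty"])]  # multiply step
    return sum(products)  # aggregate step


batch = to_columns(rows)
# Both strategies compute the same answer; only the execution shape differs.
assert revenue_row_at_a_time(rows) == revenue_vectorized(batch)
```

In the vectorized version each operator runs a tight loop over one column, which is the shape that native engines can compile down to cache-friendly, SIMD-capable code.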

Industry vectorized solutions include Databricks Photon, the open-source Gluten project (which uses Velox or ClickHouse as its backend), Kuaishou's Blaze (built on the Rust-based DataFusion engine), and Apple's datafusion-comet.

ADB Spark evaluated several options and selected the Gluten + Velox combination, which delivered an initial 1.76× speedup. After integration into ADB Spark, the overall performance improvement over open-source Spark reached 6.98×.

In the Gluten + Velox workflow, Spark's Catalyst optimizer generates a physical plan, which Gluten transforms into native operators. Supported operators run on Velox via JNI, while unsupported ones fall back to Spark's JVM implementation.
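The offload-or-fallback decision described above can be sketched as follows. This is an illustrative model only, not Gluten's actual code: the supported-operator set, the `PlannedOperator` type, and the function names are all assumptions made for the example, and a real physical plan is a tree rather than a flat list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Assumption: an illustrative set of operators the native backend can execute.
NATIVE_SUPPORTED: FrozenSet[str] = frozenset(
    {"Scan", "Filter", "Project", "HashAggregate"}
)


@dataclass
class PlannedOperator:
    name: str
    engine: str  # "native" (runs on the native backend) or "jvm" (fallback)


def plan_offload(
    physical_plan: List[str],
    supported: FrozenSet[str] = NATIVE_SUPPORTED,
) -> List[PlannedOperator]:
    """Assign each operator to the native engine if supported, else to the JVM."""
    return [
        PlannedOperator(op, "native" if op in supported else "jvm")
        for op in physical_plan
    ]


def count_transitions(planned: List[PlannedOperator]) -> int:
    """Count engine switches between adjacent operators.

    Each switch implies converting data between columnar and row layouts,
    so fewer transitions generally means less overhead.
    """
    return sum(1 for a, b in zip(planned, planned[1:]) if a.engine != b.engine)


# Hypothetical linearized plan with one operator the backend does not support.
plan = ["Scan", "Filter", "PythonUDF", "HashAggregate"]
planned = plan_offload(plan)
for op in planned:
    print(f"{op.name:>14} -> {op.engine}")
print("engine transitions:", count_transitions(planned))
```

The transition count hints at why operator coverage matters: every fallback in the middle of a plan adds a pair of layout conversions around it.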

The native engine also adds features such as fully encrypted computation (TEE), simplified configuration for enabling the native engine, secure OSS access through RAM and STS token refresh, enhanced UDF support (e.g., from_json), and integration with the Lakecache intelligent cache to accelerate I/O.
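For context on what "configuration for enabling the native engine" involves, the sketch below renders a set of Spark properties as spark-submit flags. The property names follow the open-source Gluten documentation as best understood here and should be checked against the Gluten release in use (older releases used `io.glutenproject.GlutenPlugin` as the plugin class). They are not ADB Spark's simplified switch, which the article does not spell out. The off-heap size and the helper name `to_spark_submit_args` are placeholders chosen for illustration.

```python
from typing import Dict, List

# Assumption: typical open-source Gluten + Velox settings, not ADB Spark's own.
gluten_conf: Dict[str, str] = {
    "spark.plugins": "org.apache.gluten.GlutenPlugin",
    # Native engines allocate outside the JVM heap, so off-heap memory is enabled.
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "20g",  # placeholder value
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
}


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a property map as a flat list of `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


print(" ".join(to_spark_submit_args(gluten_conf)))
```

Collapsing several such properties behind a single switch is one plausible reading of "simplified configuration", though the article does not describe the mechanism.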

Performance testing on the TPC-H 1 TB benchmark (all queries) shows a total query time of 4351.506 s for open-source Spark 3.2.0 versus 623.273 s for ADB Spark 3.2.0, confirming a 6.98× speedup.
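The speedup figure follows directly from the two totals reported above. A short check, using only the numbers from the article (the variable names are illustrative):

```python
# Total TPC-H 1 TB query times reported in the article, in seconds.
open_source_seconds = 4351.506
adb_spark_seconds = 623.273

# Speedup is the ratio of baseline time to optimized time.
speedup = open_source_seconds / adb_spark_seconds

# Equivalent view: the share of total run time that was eliminated.
time_saved_pct = (1 - adb_spark_seconds / open_source_seconds) * 100

print(f"speedup: {speedup:.2f}x")  # prints "speedup: 6.98x"
print(f"run time reduced by about {time_saved_pct:.0f}%")
```

Note that this is a ratio of total times across all queries, so it says nothing about how individual queries fared.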

The article concludes with future plans: opening the vectorized capability to all customers, expanding supported data sources (e.g., JindoFS, AWS S3), adding more UDFs, continuously tracking Gluten/Velox community updates, and combining the vectorized engine with Alibaba Cloud’s Yitian hardware for further performance gains.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

