Feb 1, 2025 · Big Data

Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

This article explains the challenges of big‑data processing in the cloud era, introduces Spark’s native‑language SQL engine rewrites, discusses vectorization and code generation techniques, describes cloud‑native storage‑compute separation with Remote Shuffle services such as Apache Celeborn, and presents the production benefits of Alibaba Cloud’s EMR Serverless Spark.

Big DataCodegenRemote Shuffle

0 likes · 12 min read

Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

DataFunSummit

Aug 7, 2023 · Big Data

Performance Optimizations in Impala for Data Lake Queries: Iceberg and Codegen Enhancements

This article presents a comprehensive overview of Impala's high‑performance MPP query engine, its architecture for data‑lake workloads, and detailed performance optimizations including Iceberg table format improvements, manifest caching, and various Codegen techniques such as asynchronous compilation and caching.

Big DataCodegenData Lake

0 likes · 17 min read

Performance Optimizations in Impala for Data Lake Queries: Iceberg and Codegen Enhancements