Vectorized Execution in Apache Spark: Meituan’s Practice with Gluten and Velox
Meituan enhances Apache Spark by integrating the Gluten+Velox vectorized execution engine, converting row‑wise operations to columnar SIMD processing. The migration yields over 40% memory savings and up to 13% faster runtimes across thousands of ETL jobs, while addressing stability, ORC support, shuffle redesign, and off‑heap memory optimization.
Apache Spark is a widely used compute engine for data engineering and machine learning. Vectorized execution can save resources and accelerate jobs without hardware upgrades. Meituan adopts the Gluten+Velox solution to provide a vectorized execution engine for Spark, and this article presents their practice and insights.
1. What is vectorized computing
1.1 Parallel data processing with SIMD instructions – example of adding two arrays element‑wise. The traditional SISD execution loads each operand, computes, and stores the result for every element. SIMD allows a single instruction to operate on multiple data elements simultaneously, improving CPU utilization.
1.2 Vectorized execution framework – data locality and runtime overhead. Row‑by‑row processing suffers from poor CPU cache hit rate, variable‑length fields, and virtual‑function call overhead. Organizing data column‑wise (Block/Vector) improves cache locality and reduces function calls, enabling better compiler auto‑vectorization.
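A toy illustration of the locality difference described above, using hypothetical Row/ColumnTable types (not from Velox): summing one field in a row‑wise layout drags every record's unrelated fields through the cache, while the columnar layout scans one contiguous array that the compiler can also auto‑vectorize.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Row-wise layout: all fields of a record sit together, so scanning a
// single column strides over unrelated bytes and wastes cache lines.
struct Row {
    int64_t id;
    double  price;
    int32_t quantity;
};

double sumPricesRowWise(const std::vector<Row>& rows) {
    double total = 0.0;
    for (const Row& r : rows) total += r.price;  // also loads id/quantity
    return total;
}

// Column-wise layout (Block/Vector style): each column is one contiguous
// array, so the same scan is cache-friendly and easy to auto-vectorize.
struct ColumnTable {
    std::vector<int64_t> id;
    std::vector<double>  price;
    std::vector<int32_t> quantity;
};

double sumPricesColumnar(const ColumnTable& t) {
    return std::accumulate(t.price.begin(), t.price.end(), 0.0);
}
```

Both functions compute the same result; the columnar one simply touches far less memory per useful byte, which is the core of the cache‑locality argument.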
1.3 How to use vectorized computing
Auto‑vectorization via compiler flags (e.g., gcc -ftree-vectorize -O3).
Inspect compiler hints and disassembly for vectorized loops.
Use libraries such as Intel Intrinsics or xsimd.
Embed SIMD assembly (architecture‑specific).
Compiler directives like #pragma simd or OpenMP #pragma omp simd.
Compiler hints such as __restrict to aid optimization.
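Combining two of the hints above, here is a minimal loop written so the compiler can auto‑vectorize it (the function name addArrays is illustrative): `__restrict` promises the arrays do not alias, removing the overlap check that would otherwise block vectorization. Compiling with `g++ -O3 -fopt-info-vec` (or `-O2 -ftree-vectorize`) reports which loops were vectorized.

```cpp
// A scalar-looking loop the compiler can auto-vectorize: __restrict
// tells it that a, b, and c never overlap, so it is free to load and
// add several elements per SIMD instruction.
void addArrays(const int* __restrict a, const int* __restrict b,
               int* __restrict c, int num) {
    for (int i = 0; i < num; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

No intrinsics, pragmas, or alignment attributes are needed here; the intrinsics version below trades that simplicity for explicit control over the generated instructions.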
Example of a vectorized array addition using AVX intrinsics (the standard headers missing from the original listing are added here):

#include <immintrin.h>  // AVX intrinsics
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>

void addArraysAVX(const int* a, const int* b, int* c, int num) {
    assert(num % 8 == 0);  // one AVX2 register holds 8 x 32-bit ints
    for (int i = 0; i < num; i += 8) {
        __m256i v_a = _mm256_load_si256((const __m256i*)&a[i]);
        __m256i v_b = _mm256_load_si256((const __m256i*)&b[i]);
        __m256i v_c = _mm256_add_epi32(v_a, v_b);  // 8 additions per instruction
        _mm256_store_si256((__m256i*)&c[i], v_c);
    }
}

int main(int argc, char* argv[]) {
    const int ARRAY_SIZE = 64 * 1024;
    // _mm256_load/store_si256 require 32-byte-aligned addresses
    int a[ARRAY_SIZE] __attribute__((aligned(32)));
    int b[ARRAY_SIZE] __attribute__((aligned(32)));
    int c[ARRAY_SIZE] __attribute__((aligned(32)));
    srand(time(0));
    for (int i = 0; i < ARRAY_SIZE; ++i) { a[i] = rand(); b[i] = rand(); c[i] = 0; }
    auto start = std::chrono::high_resolution_clock::now();
    addArraysAVX(a, b, c, ARRAY_SIZE);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "addArraysAVX took "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " microseconds." << std::endl;
    return 0;
}

Compiling with g++ test.cpp -O0 -std=c++11 -mavx2 -o test shows a roughly 3× speedup compared with the scalar version.
2. Why vectorize Spark
OLAP engines (ClickHouse, Doris) already use vectorization. Projects like Databricks Photon and Meta’s Velox demonstrate 4× performance gains. Gluten+Velox enables Spark (a Java stack) to achieve similar gains. In Meituan’s large‑scale data warehouse, vectorization saves memory and speeds up jobs, achieving at least a 2× speedup in TPC‑H benchmarks.
3. Implementation at Meituan
3.1 Overall construction – focus on resource saving, use native C++/Rust instead of Java, adopt a plug‑in architecture, keep migration transparent, and develop a black‑box ETL test tool for result verification.
3.2 Spark+Gluten+Velox workflow – Spark extensions inject ColumnarOverrideRules, converting supported operators to native columnar operators. The driver sends a Substrait plan to executors, which invoke the native backend via JNI.
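As a rough sketch of how such a deployment is switched on, Gluten is loaded as a Spark plugin with off‑heap memory enabled for the native backend. The exact keys and the plugin's package name vary across Gluten versions, so the fragment below follows the pattern of the Gluten 1.x getting‑started docs and should be checked against the version in use:

```properties
# Load the Gluten plugin (package name depends on the Gluten version)
spark.plugins                 io.glutenproject.GlutenPlugin
# Keep shuffle data in columnar format end to end
spark.shuffle.manager         org.apache.spark.shuffle.sort.ColumnarShuffleManager
# Velox allocates from off-heap memory, which must be enabled and sized
spark.memory.offHeap.enabled  true
spark.memory.offHeap.size     20g
```

Operators the plugin cannot convert fall back to Spark's row‑based execution, which is what keeps the migration transparent to job owners.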
3.3 Staged rollout – hardware compatibility check, stability verification, performance validation, consistency checks, and gradual gray‑release.
Example of checking CPU SIMD support:
grep --color -wE "bmi|bmi2|f16c|avx|avx2|sse" /proc/cpuinfo
4. Challenges encountered
4.1 Stability – OOM during aggregation due to Velox’s flush thresholds; SIMD‑induced crashes caused by misaligned 16‑byte addresses in Arrow memory pools; resolved by lowering flush thresholds and fixing Arrow alignment.
4.2 ORC support – Velox originally handled DWRF and Parquet. Meituan added full ORC support, RLEv2 decoding with BMI2, Decimal type handling, file‑handle reuse, and ISA‑L accelerated decompression, achieving up to 2× read performance.
4.3 Native HDFS client – reduced random‑read amplification and routed reads away from slow DataNodes, cutting latency by two‑thirds and doubling throughput.
4.4 Shuffle redesign – introduced a compositional shuffle interface to avoid code explosion and eliminated extra RowVector↔RecordBatch conversions.
4.5 HBO (Historical Based Optimization) – extended on‑heap resource prediction to off‑heap memory, improving memory savings from 30% to 40% and eliminating OOMs.
4.6 Consistency – fixed issues with older ORC footers, distinct‑count aggregation, and floating‑point precision by adjusting Velox’s aggregation logic and string conversion settings.
5. Production impact
Over 20,000 ETL jobs have been migrated, yielding >40% average memory savings and ~13% reduction in execution time. A 30 TB job that previously took 7 h now finishes within 2 h thanks to vectorized decompression and columnar processing.
6. Future plans
Scale vectorized Spark to the majority of SQL workloads in 2024, upgrade to Spark 3.5 for better compatibility, continuously follow Gluten/Velox releases, broaden vectorized operator and UDF coverage, and convert remaining text‑file tables to ORC.
7. Acknowledgements
Thanks to Intel Gluten contributors, Velox community members, and Meituan’s platform team.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.