How Ant Group’s Flex Engine Supercharges Flink with Vectorization
This article details Ant Group’s Flex vectorized engine built on Velox, covering the current state of vectorization, Flex’s architecture (Flink + Velox), core feature development, correctness guarantees, large‑scale deployment results, and future directions for full‑link vectorization and broader hardware support.
Vectorization Technology Overview
Vectorization uses SIMD (Single Instruction, Multiple Data) instructions to process multiple data items per instruction, delivering substantial performance gains for data-parallel workloads. Modern CPUs (x86 SSE/AVX, ARM Neon) natively support SIMD; compilers such as GCC and LLVM auto-vectorize suitable loops, and frameworks like Gandiva use LLVM to JIT-compile vectorized expression kernels. Open-source libraries like xsimd also expose portable SIMD primitives across x86 and ARM platforms.
Hardware support: richer register resources.
SIMD instruction sets: native CPU support.
Compiler optimizations: GCC, LLVM auto‑vectorize code.
Specialized libraries: xsimd.
For compute-intensive workloads, vectorization dramatically accelerates execution, though it depends on suitable hardware (wide SIMD registers and instruction sets) and a software stack configured to exploit them.
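The idea can be illustrated with a loop that compilers like GCC and LLVM will auto-vectorize at `-O2`/`-O3`; this is a minimal sketch of the technique, not code from Flex:

```cpp
#include <vector>

// Element-wise multiply-add over contiguous arrays: a simple,
// dependency-free loop body that GCC/LLVM can auto-vectorize into
// SSE/AVX or Neon instructions, processing several floats at once.
void fma_loop(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = a[i] * b[i] + out[i];
    }
}
```

Because each iteration is independent, the compiler can spread the loop across SIMD lanes; data-dependent branches or loop-carried dependencies would block this optimization.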
System Architecture
Flex is a unified vectorized engine that combines Flink’s streaming capabilities with Velox’s batch‑oriented vectorized execution, expressed as Flex = Flink + Velox. The architecture consists of several layers:
JNI Glue Layer – adapts Flink to Velox via native calls.
Native Operator Layer – implements Flink plan nodes in native C++.
Plan Conversion Layer – translates Flink execution plans into equivalent Velox plan nodes.
Data Conversion Layer – handles row‑to‑column and column‑to‑row transformations.
Fallback Layer – routes unsupported operators back to Flink’s Java implementation.
Unified Memory Management Layer – provides a common memory pool for native operators.
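To make the Data Conversion Layer concrete, here is a hypothetical sketch of its row-to-column step; the `Row` and `ColumnBatch` types are illustrative stand-ins for Flink's RowData and a Velox RowVector, not Flex's actual API:

```cpp
#include <string>
#include <vector>

// Illustrative row record, standing in for Flink's row-oriented RowData.
struct Row {
    long id;
    std::string name;
};

// Illustrative columnar batch, standing in for a Velox RowVector:
// one contiguous vector per column.
struct ColumnBatch {
    std::vector<long> ids;
    std::vector<std::string> names;
};

// Row-to-column conversion: gather each field across all rows into its
// own contiguous column, the layout that SIMD kernels operate on.
ColumnBatch rowsToColumns(const std::vector<Row>& rows) {
    ColumnBatch batch;
    batch.ids.reserve(rows.size());
    batch.names.reserve(rows.size());
    for (const Row& r : rows) {
        batch.ids.push_back(r.id);
        batch.names.push_back(r.name);
    }
    return batch;
}
```

The column-to-row direction is the mirror image; keeping both cheap is exactly the concern the "low-overhead row-to-vector conversion" future-work item addresses.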
Core Feature Development
Flex focuses on performance, correctness, usability, and stability:
Execution‑plan optimizations: a rich set of rewrite rules that speed up expression evaluation.
Native operators: support for SLS, ODPS, and Paimon connectors.
Efficient data representation: Arrow‑based ColumnRowData for high‑performance columnar processing.
SIMD‑accelerated functions: over 15 string‑handling functions and numerous mathematical functions rewritten with SIMD.
Projection reorder: isolates heavy RexInputRef or long‑value columns into native calc operators.
Correlation and join acceleration: rewrite correlation sub‑queries into native SIMD‑enabled joins.
Stream‑join condition SIMD acceleration: extract JSON‑value or ON/WHERE predicates into native calc for faster evaluation.
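As an example of the kind of string kernel that benefits from SIMD rewriting, here is a branch-free ASCII upper-casing loop; compilers can auto-vectorize it, and hand-written SIMD versions follow the same data-parallel shape. This is a hypothetical illustration, not Flex's implementation:

```cpp
#include <string>

// Branch-free ASCII upper-case: each byte is transformed independently
// with no data-dependent branches, so the loop maps directly onto SIMD
// lanes (compare, mask, and subtract on 16/32 bytes per instruction).
std::string toUpperAscii(const std::string& s) {
    std::string out(s);
    for (char& c : out) {
        unsigned char u = static_cast<unsigned char>(c);
        // 0x20 is the case bit in ASCII; clear it only for 'a'..'z'.
        unsigned char isLower = (u >= 'a') & (u <= 'z');
        c = static_cast<char>(u - (isLower << 5));
    }
    return out;
}
```

Replacing the per-byte branch (`if (c >= 'a' && c <= 'z')`) with a mask-and-subtract is the standard trick that lets the same logic run on whole SIMD registers at a time.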
Correctness and Data Integrity Guarantees
Two verification systems ensure semantic alignment between Flink (Java) and Velox (C++) implementations:
Function‑level automated testing: reuses Flink’s unit tests, runs them on both the legacy and new stacks, and performs bit‑wise result comparison to surface any divergence.
Job‑level verification (Minos): generates twin jobs (one writing to Hive, the other to Paimon) for the same SQL, compares partitioned results, and flags mismatches.
Additional mechanisms include a blacklist for function signatures, fine‑grained fallback for complex types (TIMESTAMP, DECIMAL), and configurable function mapping to reuse existing implementations across Flink and Velox.
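Bit-wise comparison matters because floating-point results that print identically can still differ in their binary representation. A minimal sketch of such a check, illustrative rather than the actual test harness:

```cpp
#include <cstdint>
#include <cstring>

// Compare two doubles bit-for-bit rather than with ==, so that
// divergences invisible to ordinary equality (e.g. -0.0 vs 0.0)
// between the Java and C++ stacks are still flagged.
bool bitwiseEqual(double legacy, double vectorized) {
    std::uint64_t a, b;
    std::memcpy(&a, &legacy, sizeof(a));
    std::memcpy(&b, &vectorized, sizeof(b));
    return a == b;
}
```

For example, `0.0 == -0.0` is true under ordinary floating-point equality, but the two values differ in their sign bit, so a bit-wise check reports them as divergent.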
Application, Deployment, and Results
Flex has been rolled out at Ant Group on a massive scale:
Supported >6,800 production jobs.
Enabled by default in Flink 1.18.1.2.
Saved more than 30,000 CPU cores.
Achieved ~50% TPS improvement for Paimon‑based workloads.
Core use cases include exposure, click, and access tracking, as well as advertising, “touch‑and‑go”, flash‑sale, financial scheduling, and large‑scale security protection.
Future Work and Priorities
Key directions for the next phases include:
Full‑link vectorization covering stateful operators and watermark handling.
Low‑overhead row‑to‑vector conversion to eliminate costly data copies.
Extended support for stateful operators and complex SQL types (nested rows, arrays, maps).
Portability to ARM and GPU architectures.
Broader SIMD function coverage, especially for mathematical kernels.
Overall, Flex demonstrates how integrating a mature vectorized engine (Velox) with a streaming framework (Flink) can deliver substantial performance gains, robust correctness guarantees, and a scalable path forward for modern data‑intensive applications.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.