Why Vectorization Supercharges Database Performance: Deep Dive into StarRocks
This article explains how CPU‑centric vectorization, especially SIMD, reduces instruction count and CPI; how it addresses the four major CPU bottlenecks; and how StarRocks systematically combines automatic and manual SIMD techniques, verification methods, and a suite of engineering optimizations to achieve multi‑fold query speedups.
Why Vectorization Improves Database Performance
Most modern databases run on CPUs, so performance optimization ultimately comes down to improving CPU execution. The fundamental equation is CPU Time = Instruction Count * CPI * Clock Cycle Time; reducing the instruction count or the cycles per instruction (CPI) directly speeds up queries.
Instruction Count – determined by the program's complexity.
CPI – the number of cycles needed to execute each instruction.
Clock Cycle Time – determined by hardware characteristics.
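The equation can be made concrete with a minimal sketch. The numbers below are hypothetical, for illustration only; the point is that halving either factor halves CPU time:

```cpp
// The CPU-time equation as code (illustrative sketch, hypothetical inputs).
// CPU Time = Instruction Count * CPI * Clock Cycle Time,
// where Clock Cycle Time = 1 / clock frequency.
double cpu_time_seconds(double instructions, double cpi, double clock_hz) {
    return instructions * cpi * (1.0 / clock_hz);
}
```

For example, 2 billion instructions at a CPI of 0.8 on a 3 GHz core take roughly 0.53 s; vectorization attacks the first two factors, since the clock is fixed by hardware.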
CPU execution consists of five stages: fetch, decode, execute, memory access, and write‑back. The frontend handles fetch and decode, while the backend handles the remaining stages. Intel’s Top‑down Microarchitecture Analysis Method classifies performance bottlenecks into four categories: Retiring, Bad Speculation, Frontend Bound, and Backend Bound. Their primary causes are lack of SIMD, branch misprediction, instruction‑cache misses, and data‑cache misses respectively. Vectorization mitigates all four bottlenecks.
Fundamentals of SIMD
SIMD (Single Instruction Multiple Data) allows a single instruction to operate on multiple data elements simultaneously, contrasting with traditional SISD (single‑instruction single‑data). For a simple addition A + B = C performed on four data pairs, scalar code requires eight loads, four adds, and four stores, whereas 128‑bit SIMD reduces this to two loads, one add, and one store, yielding a theoretical 4× speedup; 512‑bit SIMD can provide up to 16×.
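The A + B = C example can be sketched in C++ using SSE intrinsics from `<immintrin.h>` (this assumes an x86 CPU; the intrinsic names are the standard Intel ones, not StarRocks code):

```cpp
#include <immintrin.h>  // x86 SSE intrinsics

// Scalar: for 4 pairs the compiler emits ~8 loads, 4 adds, 4 stores.
void add_scalar(const float* a, const float* b, float* c) {
    for (int i = 0; i < 4; ++i) c[i] = a[i] + b[i];
}

// 128-bit SIMD: the same work in 2 loads, 1 add, 1 store.
void add_sse(const float* a, const float* b, float* c) {
    __m128 va = _mm_loadu_ps(a);           // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b);           // load 4 floats from b
    _mm_storeu_ps(c, _mm_add_ps(va, vb));  // one packed add, one store
}
```

In practice a good compiler will turn the scalar loop into the SIMD form on its own; the intrinsic version simply makes the 2‑load/1‑add/1‑store shape explicit.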
How to Trigger Vectorization
Compiler automatic vectorization – no code changes, works for simple loops.
Providing compiler hints (e.g., pragma, attributes) to expose more parallelism.
Using parallel programming APIs such as OpenMP or Intel Cilk to insert pragmas.
Employing wrapper libraries around SIMD intrinsics.
Directly writing SIMD intrinsics.
Writing assembly code manually.
StarRocks prefers the first two approaches (automatic vectorization and hints). For performance‑critical paths that the compiler cannot auto‑vectorize, developers resort to manual SIMD intrinsics.
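A sketch of the "hints" approach: the loop below is a typical auto‑vectorization candidate, and `__restrict__` (a GCC/Clang extension) tells the compiler the pointers do not alias, removing a dependence the vectorizer would otherwise have to prove away. This is a generic example, not StarRocks source code:

```cpp
#include <cstddef>

// Auto-vectorization candidate: independent elements, unit stride, no
// branches. `__restrict__` asserts that a, b, and c do not alias.
// Compiler-specific pragmas such as `#pragma GCC ivdep` or
// `#pragma clang loop vectorize(enable)` express the same intent.
void scale_add(float* __restrict__ c, const float* __restrict__ a,
               const float* __restrict__ b, float k, size_t n) {
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] * k + b[i];  // FMA-friendly: one multiply-add per element
}
```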
Further details on hint usage and intrinsics can be found in the author’s personal blog: https://blog.bcmeng.com/post/database-learning.html
Verifying Vectorized Code
Compile with flags that report vectorization decisions, e.g., -fopt-info-vec-all, -fopt-info-vec-optimized, -fopt-info-vec-missed, -fopt-info-vec-note (GCC).
Inspect the generated assembly (using Godbolt, perf, VTune, etc.). Presence of registers xmm/ymm/zmm or instructions prefixed with v indicates SIMD execution.
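Both checks can be exercised on a small reduction loop like the one below (a generic example; the flags shown in the comments are GCC's):

```cpp
// Paste into godbolt.org with -O3 -march=native and inspect the assembly:
// xmm/ymm/zmm registers or v-prefixed instructions (vaddps, vmovups, ...)
// indicate SIMD code generation. Or compile locally with e.g.:
//   g++ -O3 -fopt-info-vec-optimized sum.cpp   # reports "loop vectorized"
#include <cstddef>

float sum(const float* data, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += data[i];  // reduction; GCC may need -ffast-math to vectorize
    return total;
}
```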
Database Vectorization in StarRocks
Vectorizing a database is a large‑scale, systematic performance‑engineering effort. The main challenges include:
Adopting a columnar layout across disk, memory, and network, requiring a complete redesign of storage and execution engines.
Ensuring every operator, expression, and function has a vectorized implementation – a multi‑year engineering effort.
Maximizing SIMD usage within operators, which demands case‑by‑case micro‑optimizations.
Redesigning memory management to handle column‑oriented batches.
Reworking core data structures (join, aggregate, sort, etc.) for columnar processing.
Achieving >5× speedup for all critical operators, eliminating performance bottlenecks.
Key Optimization Areas (Seven Points)
Leverage high‑performance third‑party libraries (Parallel Hashmap, Fmt, SIMD Json, Hyper Scan).
Adopt efficient data structures and algorithms – e.g., a low‑cardinality global dictionary that transforms string Group‑By into integer Group‑By, yielding ~3× query speedup.
Implement adaptive runtime optimizations, such as dynamically selecting join runtime filters based on selectivity (keep at most three useful filters).
Extensive SIMD usage in operators and expressions, reducing both branch‑prediction errors and instruction‑cache misses.
Low‑level C++ tuning (move vs. copy, reserve vectors, inlining, loop unrolling, compile‑time calculations).
Memory‑pool reuse (e.g., a Column Pool and block‑based allocation for HLL aggregation) to cut allocation overhead and improve throughput.
CPU cache optimization, including prefetching strategies to mitigate cache‑miss penalties.
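The low‑cardinality global dictionary idea from the second point can be sketched as follows. This is an illustrative toy, not StarRocks' actual implementation: strings are dictionary‑encoded into dense integer codes once, after which Group‑By degenerates into indexing a counter array instead of hashing strings:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Dictionary-encode a low-cardinality string column into integer codes.
// (Illustrative sketch only, not StarRocks' implementation.)
std::vector<uint32_t> encode(const std::vector<std::string>& col,
                             std::vector<std::string>& dict) {
    std::unordered_map<std::string, uint32_t> ids;
    std::vector<uint32_t> codes;
    codes.reserve(col.size());
    for (const auto& s : col) {
        auto [it, inserted] = ids.emplace(s, (uint32_t)dict.size());
        if (inserted) dict.push_back(s);  // first occurrence defines the code
        codes.push_back(it->second);
    }
    return codes;
}

// "String Group-By" reduced to counting per integer code: the inner loop is
// tight, branch-free, and cache-friendly.
std::vector<uint64_t> group_count(const std::vector<uint32_t>& codes,
                                  size_t dict_size) {
    std::vector<uint64_t> counts(dict_size, 0);
    for (uint32_t c : codes) ++counts[c];
    return counts;
}
```

The per‑row cost drops from a string hash and comparison to a single array increment, which is where the multi‑fold speedup on low‑cardinality columns comes from.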
These combined efforts have turned StarRocks into a mature, high‑performance MPP analytical database that consistently delivers multi‑fold query speed improvements.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.