Vectorized Storage Layer Refactoring in Apache Doris: Design, Implementation, and Performance Evaluation
This article explains the motivation, design, and implementation of vectorizing Apache Doris's storage layer using SIMD techniques, covering engine overview, vectorized programming concepts, storage architecture, index and predicate optimizations, delayed materialization, output improvements, and performance test results.
Introduction – The article introduces the vectorized transformation of Apache Doris's storage layer aimed at boosting query performance by leveraging vectorization features.
01 Apache Doris Engine Overview – Doris is positioned as an MPP OLAP database that supports both real‑time and batch data import (from Spark, Flink, and relational databases) and delivers sub‑second query latency, at a lower development and usage cost than building equivalent analytics pipelines on Flink or Spark.
02 Vectorized Programming Introduction – Vectorized programming processes columns in batches rather than row by row, so a single SIMD (single‑instruction, multiple‑data) instruction can operate on many values at once; this is well suited to aggregate calculations such as sum, min, and max in analytical workloads.
03 Apache Doris Storage Layer Overview – The storage layer reads data, deserializes it, and performs fine‑grained splitting, decoding, and merging (compaction). Queries may involve merging multiple files, handling fixed‑length and variable‑length columns, and applying predicate filters.
04 Storage Layer Vectorization Refactoring – Refactoring steps include: (1) identifying code paths suitable for SIMD (e.g., batch reads, comparisons); (2) rewriting those modules with SIMD intrinsics; (3) evaluating alternative optimizations for non‑vectorizable logic.
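Step (2) can be illustrated with a hypothetical batch comparison kernel (illustrative only, not Doris source): an SSE2 path compares four int32 values per instruction and falls back to a scalar loop where SSE2 is unavailable.

```cpp
#include <cstddef>
#include <cstdint>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

// Writes one byte per row: 1 if vals[i] > threshold, else 0.
void gt_mask(const int32_t* vals, size_t n, int32_t threshold, uint8_t* out) {
    size_t i = 0;
#ifdef __SSE2__
    const __m128i thr = _mm_set1_epi32(threshold);
    for (; i + 4 <= n; i += 4) {
        __m128i v   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(vals + i));
        __m128i cmp = _mm_cmpgt_epi32(v, thr);  // 0xFFFFFFFF in lanes where v > thr
        // Collapse the four lane sign bits into a 4-bit mask.
        int mask = _mm_movemask_ps(_mm_castsi128_ps(cmp));
        for (int j = 0; j < 4; ++j) out[i + j] = (mask >> j) & 1;
    }
#endif
    for (; i < n; ++i) out[i] = vals[i] > threshold ? 1 : 0;  // scalar tail/fallback
}
```

A result mask like this is what downstream operators consume instead of copying the surviving rows immediately.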
Index‑Based Optimizations – Doris supports prefix indexes and uses bitmap indexes (e.g., RoaringBitmap) to prune rows before reading, reducing I/O. Fixed‑length types (int, float, double) benefit directly from SIMD batch reads, while variable‑length types (strings) require dictionary encoding or conversion to numeric forms first.
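The string case can be sketched as follows; the `DictColumn` type and its methods are illustrative names, not Doris APIs. Dictionary encoding maps each distinct string to an int32 code, so an equality predicate on strings becomes a numeric comparison over a flat code array, which SIMD handles well.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative dictionary-encoded string column.
struct DictColumn {
    std::unordered_map<std::string, int32_t> dict;  // value -> code
    std::vector<int32_t> codes;                     // one code per row

    void append(const std::string& s) {
        auto it = dict.find(s);
        int32_t code = (it == dict.end())
            ? dict.emplace(s, static_cast<int32_t>(dict.size())).first->second
            : it->second;
        codes.push_back(code);
    }

    // Evaluate "col = s": one dictionary lookup, then a numeric scan of codes.
    std::vector<uint8_t> equals(const std::string& s) const {
        std::vector<uint8_t> out(codes.size(), 0);
        auto it = dict.find(s);
        if (it == dict.end()) return out;  // value absent: no row matches
        const int32_t code = it->second;
        for (size_t i = 0; i < codes.size(); ++i)
            out[i] = (codes[i] == code);   // tight int32 loop, auto-vectorizable
        return out;
    }
};
```

The expensive string comparison happens once (the dictionary lookup); the per‑row work is pure integer arithmetic.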
Predicate Push‑Down and Delayed Materialization – Predicate push‑down moves filter evaluation to the storage layer, potentially reducing data volume. Delayed materialization reads non‑predicate columns only after filtering, trading extra seeks for lower data transfer; its effectiveness depends on predicate selectivity and data type costs.
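The two‑phase read behind delayed materialization can be sketched as below (function names are illustrative, not Doris internals): first evaluate the predicate against its own column and keep the surviving row ids, then gather only those rows from the non‑predicate columns.

```cpp
#include <cstdint>
#include <vector>

// Phase 1: evaluate "col >= lo" on the predicate column; keep surviving row ids.
std::vector<uint32_t> eval_predicate(const std::vector<int32_t>& col, int32_t lo) {
    std::vector<uint32_t> selected;
    for (uint32_t i = 0; i < col.size(); ++i)
        if (col[i] >= lo) selected.push_back(i);
    return selected;
}

// Phase 2: materialize another column only for the selected rows. On disk this
// gather costs extra seeks but reads far less data when the predicate is
// selective -- the trade-off the article describes.
std::vector<int64_t> materialize(const std::vector<int64_t>& col,
                                 const std::vector<uint32_t>& selected) {
    std::vector<int64_t> out;
    out.reserve(selected.size());
    for (uint32_t row : selected) out.push_back(col[row]);
    return out;
}
```

With a highly selective predicate the gather touches few rows and wins; with a permissive predicate it degenerates toward reading everything plus the seek overhead, which is why selectivity governs whether delayed materialization pays off.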
Output Optimizations – Not all models need merging; detail tables can stream directly to the execution layer, while key‑based and aggregation models require merging. Batch aggregation using SIMD further improves throughput.
Performance Evaluation – Early testing (SSB benchmark) shows significant storage‑layer speedups and modest SQL gains, though further tuning is needed for end‑to‑end performance.
Conclusion & Recommendations – Effective optimization requires deep code understanding, awareness of SIMD capabilities, and proper use of performance tools; community participation is encouraged for ongoing improvements.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.