
Vectorization in Apache Doris: Design, Implementation, and Future Roadmap

This article explains how Apache Doris adopts CPU‑level vectorization and columnar storage to boost query performance, details the design and current status of its vectorized engine, and outlines future work such as JOIN acceleration, storage‑layer vectorization, import optimization, and extensive SQL function support.


Vectorization transforms single‑value operations into batch operations, leveraging SIMD instructions to achieve several‑fold CPU speedups. From a CPU perspective, a modern processor can load four 32‑bit values into a single 128‑bit register and compute on them in parallel, dramatically reducing instruction count and improving cache locality.

In databases, vectorization similarly processes a batch of rows as columns, allowing operations like scans, filters, and arithmetic to run on contiguous column data rather than row‑by‑row, which improves cache affinity and reduces virtual‑function overhead.

Apache Doris implements vectorization by introducing a column‑oriented in‑memory format (Block) that replaces the traditional RowBatch/Tuple model, redesigning the execution engine to operate on columns, and building a vectorized function framework that supports common operators such as aggregation, sorting, and JOIN.

The current Doris vectorized engine (enabled with set enable_vectorized_engine = true and set batch_size = 4096) already supports sort, aggregation, scan, and union on wide tables, delivering 2‑10× performance gains on typical queries compared with the row‑based engine.

Future plans include full vectorization of JOIN operators (expected 30‑40% speedup), storage‑layer vectorization to eliminate row‑based aggregation and deduplication, import‑pipeline vectorization to reduce format conversions, and expanding SIMD‑accelerated SQL functions (over 200 already vectorized, with more to come).

Additional roadmap items involve refactoring fundamental data types (Date/DateTime, Decimal, HLL) for better memory layout, adding full support for String and Array types, revisiting aggregate‑table semantics, and integrating a cost‑based optimizer to enable deeper inlining and further performance improvements.

Tags: performance optimization, SIMD, SQL engine, columnar storage, vectorization, Apache Doris
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
