
Vectorization in Apache Doris: Design, Implementation, and Future Roadmap

This article explains how Apache Doris adopts CPU‑level vectorization and columnar storage to boost query performance, details the design and current status of its vectorized engine, and outlines future work such as JOIN acceleration, storage‑layer vectorization, import optimization, and extensive SQL function support.


Vectorization transforms single‑value operations into batch operations, leveraging SIMD instructions to achieve several‑fold CPU speedups. From a CPU perspective, a modern processor can load four 32‑bit values into a single 128‑bit register and compute on them in parallel, dramatically reducing instruction count and improving cache locality.

In databases, vectorization similarly processes a batch of rows as columns, allowing operations like scans, filters, and arithmetic to run on contiguous column data rather than row‑by‑row, which improves cache affinity and reduces virtual‑function overhead.

Apache Doris implements vectorization by introducing a column‑oriented in‑memory format (Block) that replaces the traditional RowBatch/Tuple model, redesigning the execution engine to operate on columns, and building a vectorized function framework that supports common operators such as aggregation, sorting, and JOIN.

The current Doris vectorized engine (enabled with set enable_vectorized_engine = true and set batch_size = 4096) already supports sort, aggregation, scan, and union on wide tables, delivering 2‑10× performance gains on typical queries compared with the row‑based engine.

Future plans include full vectorization of JOIN operators (expected 30‑40% speedup), storage‑layer vectorization to eliminate row‑based aggregation and deduplication, import‑pipeline vectorization to reduce format conversions, and expanding SIMD‑accelerated SQL functions (over 200 already vectorized, with more to come).

Additional roadmap items involve refactoring fundamental data types (Date/DateTime, Decimal, HLL) for better memory layout, adding full support for String and Array types, revisiting aggregate‑table semantics, and integrating a cost‑based optimizer to enable deeper inlining and further performance improvements.

Tags: performance optimization, SIMD, SQL engine, columnar storage, vectorization, Apache Doris
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
