Vectorization and Roaring Bitmap Techniques in Database Query Execution
This article explains how classic SQL execution engines use the volcano model and expression trees, discusses their performance drawbacks, introduces vectorized execution to reduce overhead, and describes Roaring Bitmap compression methods with container types for efficient storage and processing of integer sets.
1. Vectorization
Consider the SQL statement select c1, c2 from t where c1 < 100 and c4 = 10. The query is compiled into three operators—tablescan, filter, and projection—each containing expressions such as the filter condition c1 < 100 and c4 = 10.
Classic SQL Execution Engine
Expressions are evaluated using an expression‑tree model, while operators are organized as an operator‑tree (volcano model) linked by iterators. A typical operator interface is:
<code class="language-java">class Operator {
Row next();
void open();
void close();
Operator children[];
}
</code>For example, a projection operator implements:
<code class="language-java">class Projection extends Operator {
Expression projectionExprs[];
Row next() {
Row output = new Row(projectionExprs.length);
Row input = children[0].next();
for (int i = 0; i < projectionExprs.length; i++) {
output.set(i, projectionExprs[i].eval(input));
}
return output;
}
}
</code>The volcano model is simple and decouples operators, but each row incurs many virtual‑function calls and repeated expression evaluations, leading to significant overhead for large data sets.
Advantages and Disadvantages
Advantages: Easy implementation; each operator handles a single responsibility; the planner only needs to assemble the operator tree.
Disadvantages: Expression evaluation triggers a deep call chain for every row; operator next() also incurs virtual calls, which become costly at scale.
Vectorized Execution
Vectorization amortizes the per‑row cost by processing batches of rows (size M) at once, reducing total overhead from C × N to C × N / M. It also creates tighter loops that compilers can optimize more aggressively.
Before vectorization, an integer‑addition expression looks like:
<code class="language-java">class ExpressionIntAdd extends Expression {
Datum eval(Row input) {
int left = input.getInt(leftIndex);
int right = input.getInt(rightIndex);
return new Datum(left + right);
}
}
</code>After vectorization, the same operation processes arrays:
<code class="language-java">class VectorExpressionIntAdd extends VectorExpression {
int[] eval(int[] left, int[] right) {
int[] ret = new int[left.length];
for (int i = 0; i < left.length; i++) {
ret[i] = left[i] + right[i];
}
return ret;
}
}
</code>2. Roaring Bitmap
Standard bitmaps suffer from high memory usage and only support 32‑bit integers. Roaring Bitmap compresses data by partitioning the 32‑bit space into 65,536 buckets (high 16 bits) and storing low 16‑bit values within each bucket using three container types:
Array Container: Stores values in a 16‑bit short array when the bucket holds fewer than 4,096 elements.
Bitmap Container: Switches to a 64‑KB bitmap when a bucket exceeds 4,096 elements, offering better density.
Run Container: Uses run‑length encoding for consecutive values, storing only the start and length of each run.
These adaptive containers provide both memory efficiency and fast set operations, making Roaring Bitmap a popular choice in big‑data systems.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
