Databases 17 min read

How StarRocks Accelerates Low‑Cardinality String Queries with Global Dictionary Optimization

This article explains how StarRocks uses a global dictionary to transform low‑cardinality string columns into integer codes, dramatically improving query performance across scan, filter, aggregation, join, shuffle, and sort phases, and details the construction, maintenance, and practical impact of this optimization.

StarRocks
StarRocks
StarRocks
How StarRocks Accelerates Low‑Cardinality String Queries with Global Dictionary Optimization

Why Introduce Low‑Cardinality Dictionary Optimization

String columns are common in databases, but processing them is often three times slower than integer columns. Replacing strings with integer dictionary codes can significantly boost performance.

Impact on Filter and Aggregation

When a column is dictionary‑encoded, filter predicates compare integer codes instead of raw strings, which is faster. During aggregation, using the encoded values as keys reduces hash‑table comparison costs and memory usage, accelerating the operation.

Need for a Global Dictionary

In a distributed execution engine, the dictionary must be consistent across all nodes. A global dictionary allows direct use of integer codes throughout the query lifecycle, avoiding costly string‑to‑code conversions during network shuffle.

The planner decides whether to apply the global dictionary; queries without network shuffle or without low‑cardinality columns (e.g., insert into t1 select * from t2) skip the optimization.

Efficient Construction of the Global Dictionary

Two common approaches exist:

Users specify low‑cardinality columns in the schema, enabling dictionary construction at load time. This requires schema changes and data re‑import when cardinality changes.

Build the dictionary during data import via a central node. This can become a bottleneck under high concurrency.

StarRocks adopts an adaptive method that automatically identifies low‑cardinality columns using statistics collected by the CBO optimizer, eliminating the need for manual schema hints.

Ensuring Dictionary Correctness

When new strings appear in an imported segment, the system checks three cases:

If the new segment lacks a local dictionary, the column is likely high‑cardinality and the global dictionary is not used.

If the local dictionary is a subset of the global dictionary, the existing global dictionary remains valid.

If the local dictionary contains strings absent from the global dictionary, the global dictionary is rebuilt.

Each import creates a new segment version; queries use MVCC to ensure they see a consistent snapshot, so ongoing queries are unaffected by dictionary updates.

Using the Global Dictionary Efficiently

During query execution, the scan node can replace local dictionary codes with global codes, allowing subsequent operators to work entirely with integers. Decoding back to strings occurs only when required by specific operators or final result rendering.

Examples: select count(*) from lineitem group by l_shipmode; – the entire query can stay in integer form, no decode needed.

select count(*), l_shipmode from lineitem group by l_shipmode;

– a decode node is inserted before output to retrieve the original strings.

For functions like substr(l_shipmode, 1, 2), the system computes the function on the global dictionary once, creates a new dictionary for the results, and maps integer codes directly, avoiding per‑row string decoding.

String Function Optimization

Most string functions produce deterministic outputs for a given input. By applying the function at the dictionary level, the engine computes the result once per distinct string value, then maps the original codes to new codes, achieving substantial speedups even for complex expressions.

Optimization Results

Benchmarking on typical StarRocks workloads shows that enabling low‑cardinality optimization yields an average 3× speedup compared to queries without the optimization. The setting is enabled by default in StarRocks 2.0+: set cbo_enable_low_cardinality_optimize = true; Representative SQL statements and their performance improvements are illustrated in the accompanying charts.

Conclusion

Global dictionary acceleration applies to all query phases.

The CBO optimizer adaptively selects the best strategy without user‑specified schema.

The approach is transparent to users and does not require data re‑import.

It delivers industry‑leading performance for low‑cardinality string processing.

It supports a wide range of string‑related operations, including aggregation, filtering, and deterministic functions.

Filter diagram
Filter diagram
Aggregation diagram
Aggregation diagram
Global dictionary construction
Global dictionary construction
Segment storage structure
Segment storage structure
Dictionary encoding in segment
Dictionary encoding in segment
Performance comparison chart
Performance comparison chart
StarRocksstring optimizationglobal dictionarylow cardinality
StarRocks
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.