How StarRocks Query Cache Supercharges High‑Concurrency Aggregations
StarRocks introduces a Query Cache that stores intermediate aggregation results in memory, enabling reuse across semantically equivalent queries, queries with overlapping partition scans, and append-only workloads. In high-concurrency scenarios it can boost query performance by 3‑17× while reducing CPU and disk load.
StarRocks developed a Query Cache to address the problem of CPU overload caused by many similar aggregation queries during peak periods. The cache stores local intermediate aggregation results in memory, allowing subsequent queries with the same semantics to reuse these results, avoiding redundant computation and disk access.
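Conceptually, the mechanism resembles a memo of per-tablet partial aggregates: the first query computes and stores partial results per tablet, and a semantically equivalent query merges the cached partials instead of rescanning. A minimal Python sketch of that idea (the names and structure are illustrative, not StarRocks internals):

```python
# Illustrative sketch: cache intermediate (per-tablet) aggregation
# results, keyed by a normalized query digest plus the tablet id, so
# any semantically equivalent query maps to the same cache entries.

cache = {}  # (query_digest, tablet_id) -> partial aggregate

def partial_sum(tablet_rows):
    """Expensive per-tablet aggregation (stands in for scan + agg)."""
    return sum(tablet_rows)

def cached_total(query_digest, tablets):
    """Merge cached partials where possible; compute only on a miss."""
    total, hits = 0, 0
    for tablet_id, rows in tablets.items():
        key = (query_digest, tablet_id)
        if key in cache:
            hits += 1          # reuse the intermediate result
        else:
            cache[key] = partial_sum(rows)
        total += cache[key]
    return total, hits

tablets = {0: [1, 2, 3], 1: [4, 5], 2: [6]}
print(cached_total("q1", tablets))  # first run: (21, 0) - all misses
print(cached_total("q1", tablets))  # equivalent rerun: (21, 3) - all hits
```

Because the cached unit is an intermediate result rather than a final result set, the second run still performs the final merge, but skips the expensive per-tablet work entirely.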
Key Characteristics
The cache stores intermediate aggregation results rather than final query results, which increases hit rates.
Even queries that are not exactly identical can benefit from the cache.
In high‑concurrency tests, the cache improves query efficiency by 3‑17 times.
Applicable Scenarios
Semantically equivalent queries.
Queries with overlapping partition scans (predicate‑based query splitting).
Append‑only data scenarios where only new data is added.
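The overlapping-partition case rests on predicate-based query splitting: a time-range predicate is cut along partition boundaries, so sub-ranges shared with an earlier query can be answered from cache while only the new sub-ranges are computed. A rough sketch of the splitting idea (the boundary logic here is illustrative, not the actual planner code):

```python
def split_by_partitions(lo, hi, boundaries):
    """Split the half-open scan range [lo, hi) along partition
    boundaries so each piece can be cached and reused independently."""
    cuts = sorted(b for b in boundaries if lo < b < hi)
    edges = [lo] + cuts + [hi]
    return list(zip(edges, edges[1:]))

# Hourly values with daily partitions: two queries whose time ranges
# overlap share the middle pieces, which is where cache hits come from.
day = 24
q1 = split_by_partitions(0, 3 * day, [day, 2 * day, 3 * day])
q2 = split_by_partitions(day, 4 * day, [2 * day, 3 * day, 4 * day])
shared = set(q1) & set(q2)
print(shared)  # the per-partition pieces both queries can reuse
```

Here q2 recomputes only its newest day; the two days it shares with q1 come straight from the cache.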
Typical Use Cases
Frequent aggregation queries on wide tables or star‑schema joins.
Similar queries with slight variations.
Workloads where data is only appended, without updates.
Best‑Practice Recommendations
Select a dedicated date/datetime column as the partition key, preferably monotonically increasing.
Choose an appropriate partition size; overly large or small partitions reduce cache hit rates.
Ensure the number of buckets is at least several dozen; otherwise the cache may not be effective when pipeline_dop exceeds the bucket count.
Keep the replication factor in mind: each BE node maintains its own local cache, so a query may need to run about as many times as the replication factor before the cache is warm on every BE node.
Enabling and Using Query Cache
By default the cache is disabled. Enable it with session variables and adjust pipeline_dop for single‑replica clusters:
set enable_query_cache=true;
set pipeline_dop=1;
set is_report_success=true;
set enable_profile=true;

After running a baseline query, you can inspect cache statistics via the BE API:
curl http://127.0.0.1:8040/api/query_cache/stat

{
  "capacity": 536870912,
  "usage": 3889,
  "usage_ratio": 0.000007243826985359192,
  "lookup_count": 42,
  "hit_count": 0,
  "hit_ratio": 0.0
}

Running a semantically equivalent query after the first populates the cache and yields a hit:
-- Q2: semantically equivalent query
SELECT (ifnull(sum(murmur_hash3_32(hour)), 0) + ifnull(sum(murmur_hash3_32(k0)), 0) + ifnull(sum(murmur_hash3_32(__c_0)), 0)) AS fingerprint
FROM (SELECT date_trunc('hour', ts) AS hour, k0, sum(v1) AS __c_0
      FROM t0
      WHERE ts BETWEEN '2022-01-03 00:00:00' AND '2022-01-03 23:59:59'
      GROUP BY date_trunc('hour', ts), k0) AS t;

Cache statistics then show a hit count of 2 and a hit ratio of ~4.5%.
Performance Evaluation
Two benchmark suites were run:
Wide‑table test (SSB 100 GB, 10 concurrent queries) – cache hit delivered up to 10× speedup.
Star‑schema test (SSB 100 GB multi‑table, 10 concurrent queries) – cache hit delivered up to 17× speedup.
These results demonstrate that caching intermediate aggregation results can dramatically reduce query latency, especially for high‑concurrency analytical workloads.
Conclusion
StarRocks Query Cache, by caching intermediate aggregation data rather than final results, achieves higher reuse rates and flexibility across a variety of query patterns. It is particularly effective for frequent aggregation queries on large datasets, queries with overlapping partitions, and append‑only workloads, providing substantial performance gains and improved scalability.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.