How StarRocks Query Cache Supercharges High‑Concurrency Aggregations
StarRocks introduces a Query Cache that stores intermediate aggregation results in memory, enabling reuse across semantically equivalent queries, queries with overlapping partition scans, and append-only workloads. In high-concurrency scenarios it can boost query performance by 3‑17× while reducing CPU and disk load.
StarRocks developed a Query Cache to address the problem of CPU overload caused by many similar aggregation queries during peak periods. The cache stores local intermediate aggregation results in memory, allowing subsequent queries with the same semantics to reuse these results, avoiding redundant computation and disk access.
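Conceptually, the mechanism resembles a memo of per-tablet partial aggregates: the first query computes and stores partial results per tablet, and a semantically equivalent query merges the cached partials instead of rescanning. A minimal Python sketch of that idea (the names and structure are illustrative, not StarRocks internals):

```python
# Illustrative sketch: cache intermediate (per-tablet) aggregation
# results, keyed by a normalized query digest plus the tablet id, so
# any semantically equivalent query maps to the same cache entries.

cache = {}  # (query_digest, tablet_id) -> partial aggregate

def partial_sum(tablet_rows):
    """Expensive per-tablet aggregation (stands in for scan + agg)."""
    return sum(tablet_rows)

def cached_total(query_digest, tablets):
    """Merge cached partials where possible; compute only on a miss."""
    total, hits = 0, 0
    for tablet_id, rows in tablets.items():
        key = (query_digest, tablet_id)
        if key in cache:
            hits += 1          # reuse the intermediate result
        else:
            cache[key] = partial_sum(rows)
        total += cache[key]
    return total, hits

tablets = {0: [1, 2, 3], 1: [4, 5], 2: [6]}
print(cached_total("q1", tablets))  # first run: (21, 0) - all misses
print(cached_total("q1", tablets))  # equivalent rerun: (21, 3) - all hits
```

Because the cached unit is an intermediate result rather than a final result set, the second run still performs the final merge, but skips the expensive per-tablet work entirely.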
Key Characteristics
The cache stores intermediate aggregation results rather than final query results, which increases hit rates.
Even queries that are not exactly identical can benefit from the cache.
In high‑concurrency tests, the cache improves query efficiency by 3‑17 times.
Applicable Scenarios
Semantically equivalent queries.
Queries with overlapping partition scans (predicate‑based query splitting).
Append‑only data scenarios where only new data is added.
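The overlapping-partition case rests on predicate-based query splitting: a time-range predicate is cut along partition boundaries, so sub-ranges shared with an earlier query can be answered from cache while only the new sub-ranges are computed. A rough sketch of the splitting idea (the boundary logic here is illustrative, not the actual planner code):

```python
def split_by_partitions(lo, hi, boundaries):
    """Split the half-open scan range [lo, hi) along partition
    boundaries so each piece can be cached and reused independently."""
    cuts = sorted(b for b in boundaries if lo < b < hi)
    edges = [lo] + cuts + [hi]
    return list(zip(edges, edges[1:]))

# Hourly values with daily partitions: two queries whose time ranges
# overlap share the middle pieces, which is where cache hits come from.
day = 24
q1 = split_by_partitions(0, 3 * day, [day, 2 * day, 3 * day])
q2 = split_by_partitions(day, 4 * day, [2 * day, 3 * day, 4 * day])
shared = set(q1) & set(q2)
print(shared)  # the per-partition pieces both queries can reuse
```

Here q2 recomputes only its newest day; the two days it shares with q1 come straight from the cache.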
Typical Use Cases
Frequent aggregation queries on wide tables or star‑schema joins.
Similar queries with slight variations.
Workloads where data is only appended, without updates.
Best‑Practice Recommendations
Select a dedicated date/datetime column as the partition key, preferably monotonically increasing.
Choose an appropriate partition size; overly large or small partitions reduce cache hit rates.
Ensure the number of buckets is at least several dozen; otherwise the cache may not be effective when pipeline_dop exceeds the bucket count.
Keep the replication factor in mind: each BE node maintains its own local cache, so a query may need to run about as many times as the replication factor before the cache is warm on every BE node.
Enabling and Using Query Cache
By default the cache is disabled. Enable it with session variables and adjust pipeline_dop for single‑replica clusters:
set enable_query_cache=true;
set pipeline_dop=1;
set is_report_success=true;
set enable_profile=true;

After running a baseline query, you can inspect cache statistics via the BE API:
curl http://127.0.0.1:8040/api/query_cache/stat

{
  "capacity": 536870912,
  "usage": 3889,
  "usage_ratio": 0.000007243826985359192,
  "lookup_count": 42,
  "hit_count": 0,
  "hit_ratio": 0.0
}

Running a semantically equivalent query after the first populates the cache and yields a hit:
-- Q2: semantically equivalent query
SELECT (ifnull(sum(murmur_hash3_32(hour)), 0) + ifnull(sum(murmur_hash3_32(k0)), 0) + ifnull(sum(murmur_hash3_32(__c_0)), 0)) AS fingerprint
FROM (SELECT date_trunc('hour', ts) AS hour, k0, sum(v1) AS __c_0
      FROM t0
      WHERE ts BETWEEN '2022-01-03 00:00:00' AND '2022-01-03 23:59:59'
      GROUP BY date_trunc('hour', ts), k0) AS t;

Cache statistics then show a hit count of 2 and a hit ratio of ~4.5%.
Performance Evaluation
Two benchmark suites were run:
Wide‑table test (SSB 100 GB, 10 concurrent queries) – cache hit delivered up to 10× speedup.
Star‑schema test (SSB 100 GB multi‑table, 10 concurrent queries) – cache hit delivered up to 17× speedup.
These results demonstrate that caching intermediate aggregation results can dramatically reduce query latency, especially for high‑concurrency analytical workloads.
Conclusion
StarRocks Query Cache, by caching intermediate aggregation data rather than final results, achieves higher reuse rates and flexibility across a variety of query patterns. It is particularly effective for frequent aggregation queries on large datasets, queries with overlapping partitions, and append‑only workloads, providing substantial performance gains and improved scalability.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.