Databases 21 min read

Boost System Performance: Using Locality Principles to Cut Database Queries

This article explains the locality principle—time and space locality—and shows how applying these concepts to caching and data access in distributed systems can dramatically reduce database query volume, improve latency, and achieve up to 84% performance gains while managing memory and GC overhead.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
Boost System Performance: Using Locality Principles to Cut Database Queries

1. Locality Principle

The locality principle states that programs exhibit temporal and spatial locality: during execution, they tend to access a limited region of code and data repeatedly over short periods.

Time locality

If a instruction or data item is accessed, it is likely to be accessed again soon. This is common in function calls where recent parameters or local variables are reused.

Space locality

When a memory location is accessed, nearby locations are likely to be accessed shortly thereafter, as seen in loops iterating over arrays.

2. Caching

Because of locality, placing hot data in faster storage (caches) can greatly improve performance by replacing expensive memory or network I/O with quicker accesses. Cache effectiveness depends on the strength of a program's locality.

Example: QuickSort benefits from good spatial locality, while HeapSort suffers from poor locality, leading to worse performance on modern CPUs.

3. Applying Locality in Distributed Systems

To exploit locality at the system level, we can adopt cache strategies similar to CPU caches: separate instruction and data caches, use multi‑level caches, and combine exclusive and shared caches.

3.1 Exploiting Time Locality

On a single machine, such as Apple M1, the cache hierarchy includes L1 instruction (128 KB × 4), L1 data (64 KB × 4), L2 (4 MB low‑power, 12 MB high‑performance), and a shared SLC (8 MB). These layers separate code and data, use multiple levels, and combine exclusive and shared storage.

In a distributed cluster, we can mirror these ideas:

Separate caches for configuration data and user data.

Multi‑level caching: local in‑memory cache plus distributed cache (e.g., Tair).

Exclusive per‑node cache combined with shared cluster cache.

3.2 Exploiting Space Locality

CPU caches load data in cache lines (e.g., 128 bytes) rather than single bytes, leveraging DRAM burst mode and tolerating some waste for higher throughput. Similarly, we can load larger contiguous data blocks from the database to benefit from spatial locality.

4. Practical Case: Consumer Operations Strategy Platform

The platform stores user state in a table keyed by (userId, graphId, vertexId). Query volume reaches >10⁶ QPS, with massive write traffic.

4.1 Challenges

Huge data volume (billions of rows).

High write TPS (>10 k).

Even higher query QPS (>100 k).

4.2 Measures

Use Lindorm (LSM‑tree) with TTL for massive data and automatic cleanup.

Batch writes, merge state data, and trim unnecessary records before write.

Apply multi‑level caching: local cache plus distributed cache.

4.3 Exploring Space Locality

Two cache‑loading granularities were evaluated:

(userId, graphId, vertexId) : loads a single row per query – low cache hit.

(userId, graphId) : loads all vertices of a user’s graph – moderate hit.

(userId) : loads all vertices of all graphs for a user – high hit but larger memory use.

Results:

Switching from per‑row to (userId, graphId) reduced query volume to 23.5% of original.

Average query latency dropped to 0.35 ms (65% overall improvement).

4.4 Pushing to the Limit

Further expanding to (userId) granularity yielded:

Cache load volume only 4.12% of query volume.

Single load latency increased to ~3.9 ms, but still acceptable.

Overall query latency reduced to 0.16 ms (84% improvement).

4.5 Long‑Term Performance

After traffic growth, the (userId) strategy achieved:

Cache load volume 1.95% of query volume, cache hit rate 97.95%.

Single load latency 1.17 ms.

Average query latency 0.02 ms after amortizing load cost.

4.6 Risks and Monitoring

Potential risks include memory pressure, uneven user data distribution, and GC overhead. Monitoring with Prometheus tracks cache usage, hit rate, load latency, failure rate, batch write/read sizes, and GC performance. Metrics show healthy operation: cache items per node ~600, per‑user state items ~12, GC pauses ~20 ms, heap usage ~20% after Young GC.

4.7 Review

Defining “adjacent” data uses the left‑most prefix of the primary key, ensuring physical adjacency in storage. The minimum cache load unit can be flexible (userId, graphId) or (userId) because memory constraints are looser than CPU cache line limits. In practice, setting an upper bound on loaded rows balances locality benefits with memory usage.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsPerformance Optimizationdatabasecachinglocality
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.