Mastering Cache Layers: From Distributed to CPU Cache for Performance Gains
This article explores how introducing various cache layers—from database and distributed caches to local and CPU caches—bridges the gap between fast CPUs and slow I/O, detailing performance benefits, cache miss handling, consistency strategies, memory management, and techniques to avoid false sharing.
1. Cache and Multi-level Cache
Caching shows up in performance optimization at every level, from operating-system caches to databases, distributed caches, and local caches. All share a simple purpose: bridging the huge gap between the CPU's compute speed and the slow read/write latency of I/O.
1.1 Cache Introduction
When traffic is low, the database can handle the load and the application talks directly to the DB. As traffic grows, DB queries become a bottleneck, prompting the introduction of a distributed cache to reduce DB pressure and increase QPS. When the distributed cache itself becomes a bottleneck, a local cache is added to offload the distributed layer and cut network and serialization overhead.
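The layered read path described above can be sketched as follows. This is a minimal illustration, not a production design: `TieredCache` is a hypothetical name, the "distributed" tier is just a `Map` standing in for a remote cache client, and `dbLoader` stands in for a DB query.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical two-tier read path: local cache -> distributed cache -> database.
class TieredCache {
    private final Map<String, String> local = new ConcurrentHashMap<>();
    private final Map<String, String> distributed;   // stand-in for a remote cache client
    private final Function<String, String> dbLoader; // stand-in for a DB query

    TieredCache(Map<String, String> distributed, Function<String, String> dbLoader) {
        this.distributed = distributed;
        this.dbLoader = dbLoader;
    }

    String get(String key) {
        String value = local.get(key);          // 1. in-process lookup, no network hop
        if (value != null) {
            return value;
        }
        value = distributed.get(key);           // 2. distributed cache
        if (value == null) {
            value = dbLoader.apply(key);        // 3. fall back to the database
            if (value == null) {
                return null;                    // key does not exist anywhere
            }
            distributed.put(key, value);        // populate the distributed tier
        }
        local.put(key, value);                  // populate the local tier
        return value;
    }
}
```

A real local tier would also need a size bound and an expiry policy; both are omitted here to keep the lookup order visible.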
1.2 Read/Write Performance Improvement
Caching improves read/write performance by reducing I/O operations. Disk and network I/O are far slower than memory access.
Read optimization: A cache hit returns data directly, bypassing I/O and lowering read cost.
Write optimization: Writes are batched in a buffer, allowing the I/O device to process them in bulk and reducing write cost.
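As a rough sketch of the write-side idea, the class below buffers records in memory and hands them to the I/O layer in batches. `BatchingWriter` and its `sink` callback are hypothetical names; the sink stands in for whatever bulk write the device or service offers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers individual writes and flushes them to the sink in batches.
class BatchingWriter {
    private final int batchSize;
    private final Consumer<List<String>> sink; // hypothetical bulk I/O call
    private final List<String> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    synchronized void write(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();                           // one I/O call for the whole batch
        }
    }

    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));  // hand over a copy, then reuse the buffer
        buffer.clear();
    }
}
```

The trade-off is durability: records still in the buffer are lost if the process dies before `flush()` runs, so real systems usually pair batching with a time-based flush or a log.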
1.3 Cache Miss
Cache misses are inevitable: a cache holds only the hot subset of the data within limited capacity, trading hit rate against cost. Most caches use the LRU (Least Recently Used) policy, or an approximation of it, to evict the keys that have gone longest without being accessed.
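For a small in-process cache, strict LRU can be sketched with the JDK's `LinkedHashMap` in access order. `LruCache` is a hypothetical name, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Strict LRU on top of LinkedHashMap: accessOrder=true moves an entry to the
// tail on every get/put, so the head is always the least recently used entry.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // third argument enables access ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the head entry.
        return size() > capacity;
    }
}
```

With a capacity of 2, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `a` was touched more recently.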
Approximate LRU
Strict LRU is expensive at large scale: it needs either a globally ordered structure that is updated on every access or a scan over all keys (e.g., 500 k keys in Redis) to find the oldest one. Approximate LRU, which Redis uses, samples a few keys (five by default) and evicts the sampled key with the longest idle time. It comes close to strict LRU at far lower cost, though it may occasionally evict a recently used key.
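The sampling idea can be illustrated with the sketch below. It is not Redis's implementation: `SampledLruCache` is a hypothetical class, it uses a logical clock instead of wall-clock idle time, and it is single-threaded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Approximate LRU: on eviction, sample a few keys and drop the least recently used of them.
class SampledLruCache {
    private final int capacity;
    private final int sampleSize;
    private final Random random;
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> lastAccess = new HashMap<>();
    private final List<String> keys = new ArrayList<>(); // allows O(1) random sampling
    private long clock = 0;                              // logical access counter

    SampledLruCache(int capacity, int sampleSize, long seed) {
        this.capacity = capacity;
        this.sampleSize = sampleSize;
        this.random = new Random(seed);
    }

    String get(String key) {
        String value = values.get(key);
        if (value != null) {
            lastAccess.put(key, ++clock);
        }
        return value;
    }

    void put(String key, String value) {
        if (!values.containsKey(key)) {
            if (values.size() >= capacity) {
                evictOne();
            }
            keys.add(key);
        }
        values.put(key, value);
        lastAccess.put(key, ++clock);
    }

    int size() {
        return values.size();
    }

    private void evictOne() {
        int victim = random.nextInt(keys.size());
        for (int i = 1; i < sampleSize; i++) {
            int candidate = random.nextInt(keys.size());
            if (lastAccess.get(keys.get(candidate)) < lastAccess.get(keys.get(victim))) {
                victim = candidate; // older access wins
            }
        }
        String victimKey = keys.get(victim);
        int last = keys.size() - 1;
        keys.set(victim, keys.get(last)); // swap-remove keeps sampling O(1)
        keys.remove(last);
        values.remove(victimKey);
        lastAccess.remove(victimKey);
    }
}
```

No ordering structure is maintained across all keys, which is where the cost saving comes from; the price is that the evicted key is only the oldest among the sampled ones.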
Avoid Short‑Term Massive Expiration
When bulk‑loading data into the cache (e.g., via Excel upload), identical TTLs can cause a sudden surge of cache misses. Adding a random component to TTL (e.g., TTL = 8h + random(8000)ms) spreads expirations over time.
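A minimal sketch of the jittered TTL, using the 8 h base and 8000 ms spread from the example above. `TtlJitter.withJitter` is a hypothetical helper, not part of any cache client.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Adds a random offset to a base TTL so bulk-loaded keys do not all expire together.
class TtlJitter {
    static Duration withJitter(Duration base, long maxJitterMillis) {
        // nextLong(bound) returns a value in [0, bound), so +1 makes the maximum inclusive.
        long jitter = ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
        return base.plusMillis(jitter);
    }
}
```

Calling `TtlJitter.withJitter(Duration.ofHours(8), 8000)` yields a TTL between 8 h and 8 h + 8 s. For very large batches a wider spread may be needed, since the expirations are only distributed over the jitter window.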
1.4 Cache Consistency
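Systems should strive for consistency between the DB and the cache. The cache-aside pattern is the most common approach: reads populate the cache on a miss, and writes update the DB and then invalidate the cached entry. Avoid unconventional variants, such as writing the new value into the cache before or after the DB update instead of invalidating it, which increase the risk of inconsistency.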
Typical cache design patterns include:
Cache-aside
Read-through
Write‑through
Write‑back
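Of these patterns, cache-aside is the one the application implements itself, so it is worth a small sketch. `CacheAsideRepository` is a hypothetical class, and both the cache and the DB are plain maps standing in for real clients.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache-aside: the application reads through the cache and invalidates it on writes.
class CacheAsideRepository {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> db; // stand-in for the database

    CacheAsideRepository(Map<String, String> db) {
        this.db = db;
    }

    String read(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value;              // cache hit
        }
        value = db.get(key);           // cache miss: load from the source of truth
        if (value != null) {
            cache.put(key, value);     // populate for later reads
        }
        return value;
    }

    void write(String key, String value) {
        db.put(key, value);            // 1. update the database first
        cache.remove(key);             // 2. invalidate, do not overwrite, the cached entry
    }

    boolean isCached(String key) {
        return cache.containsKey(key);
    }
}
```

Invalidating rather than rewriting the entry means the next read reloads whatever the DB currently holds, which keeps concurrent writers from leaving the cache with the loser's value.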
Even with cache-aside, inconsistencies can arise from cache-invalidation failures or timing issues. For example, thread A misses the cache and reads the old value from the DB while thread B updates the DB and invalidates the cache; if B's invalidation completes before A writes its value back, A repopulates the cache with outdated data, which then stays there until the entry expires or is invalidated again.
Mitigation techniques include delayed double delete and CDC‑based synchronization, both of which increase system complexity and must be weighed against business tolerance.
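Delayed double delete: after updating the DB and invalidating the cache, enqueue a second, delayed delete for the same key; a background thread processes the queue and removes any stale value that a concurrent reader wrote back in the meantime.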
CDC sync: capture DB changes (e.g., via Canal), publish to Kafka, and let consumers trigger cache invalidation.
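The delayed double delete can be sketched with a `ScheduledExecutorService` as the delayed queue. This is an illustration under simplifying assumptions: `DelayedDoubleDelete` is a hypothetical class, the cache and DB are in-memory maps, and a failed second delete is not retried.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Update the DB, invalidate the cache, then invalidate again after a delay.
class DelayedDoubleDelete {
    private final Map<String, String> cache; // stand-in for the cache client
    private final Map<String, String> db;    // stand-in for the database
    private final long delayMillis;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    DelayedDoubleDelete(Map<String, String> cache, Map<String, String> db, long delayMillis) {
        this.cache = cache;
        this.db = db;
        this.delayMillis = delayMillis;
    }

    ScheduledFuture<?> update(String key, String value) {
        db.put(key, value);   // 1. write the source of truth
        cache.remove(key);    // 2. first delete
        // 3. second delete: clears a stale value that a concurrent reader may have
        //    written back between steps 1 and 2 completing and now.
        return scheduler.schedule(() -> { cache.remove(key); }, delayMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        scheduler.shutdown();
    }
}
```

The delay should exceed the time a slow reader needs to load from the DB and write back to the cache; choosing it is a judgment call that depends on the workload.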
2. From Heap Memory to Direct Memory
2.1 Direct Memory Introduction
Java caches can reside in heap memory or off‑heap (direct) memory. Heap caches suffer from GC pauses, especially when the cache size is large. Off‑heap caches avoid GC pressure because only references remain in the heap; the actual objects live outside, reducing GC impact.
However, off‑heap memory requires explicit allocation and deallocation, introducing risks of OOM or memory leaks, and data must be serialized for access.
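The serialization cost can be seen in a small sketch that stores one string in a direct `ByteBuffer`. `OffHeapValue` is a hypothetical class; note that a direct buffer allocated this way is released when its `ByteBuffer` object is garbage-collected, whereas off-heap caches that manage raw memory themselves must free it explicitly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Stores a string outside the Java heap as a length-prefixed UTF-8 byte sequence.
class OffHeapValue {
    private final ByteBuffer buffer; // only this small wrapper object lives on the heap

    OffHeapValue(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8); // serialize
        buffer = ByteBuffer.allocateDirect(Integer.BYTES + bytes.length);
        buffer.putInt(bytes.length).put(bytes);
        buffer.flip(); // switch from writing to reading
    }

    String read() {
        ByteBuffer view = buffer.duplicate(); // independent position, shared content
        byte[] bytes = new byte[view.getInt()];
        view.get(bytes);                      // copy back onto the heap
        return new String(bytes, StandardCharsets.UTF_8); // deserialize
    }
}
```

Every `read()` copies and decodes the bytes, which is the access cost that off-heap caches trade for lower GC pressure.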
2.2 Direct Memory Management
Allocating and freeing off-heap memory is expensive, so typical implementations allocate large chunks up front and sub-allocate smaller blocks from them. Reclaimed blocks are usually returned to a pool for reuse rather than released. Allocators such as jemalloc (used by Redis and FreeBSD) provide efficient allocation and fragmentation control.
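The chunk-and-pool approach can be sketched with a single direct buffer carved into fixed-size blocks. `SlabPool` is a hypothetical class and far simpler than jemalloc: it has one size class and no fragmentation handling.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// One large direct allocation, sub-allocated as fixed-size blocks tracked by a free list.
class SlabPool {
    private final ByteBuffer slab;
    private final int blockSize;
    private final Deque<Integer> freeOffsets = new ArrayDeque<>();

    SlabPool(int blockSize, int blockCount) {
        this.blockSize = blockSize;
        this.slab = ByteBuffer.allocateDirect(blockSize * blockCount); // single expensive call
        for (int i = 0; i < blockCount; i++) {
            freeOffsets.push(i * blockSize);
        }
    }

    /** Returns the offset of a free block, or -1 if the pool is exhausted. */
    synchronized int allocate() {
        Integer offset = freeOffsets.poll();
        return offset == null ? -1 : offset;
    }

    /** Returns a block to the pool for reuse; the memory itself is never released. */
    synchronized void free(int offset) {
        freeOffsets.push(offset);
    }

    synchronized int available() {
        return freeOffsets.size();
    }

    /** A view limited to one block; writes go straight to the shared slab. */
    ByteBuffer view(int offset) {
        ByteBuffer window = slab.duplicate();
        window.position(offset);
        window.limit(offset + blockSize);
        return window.slice();
    }
}
```

The caller is responsible for not using a block after freeing it, which is exactly the kind of leak and corruption risk that manual off-heap management introduces.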
3. CPU Cache
Beyond software caches, CPU caches (L1, L2, L3) further affect performance. A cache line is typically 64 bytes; data is fetched from main memory in line‑sized chunks, making sequential access patterns highly efficient.
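The effect of access order can be tried out with the two traversals below, which compute the same sum over a rectangular matrix. `TraversalOrder` is a hypothetical class; in Java each row of an `int[][]` is its own contiguous array, so the row-major loop walks memory sequentially while the column-major loop jumps between rows on every step.

```java
// Two traversals of the same rectangular matrix with different memory access patterns.
class TraversalOrder {
    // Inner loop walks one row: consecutive ints, several per cache line.
    static long sumRowMajor(int[][] matrix) {
        long sum = 0;
        for (int row = 0; row < matrix.length; row++) {
            for (int col = 0; col < matrix[row].length; col++) {
                sum += matrix[row][col];
            }
        }
        return sum;
    }

    // Inner loop walks one column: each step touches a different row array.
    // Assumes every row has the same length as the first one.
    static long sumColumnMajor(int[][] matrix) {
        long sum = 0;
        int cols = matrix.length == 0 ? 0 : matrix[0].length;
        for (int col = 0; col < cols; col++) {
            for (int row = 0; row < matrix.length; row++) {
                sum += matrix[row][col];
            }
        }
        return sum;
    }
}
```

Both methods return the same result; on matrices much larger than the CPU caches the row-major version is usually faster, though the gap depends on hardware and JIT behavior and is best measured with a benchmark harness rather than assumed.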
False Sharing
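When multiple threads on different cores modify different variables that reside on the same cache line, each write invalidates that line in the other cores' caches, forcing them to reload it even though they never touch each other's data. This phenomenon is called false sharing, and it degrades performance.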
Example of false sharing:
class NoPadding {
    long no0;
    long no1;
}
Solution: pad the fields so they occupy separate cache lines.
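class Padding {
    long p1, p2, p3, p4, p5, p6, p7;        // 56 bytes of padding before no0
    volatile long no0 = 0L;
    long p9, p10, p11, p12, p13, p14, p15;  // 56 bytes, so no0 and no1 are 64 bytes apart
    volatile long no1 = 0L;
}
Manual padding is best-effort, because the JVM is free to reorder fields within an object. The JVM can also add padding automatically using the @sun.misc.Contended annotation (jdk.internal.vm.annotation.Contended in JDK 9 and later; outside JDK classes it takes effect only with -XX:-RestrictContended):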
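@sun.misc.Contended static final class CounterCell {
    volatile long value;
    CounterCell(long x) { value = x; }
}
JDK concurrency structures such as LongAdder (through its internal Cell class) and ConcurrentHashMap's CounterCell annotate their striped counter cells with @Contended. They avoid false sharing and spread updates across several cells, which gives high throughput under contention.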
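A short usage sketch of `LongAdder` from several threads follows. `StripedCounterDemo` is a hypothetical class; it only shows that the striped cells add up to the correct total and makes no throughput claim.

```java
import java.util.concurrent.atomic.LongAdder;

// Several threads increment one LongAdder; sum() combines the internal cells.
class StripedCounterDemo {
    static long countWithThreads(int threads, int incrementsPerThread) {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < incrementsPerThread; n++) {
                    adder.increment(); // contended updates are spread over padded cells
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return adder.sum(); // exact here because all writers have finished
    }
}
```

`sum()` is not an atomic snapshot while updates are still in flight, so `LongAdder` suits statistics-style counters better than values that must be read and acted on atomically.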
4. Summary
As interface response time requirements rise, cache usage—distributed, local, and CPU caches—becomes a cornerstone of performance optimization. When adopting any caching technique, understand its concepts, trade‑offs, and pitfalls to design robust systems from the start. For deeper details, refer to "Redis Design and Implementation", the Disruptor design docs, and related source code.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
