How Multi‑Level Caching Boosts Performance and Avoids Common Pitfalls
This article explores the role of multi‑level caching—from distributed and local caches to direct memory and CPU cache—detailing performance gains, cache‑miss handling, consistency challenges, false sharing issues, and practical mitigation techniques such as approximate LRU, random TTL, delayed double‑delete, padding, and lock‑free designs.
Caching is a widely used technique for performance optimization, bridging the gap between the CPU's high compute power and the slow I/O of storage. Introducing a cache component improves throughput but also adds complexity that must be carefully managed.
1 Cache and Multi‑Level Cache
1.1 Introducing Cache
When traffic is low, the database can handle all reads and writes directly. As traffic grows, a distributed cache is added to reduce database load and increase QPS. When the distributed cache becomes a bottleneck, a local (in‑process) cache is introduced to offload the distributed layer and cut network and serialization overhead.
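As a minimal sketch of this layered read path, the class below checks the local cache first, then the distributed cache, then the database. Caffeine and Jedis are used here only as example clients, and loadFromDatabase is a placeholder; the article does not prescribe specific libraries.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.Jedis;

// Illustrative two-level read path: local (in-process) cache -> distributed cache -> database.
final class TwoLevelCache {
    private final Cache<String, String> local =
            Caffeine.newBuilder().maximumSize(10_000).build();
    private final Jedis redis = new Jedis("localhost", 6379);

    String get(String key) {
        String value = local.getIfPresent(key);          // 1. local cache, no network hop
        if (value != null) return value;

        value = redis.get(key);                          // 2. distributed cache
        if (value == null) {
            value = loadFromDatabase(key);               // 3. database (placeholder)
            redis.setex(key, 8 * 3600, value);           // back-fill the distributed cache
        }
        local.put(key, value);                           // back-fill the local cache
        return value;
    }

    private String loadFromDatabase(String key) { return "value-for-" + key; }
}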
1.2 Read/Write Performance Gains
Caching improves performance by reducing I/O operations: disk and network access latencies are orders of magnitude higher than memory access latency, so every request served from memory avoids a far more expensive trip to disk or across the network.
Read optimization: a cache hit returns data directly, bypassing I/O.
Write optimization: writes are buffered and flushed in batches, allowing the I/O device to process them efficiently.
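As an illustration of write buffering, the sketch below (hypothetical names, with flushBatch standing in for the real I/O call) accumulates writes in an in-memory queue and flushes them on a fixed interval, so the underlying device sees fewer, larger operations.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Illustrative write buffer: writes are queued in memory and flushed in batches.
final class BatchingWriter {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    BatchingWriter() {
        scheduler.scheduleAtFixedRate(this::flush, 100, 100, TimeUnit.MILLISECONDS);
    }

    void write(String record) {
        buffer.offer(record);            // fast in-memory append
    }

    private void flush() {
        List<String> batch = new ArrayList<>();
        buffer.drainTo(batch);
        if (!batch.isEmpty()) {
            flushBatch(batch);           // single I/O call for the whole batch
        }
    }

    private void flushBatch(List<String> batch) {
        System.out.println("flushing " + batch.size() + " records");
    }
}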
1.3 Cache Miss Handling
Cache misses are inevitable because the cache must keep hot keys within limited capacity. Most caches use LRU eviction, but strict LRU is expensive to maintain. Redis instead implements an approximate LRU: it randomly samples five keys (the default maxmemory-samples value) and evicts the one with the longest idle time. This is far cheaper and close in effectiveness, though it may occasionally evict a recently used key.
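The sampling idea is easy to reproduce in application code. Below is a minimal, illustrative sketch of an approximate-LRU cache (not Redis's actual implementation, and not thread-safe): on eviction it samples up to five random keys and removes the one that has been idle longest.
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;

// Simplified approximate LRU: each entry records its last-access time; on eviction
// a few random keys are sampled and the one idle the longest is removed.
final class ApproxLruCache<K, V> {
    private static final int SAMPLE_SIZE = 5;
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> lastAccess = new HashMap<>();

    ApproxLruCache(int capacity) { this.capacity = capacity; }

    V get(K key) {
        V v = values.get(key);
        if (v != null) lastAccess.put(key, System.nanoTime());
        return v;
    }

    void put(K key, V value) {
        if (!values.containsKey(key) && values.size() >= capacity) evictOne();
        values.put(key, value);
        lastAccess.put(key, System.nanoTime());
    }

    private void evictOne() {
        List<K> keys = new ArrayList<>(values.keySet());
        K victim = null;
        long oldest = Long.MAX_VALUE;
        for (int i = 0; i < SAMPLE_SIZE; i++) {
            K candidate = keys.get(ThreadLocalRandom.current().nextInt(keys.size()));
            long idleSince = lastAccess.get(candidate);
            if (idleSince < oldest) { oldest = idleSince; victim = candidate; }
        }
        values.remove(victim);
        lastAccess.remove(victim);
    }
}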
1.4 Avoiding Massive Expiration
When bulk data is loaded into the cache (e.g., via Excel import), identical TTLs cause a sudden surge of traffic to the database after expiration. Adding a random offset to TTL, such as TTL = 8h + random(8000)ms, spreads expirations over time.
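As a small sketch of the jitter calculation above, the helper below (hypothetical name) derives a randomized TTL of 8 hours plus up to 8000 ms:
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Spread expirations by adding a random offset to the base TTL.
final class TtlJitter {
    static Duration randomizedTtl() {
        long jitterMs = ThreadLocalRandom.current().nextLong(8000);
        return Duration.ofHours(8).plusMillis(jitterMs);   // TTL = 8h + random(8000)ms
    }
}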
1.5 Cache Consistency
The cache‑aside pattern is commonly used, but inconsistencies can arise from two main sources:
Cache invalidation failures due to middleware or network issues.
Timing anomalies: during a cache miss, one thread reads the old value from the database; meanwhile another thread updates the database and invalidates the cache; the first thread then writes the now-stale value back into the cache.
Mitigation techniques include delayed double‑delete (enqueue an invalidation command and execute it after a short delay) and CDC‑based synchronization (listen to MySQL binlog changes via Canal, publish to Kafka, and trigger cache invalidation).
Delayed double-delete: the primary thread invalidates the cache, updates the database, and then places a second invalidation command in a delay queue; a worker thread executes it after a short delay, removing any stale value written back in the meantime (see the sketch after this list).
CDC sync: use Canal to capture MySQL binlog, push changes to Kafka, and let consumers invalidate the cache.
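A minimal sketch of the delayed double-delete flow, assuming placeholder CacheClient and Database interfaces and using a single-threaded scheduler as the delay queue. The 500 ms delay is only an example; in practice it should exceed the typical read-and-write-back window.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative delayed double-delete: the cache entry is deleted before the DB
// update and again after a short delay, so a stale value written back by a
// concurrent reader during the update window is also removed.
final class DelayedDoubleDelete {
    private final ScheduledExecutorService delayQueue = Executors.newSingleThreadScheduledExecutor();
    private final CacheClient cache;     // placeholder for Redis or another cache client
    private final Database db;           // placeholder for the DB access layer

    DelayedDoubleDelete(CacheClient cache, Database db) {
        this.cache = cache;
        this.db = db;
    }

    void update(String key, String newValue) {
        cache.delete(key);                               // first delete
        db.update(key, newValue);                        // update the source of truth
        delayQueue.schedule(() -> cache.delete(key),     // second, delayed delete
                500, TimeUnit.MILLISECONDS);
    }

    interface CacheClient { void delete(String key); }
    interface Database { void update(String key, String value); }
}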
2 From Heap Memory to Direct Memory
2.1 Direct Memory Introduction
Java local caches can be heap-based or off-heap (direct memory). Heap caches add GC pressure because cached objects are long-lived, get promoted to the old generation, and are only reclaimed during major GC cycles, which lengthens pauses. Direct memory avoids most GC pressure but requires manual allocation and release plus explicit serialization, introducing risks of OOM and memory leaks.
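In Java, off-heap memory is typically reached through direct ByteBuffers. The sketch below shows the manual serialize-on-write and deserialize-on-read steps an off-heap cache must perform; plain UTF-8 encoding stands in here for a real serializer.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative direct-memory usage: the value lives outside the Java heap,
// so it must be serialized on write and deserialized on read.
final class DirectMemoryExample {
    public static void main(String[] args) {
        byte[] payload = "cached-value".getBytes(StandardCharsets.UTF_8);

        ByteBuffer offHeap = ByteBuffer.allocateDirect(payload.length); // off-heap allocation
        offHeap.put(payload);                                           // copy out of the heap

        offHeap.flip();
        byte[] copy = new byte[offHeap.remaining()];
        offHeap.get(copy);                                              // copy back into the heap
        System.out.println(new String(copy, StandardCharsets.UTF_8));   // deserialize
    }
}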
2.2 Direct Memory Management
Direct memory allocation is expensive as it involves kernel calls. Large memory chunks are allocated and then subdivided for threads. Deallocation typically returns memory to a pool for reuse. Algorithms such as jemalloc (used by Redis and recommended for OHC caches) provide efficient allocation and low fragmentation.
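A very simplified illustration of the pooling idea, assuming a fixed block size (real allocators such as jemalloc use size classes and per-thread arenas): one large direct buffer is allocated up front, sliced into blocks, and blocks are recycled through a free list instead of being returned to the OS.
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified direct-memory pool: one expensive allocateDirect call up front,
// then cheap acquire/release of fixed-size slices from a free list.
final class DirectBufferPool {
    private final Deque<ByteBuffer> freeList = new ArrayDeque<>();

    DirectBufferPool(int blockSize, int blockCount) {
        ByteBuffer chunk = ByteBuffer.allocateDirect(blockSize * blockCount);
        for (int i = 0; i < blockCount; i++) {
            chunk.position(i * blockSize).limit((i + 1) * blockSize);
            freeList.push(chunk.slice());    // each slice is an independent view of the chunk
        }
    }

    ByteBuffer acquire() {
        ByteBuffer block = freeList.poll();
        if (block == null) throw new IllegalStateException("pool exhausted");
        block.clear();
        return block;
    }

    void release(ByteBuffer block) {
        freeList.push(block);                // return the block for reuse instead of freeing it
    }
}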
3 CPU Cache
3.1 Cache Line
CPU caches are organized in 64‑byte cache lines. When the CPU loads data from main memory, it fetches an entire line, so sequential data access (e.g., iterating over arrays) benefits from spatial locality.
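The effect is easy to observe with a two-dimensional array: row-major traversal touches consecutive memory and reuses each fetched cache line, while column-major traversal jumps across lines. A minimal sketch (absolute timings will vary by machine):
// Illustrative spatial-locality comparison: row-major traversal walks memory
// sequentially and benefits from 64-byte cache lines; column-major does not.
final class LocalityDemo {
    private static final int N = 4096;

    public static void main(String[] args) {
        int[][] matrix = new int[N][N];

        long t0 = System.nanoTime();
        long rowSum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                rowSum += matrix[i][j];          // sequential access within each row
        long rowMajorNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        long colSum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                colSum += matrix[i][j];          // strided access, a new cache line almost every step
        long colMajorNs = System.nanoTime() - t0;

        System.out.printf("row-major: %d ms, column-major: %d ms%n",
                rowMajorNs / 1_000_000, colMajorNs / 1_000_000);
    }
}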
3.2 False Sharing
False sharing occurs when multiple threads modify different variables that reside on the same cache line, causing the line to bounce between cores and degrading performance. In the class below, two threads updating fields no0 and no1 of the same instance share a cache line, leading to repeated invalidations.
class NoPadding {
    long no0;
    long no1;
}
3.3 Mitigation Strategies
Common solutions include padding variables so they fall on separate cache lines, using the @sun.misc.Contended annotation (which, for classes outside the JDK, requires the -XX:-RestrictContended JVM flag), or adopting lock-free data structures that avoid false sharing altogether.
Padding example:
class Padding {
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long no0 = 0L;
    long p9, p10, p11, p12, p13, p14;
    volatile long no1 = 0L;
}
Contended annotation example:
@sun.misc.Contended
static final class CounterCell {
    volatile long value;
    CounterCell(long x) { value = x; }
}
Lock-free designs such as LongAdder in the JDK or the Disruptor framework eliminate the need for padding by using CAS-based algorithms.
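For reference, a minimal LongAdder usage sketch: concurrent increments are spread across internally padded cells (similar in spirit to the CounterCell shown above) and summed on read.
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

// LongAdder avoids both lock contention and false sharing on a single hot counter.
final class LongAdderDemo {
    public static void main(String[] args) {
        LongAdder counter = new LongAdder();
        IntStream.range(0, 8).parallel().forEach(t -> {
            for (int i = 0; i < 1_000_000; i++) counter.increment();
        });
        System.out.println(counter.sum());   // 8,000,000
    }
}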
4 Summary
When strict response-time (RT) requirements drive performance tuning, caching at every level, from distributed and local caches down to the CPU cache, plays a central role. Any technology should be introduced with a holistic view of its concepts, trade-offs, and appropriate use cases to avoid hidden risks. For deeper details, consult resources such as "Redis Design and Implementation" and the Disruptor design documentation.