
Mastering Cache Optimization: From Distributed to CPU Cache and Beyond

This article explores the fundamentals and advanced techniques of cache optimization, covering multi‑level caching, read/write performance gains, cache miss handling, consistency strategies, heap versus direct memory, CPU cache effects, false sharing, and practical mitigation patterns.

Alibaba Cloud Developer

Jun 9, 2021


Cache and Multi‑Level Caching

Cache optimization cases abound, from operating systems and databases to distributed and local caches. All serve the same purpose: bridging the gap between the CPU's high compute throughput and the slow I/O of storage.

Introducing a new component, such as a cache, inevitably increases system complexity; while it boosts performance, it also brings design trade‑offs that developers must consider.

1 Cache Introduction

When traffic is low, a database can handle read/write pressure directly. As traffic grows, database load and latency increase, prompting the adoption of distributed caches to relieve DB pressure and increase QPS.

Eventually, distributed caches become bottlenecks due to high QPS, eviction, and network jitter. Introducing a local cache reduces pressure on the distributed layer and cuts serialization overhead.
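The layered read path described above can be sketched as follows. This is a minimal illustration, not a production design: the class name TwoLevelCache is hypothetical, a plain Map stands in for the distributed cache client, and a Function stands in for the database query.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Two-level read path: local cache, then distributed cache, then database. */
class TwoLevelCache {
    private final Map<String, String> local = new ConcurrentHashMap<>();
    private final Map<String, String> distributed;    // stand-in for a remote cache client (assumption)
    private final Function<String, String> database;  // stand-in for a DB query (assumption)

    TwoLevelCache(Map<String, String> distributed, Function<String, String> database) {
        this.distributed = distributed;
        this.database = database;
    }

    String get(String key) {
        String value = local.get(key);
        if (value != null) {
            return value;                    // local hit: no network call, no deserialization
        }
        value = distributed.get(key);
        if (value == null) {
            value = database.apply(key);     // both caches missed: fall back to the DB
            if (value == null) {
                return null;                 // key does not exist anywhere
            }
            distributed.put(key, value);     // populate the distributed layer
        }
        local.put(key, value);               // populate the local layer
        return value;
    }
}
```

A real local layer would also need a size bound and an expiry policy; both are left out here to keep the read path visible.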

2 Read/Write Performance Gains

Caches improve read/write performance by reducing costly I/O operations. The table below (omitted) shows that disk and network I/O latency far exceeds memory access latency.

Read optimization: a cache hit returns data directly, skipping I/O.

Write optimization: writes are buffered and batched, allowing the I/O device to process them efficiently.
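Write buffering can be illustrated with a small sketch. The class BatchingWriter and its sink callback are hypothetical names; the sink stands in for whatever performs the actual I/O (a disk write, a bulk insert), and flushing is triggered only by batch size here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers individual writes and hands them to the I/O sink in batches. */
class BatchingWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;        // stand-in for the real I/O call (assumption)
    private final List<T> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Appends to the in-memory buffer; I/O happens only when a batch is full. */
    synchronized void write(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered as one batch, then empties the buffer. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));    // copy so the sink never sees later mutations
        buffer.clear();
    }
}
```

A real write buffer would usually also flush on a timer, and it has to decide what happens to buffered data if the process crashes before a flush.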

The net effect is higher achievable QPS and lower response time.

3 Cache Miss

Cache misses are inevitable: a cache must keep hot data within limited capacity to balance performance and cost. Many caches use an LRU policy, which evicts the least‑recently‑used keys first.
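For reference, strict LRU can be sketched in a few lines on top of java.util.LinkedHashMap in access order. The class name LruCache is illustrative, and the sketch is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Strict LRU: the least recently accessed entry is evicted once capacity is exceeded. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: get() moves an entry to the tail
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // called after each insertion; true evicts the head (oldest) entry
    }
}
```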

Approximate LRU

Strict LRU on millions of keys is expensive. Redis instead uses an approximate LRU: it randomly samples a small number of keys (five by default, configurable via maxmemory-samples) and evicts the one with the longest idle time, which reduces cost while achieving similar effectiveness.
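The sampling idea can be sketched as follows. This is a simplified illustration, not Redis's actual code: SampledLruCache is a hypothetical class, it uses a logical counter instead of wall-clock idle time, and it samples with replacement.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Approximate LRU: evict the oldest of a few randomly sampled keys. */
class SampledLruCache {
    private static final int SAMPLE_SIZE = 5;
    private final int capacity;                                   // must be at least 1
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> lastAccess = new HashMap<>();
    private final List<String> keys = new ArrayList<>();          // index access for random picks
    private final Random random;
    private long clock = 0;                                       // logical access counter

    SampledLruCache(int capacity, long seed) {
        this.capacity = capacity;
        this.random = new Random(seed);
    }

    String get(String key) {
        String value = values.get(key);
        if (value != null) {
            lastAccess.put(key, ++clock);
        }
        return value;
    }

    void put(String key, String value) {
        if (!values.containsKey(key)) {
            if (values.size() >= capacity) {
                evictOne();
            }
            keys.add(key);
        }
        values.put(key, value);
        lastAccess.put(key, ++clock);
    }

    int size() {
        return values.size();
    }

    private void evictOne() {
        int victim = -1;
        long oldest = Long.MAX_VALUE;
        for (int i = 0; i < SAMPLE_SIZE; i++) {
            int idx = random.nextInt(keys.size());
            long seen = lastAccess.get(keys.get(idx));
            if (seen < oldest) {
                oldest = seen;
                victim = idx;
            }
        }
        String key = keys.get(victim);
        int last = keys.size() - 1;
        keys.set(victim, keys.get(last));    // swap-remove keeps removal O(1)
        keys.remove(last);
        values.remove(key);
        lastAccess.remove(key);
    }
}
```

Each eviction inspects only five entries regardless of cache size, at the price of sometimes evicting a key that is not the globally oldest.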

Avoid Massive Short‑Term Expirations

Batch loading data into cache (e.g., via Excel upload) can assign identical TTLs, causing a sudden surge of DB traffic when the cache expires. Randomizing TTL, such as TTL=8h+random(8000)ms, spreads expirations.
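A minimal sketch of the TTL randomization, using the 8 h base and 8000 ms jitter from the example above. The helper name JitteredTtl.withJitter is hypothetical.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Adds a random offset to a base TTL so that bulk-loaded keys do not expire together. */
class JitteredTtl {
    static Duration withJitter(Duration base, long maxJitterMillis) {
        // nextLong(bound) returns a value in [0, bound), so +1 makes the maximum inclusive
        long jitter = ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
        return base.plusMillis(jitter);
    }
}
```

A caller would compute `JitteredTtl.withJitter(Duration.ofHours(8), 8000)` for each key at load time and pass the result to the cache client's expiry setting.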

4 Cache Consistency

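Systems should strive for DB‑cache consistency, commonly using the cache‑aside pattern: reads check the cache first and populate it from the DB on a miss, while writes update the DB and then invalidate the cached entry. Avoid unconventional variants, such as updating the cache before the DB or updating the cache in place after the DB write, as they increase the risk of inconsistency.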

Typical cache design patterns include cache‑aside for business systems, and write‑through/write‑back for OS, databases, and distributed caches.
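A minimal sketch of cache-aside, with in-memory maps standing in for both the cache and the database (an assumption made only to keep the example self-contained; CacheAsideRepository is a hypothetical name):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Cache-aside: the application, not the cache, talks to the database. */
class CacheAsideRepository {
    private final Map<String, String> cache = new ConcurrentHashMap<>(); // stand-in for a cache client
    private final Map<String, String> db = new ConcurrentHashMap<>();    // stand-in for a database

    /** Read path: try the cache, fall back to the DB, then populate the cache. */
    String read(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value;
        }
        value = db.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    /** Write path: update the DB first, then invalidate (not update) the cached entry. */
    void write(String key, String value) {
        db.put(key, value);
        cache.remove(key);
    }
}
```

Invalidating rather than updating on write means the next reader reloads the current DB value, so two concurrent writers cannot leave their cache updates in the wrong order.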

Cache‑Aside Inconsistency

Even with cache‑aside, inconsistencies can arise from failed cache invalidations or from timing. For example, thread A misses the cache and reads the old value from the DB; thread B then updates the DB and invalidates the cache; if B's invalidation completes before A writes its value into the cache, A repopulates the cache with stale data.

Mitigation techniques include delayed double delete and CDC‑based synchronization, which increase system complexity and must be weighed against business tolerance.

Delayed double delete: after the first invalidation, place a second invalidation of the same key in a delayed queue processed by another thread, so that a stale value written in the meantime is removed.

CDC sync: subscribe to MySQL binlog changes via Canal, forward to Kafka, and trigger cache invalidation on consumption.
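Delayed double delete can be sketched with a ScheduledExecutorService playing the role of the delayed queue. The class DelayedDoubleDelete, the in-memory maps, and the delay value are all illustrative assumptions; the right delay depends on how long a slow reader may take to repopulate the cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Write path that invalidates the cache twice: immediately and again after a delay. */
class DelayedDoubleDelete {
    final Map<String, String> cache = new ConcurrentHashMap<>(); // stand-in for a cache client
    final Map<String, String> db = new ConcurrentHashMap<>();    // stand-in for a database
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long delayMillis;

    DelayedDoubleDelete(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    void write(String key, String value) {
        db.put(key, value);
        cache.remove(key);                               // first delete
        scheduler.schedule(() -> { cache.remove(key); }, // second delete, run by another thread
                delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops accepting work and waits for pending second deletes to finish. */
    boolean close(long timeoutMillis) {
        scheduler.shutdown(); // delayed tasks already scheduled still run by default
        try {
            return scheduler.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

If a slow reader writes an old value into the cache between the two deletes, the second delete removes it, and the next read reloads from the DB.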

From Heap Memory to Direct Memory

1 Introduction of Direct Memory

Java local caches can be heap‑based or off‑heap (direct memory). Heap caches suffer from GC pauses for large objects; off‑heap caches avoid GC pressure but require manual memory management and serialization.

Direct memory reduces GC impact because only references reside on the heap while actual data lives off‑heap.
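The split between an on-heap reference and off-heap data can be illustrated with java.nio.ByteBuffer. OffHeapValue is a hypothetical wrapper, and UTF-8 strings stand in for whatever serialization format a real cache would use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Keeps a serialized value in direct (off-heap) memory. */
class OffHeapValue {
    private final ByteBuffer buffer; // only this small buffer object lives on the heap

    OffHeapValue(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8); // serialize
        buffer = ByteBuffer.allocateDirect(bytes.length);      // allocate native memory
        buffer.put(bytes);
        buffer.flip();                                          // position = 0, limit = length
    }

    String read() {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes); // duplicate() leaves this buffer's position untouched
        return new String(bytes, StandardCharsets.UTF_8);       // deserialize
    }

    boolean isDirect() {
        return buffer.isDirect();
    }
}
```

Every read copies and decodes the bytes, which is the serialization cost that off-heap caches pay in exchange for lower GC pressure.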

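The native memory behind a direct buffer is released only when the buffer object itself is garbage‑collected, and the JDK falls back to System.gc() when the direct‑memory limit is reached, so the timing of reclamation is nondeterministic; applications therefore often manage off‑heap memory manually using malloc/free‑like APIs.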

2 Direct Memory Management

Allocating and freeing off‑heap memory is costly; large blocks are allocated and then sub‑allocated. Reclaimed blocks are kept in a pool for reuse, requiring algorithms to minimize fragmentation.
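The allocate-once, sub-allocate, and reuse cycle can be sketched with a fixed-size chunk pool. This is a deliberately simple illustration under stated assumptions: SlabPool is a hypothetical class, all chunks share one size, and handles are plain offsets. Production allocators such as jemalloc handle many sizes and are far more elaborate.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Carves one large direct buffer into equal chunks and recycles them through a free list. */
class SlabPool {
    private final ByteBuffer arena;                 // the single large off-heap block
    private final int chunkSize;
    private final Deque<Integer> freeOffsets = new ArrayDeque<>();

    SlabPool(int chunkSize, int chunkCount) {
        this.chunkSize = chunkSize;
        this.arena = ByteBuffer.allocateDirect(chunkSize * chunkCount);
        for (int i = chunkCount - 1; i >= 0; i--) {
            freeOffsets.push(i * chunkSize);        // offset 0 ends up at the head
        }
    }

    /** Returns the offset of a free chunk, or -1 when the pool is exhausted. */
    synchronized int acquire() {
        Integer offset = freeOffsets.poll();
        return offset == null ? -1 : offset;
    }

    /** Returns a chunk to the pool for reuse; no native free() happens. */
    synchronized void release(int offset) {
        freeOffsets.push(offset);
    }

    /** A view restricted to one chunk of the arena. */
    ByteBuffer view(int offset) {
        ByteBuffer dup = arena.duplicate();
        dup.position(offset);
        dup.limit(offset + chunkSize);
        return dup.slice();
    }
}
```

Equal-sized chunks avoid external fragmentation entirely but waste space when values are smaller than a chunk, which is the trade-off that allocation algorithms try to balance.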

Jemalloc is a widely used allocator (e.g., in Redis, OHC cache) with Java ports available.

CPU Cache

Beyond distributed and local caches, CPU caches (L1, L2, L3) also affect performance under high concurrency.

The CPU cache moves data in cache lines, typically 64 bytes on mainstream x86 CPUs; accessing one memory address loads the entire line, which makes sequential data access highly efficient.
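The effect of access order can be explored with the sketch below, which sums the same matrix in two orders. TraversalOrder is a hypothetical class and the matrix is assumed to be rectangular. Both methods return the same result; any timing difference depends on hardware, matrix size, and JIT behavior, so no numbers are claimed here.

```java
/** Two traversal orders over the same data: one cache-friendly, one not. */
class TraversalOrder {
    /** Walks each row left to right, so consecutive reads hit neighbouring addresses. */
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int[] row : m) {
            for (int v : row) {
                sum += v;
            }
        }
        return sum;
    }

    /** Walks column by column, jumping to a different row array on every read. */
    static long sumColumnMajor(int[][] m) {
        long sum = 0;
        int cols = m.length == 0 ? 0 : m[0].length;
        for (int c = 0; c < cols; c++) {
            for (int[] row : m) {
                sum += row[c];
            }
        }
        return sum;
    }
}
```

In the row-major version, one loaded cache line serves several consecutive int reads; in the column-major version, each read typically touches a different line.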

False sharing occurs when multiple threads modify different variables that reside on the same cache line, causing unnecessary invalidations and performance loss.

class NoPadding {
    long no0;   // no0 and no1 are adjacent, so they are likely to share one cache line
    long no1;
}

False Sharing Solutions

Padding separates variables onto different cache lines:

class Padding {
    long p1, p2, p3, p4, p5, p6, p7;          // 56 bytes of padding before no0
    volatile long no0 = 0L;
    long p9, p10, p11, p12, p13, p14, p15;    // 56 bytes of padding place no1 64 bytes after no0
    volatile long no1 = 0L;
}

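JDK 8 provides the @sun.misc.Contended annotation (moved to jdk.internal.vm.annotation.Contended in JDK 9 and later), which lets the JVM add padding automatically. For classes outside the JDK it takes effect only when the JVM is started with -XX:-RestrictContended. The JDK itself uses it, for example in ConcurrentHashMap: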

@sun.misc.Contended
static final class CounterCell {
    volatile long value;
    CounterCell(long x) { value = x; }
}

Lock‑free concurrency designs such as Disruptor also take false sharing into account: Disruptor, for example, pads its sequence counters so that each occupies its own cache line.

Conclusion

As response‑time requirements rise, cache usage becomes pervasive. When introducing any technology, consider its concepts, principles, applicable scenarios, and pitfalls to avoid risks early in design. Distributed cache, local cache, and CPU cache each have extensive considerations; interested readers can consult "Redis Design and Implementation" and Disruptor documentation for deeper details.
