Can Smart Caching Slash SSD Read Latency for High‑Performance Ad Search?
This article examines why caching matters for ad‑search workloads, surveys classic eviction policies (LRU, LFU, Redis Approx‑LRU, the OS PageCache's DoubleClock, ARC, 2Q, and TinyLFU), and presents SsdEngine, a layered TLS‑centered cache design, with benchmark results showing hit‑rate and latency improvements across Zipfian and uniform workloads.
Cache Algorithms Overview
LRU – Least Recently Used
LRU keeps a linked list of cached entries ordered by recent access. On a get the entry is moved to the front; on a put a new entry is inserted at the front and, if the cache is full, the tail (least‑recently used) entry is evicted.
get(int key) {
    auto it = cache_map.find(key);               // locate in map
    if (it == cache_map.end()) return -1;        // miss
    list.splice(list.begin(), list, it->second); // move to front
    return it->second->value;
}
put(int key, int value) {
    if (cache_map.contains(key)) {
        // update value and move to front
    } else {
        if (size >= capacity) {
            cache_map.erase(list.back().key);    // drop the tail from the map too
            list.pop_back();
        }
        list.emplace_front(key, value);
        cache_map[key] = list.begin();
    }
}
LFU – Least Frequently Used
LFU evicts the entry with the smallest access count. It maintains frequency buckets so that each access increments the entry’s frequency and moves it to the next bucket. Because frequencies can become stale, a periodic decay is usually applied.
get(int key) {
    auto it = cache_map.find(key);
    if (it == cache_map.end()) return -1;        // miss
    int freq = it->second.freq;
    // move to next frequency bucket (or create one)
}
put(int key, int value) {
    if (cache_map.contains(key)) {
        // update existing entry
    } else {
        if (size >= capacity) evict_least_frequent();
        // insert into frequency‑1 bucket
    }
}
Redis Approx‑LRU
Redis stores a 24‑bit timestamp (robj.lru) in each object. When eviction is needed it samples a small set of keys and removes the one with the oldest timestamp, providing an approximate LRU without extra linked lists.
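The sampling step can be sketched in C++ as a toy user‑space illustration; `SampledLruCache` and its members are invented names for this sketch, not Redis internals:

```cpp
#include <cstdint>
#include <iterator>
#include <random>
#include <string>
#include <unordered_map>

// Each entry carries a coarse access clock; eviction samples a few random
// entries and drops the stalest one, approximating LRU with no extra lists.
struct Entry {
    std::string value;
    uint32_t lru;  // last-access tick, like robj.lru (24 bits in Redis)
};

class SampledLruCache {
public:
    explicit SampledLruCache(int sample_size) : sample_size_(sample_size) {}

    void put(const std::string& key, std::string value) {
        map_[key] = Entry{std::move(value), tick_++};
    }

    const std::string* get(const std::string& key) {
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        it->second.lru = tick_++;  // refresh the access clock
        return &it->second.value;
    }

    // Sample `sample_size_` random entries and evict the one with the
    // oldest access tick (sampling is with replacement, as a sketch).
    void evict_one() {
        if (map_.empty()) return;
        auto victim = map_.end();
        for (int i = 0; i < sample_size_; ++i) {
            auto it = map_.begin();
            std::advance(it, rng_() % map_.size());
            if (victim == map_.end() || it->second.lru < victim->second.lru)
                victim = it;
        }
        map_.erase(victim);
    }

    size_t size() const { return map_.size(); }

private:
    int sample_size_;
    uint32_t tick_ = 0;
    std::unordered_map<std::string, Entry> map_;
    std::mt19937 rng_{42};
};
```

The appeal of this design is that eviction cost is bounded by the sample size, not the cache size, at the price of occasionally evicting a non‑oldest entry.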
typedef struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS;  // LRU time or LFU data
    int refcount;
    void *ptr;
} robj;
OS PageCache – DoubleClock
The kernel maintains two FIFO lists: an active list and an inactive list. Pages start in the inactive list; a reference moves them to the active list. Eviction considers both the activity flag and the reference bit, approximating LRU with lower overhead.
SsdEngine Cache Design
SsdEngine targets read‑intensive ad‑search workloads with strict SLA requirements. It combines a simplified RCU model with a two‑layer cache to minimise contention and maximise read safety.
Layered Cache Architecture
Two layers are used:
Thread‑local storage (TLS) cache: implemented with a radix‑tree/Adaptive Radix Tree (ART) for fast per‑thread look‑ups.
Central cache: an array of page slots accessed via lock‑free compare‑and‑swap (CAS). The TLS cache handles the common case; a miss falls back to the central cache.
Flying Pointer‑Swizzling
When multiple threads request the same page, the first thread allocates the page and marks the entry as “in‑flight” by setting the low‑order bit of the stored pointer. Subsequent threads see the marker and skip duplicate reads. After the page is loaded, the marker is cleared and the page is inserted into the central cache.
fetch_page() {
    page = page_manager->alloc_page();
    // tag the low‑order bit to mark the entry as in‑flight
    local_table->insert(key, reinterpret_cast<const char*>(
        reinterpret_cast<uintptr_t>(page) | 1));
    return {page, true};
}
fetch_page_done() {
    const char* p = local_table->find(key);
    if (!(reinterpret_cast<uintptr_t>(p) & 1)) return 0;  // not an in‑flight entry
    local_table->remove(key);
    page = reinterpret_cast<const char*>(reinterpret_cast<uintptr_t>(p) & ~1ULL);
    // add to central cache
}
Adaptive Replacement Cache (ARC)
ARC maintains two LRU lists (T1 – recent, T2 – frequent) and two ghost lists (B1, B2). Hits in the ghost lists adjust the adaptive target size p, allowing the cache to react to workload changes without manual tuning.
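The adaptation step can be illustrated in isolation, following the published ARC rule; the four lists themselves are elided and `adapt_on_ghost_hit` is an invented name:

```cpp
#include <algorithm>
#include <cstddef>

// ARC's self-tuning step: p is the target size of the recency list T1
// within a cache of total size c. A hit in ghost list B1 means T1 was too
// small, so p grows; a hit in B2 shrinks it. The delta is scaled by the
// relative ghost-list sizes. (A ghost hit implies that list is non-empty,
// so the divisions below cannot divide by zero.)
size_t adapt_on_ghost_hit(size_t p, size_t c,
                          size_t b1, size_t b2, bool hit_in_b1) {
    if (hit_in_b1) {
        size_t delta = (b1 >= b2) ? 1 : b2 / b1;  // favour recency
        return std::min(p + delta, c);
    } else {
        size_t delta = (b2 >= b1) ? 1 : b1 / b2;  // favour frequency
        return p > delta ? p - delta : 0;
    }
}
```

Because the ghost lists hold only keys, this tuning costs almost no memory, which is what lets ARC track workload shifts without manual configuration.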
2Q
2Q uses a FIFO queue for first‑time accesses and an LRU queue for items accessed at least twice, separating recency from frequency at a lower bookkeeping cost than ARC.
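A minimal sketch of this two‑queue scheme (`TwoQueueCache` is an invented name, and the A1out ghost list of full 2Q is omitted for brevity):

```cpp
#include <algorithm>
#include <deque>
#include <unordered_set>

// First-time accesses go to a FIFO (A1); a key touched again while still
// in A1 is promoted to the LRU main queue (Am).
class TwoQueueCache {
public:
    TwoQueueCache(size_t a1_cap, size_t am_cap)
        : a1_cap_(a1_cap), am_cap_(am_cap) {}

    void access(int key) {
        if (in_am_.count(key)) {          // main-queue hit: LRU move-to-front
            am_.erase(std::find(am_.begin(), am_.end(), key));
            am_.push_front(key);
        } else if (in_a1_.count(key)) {   // second touch: promote to Am
            a1_.erase(std::find(a1_.begin(), a1_.end(), key));
            in_a1_.erase(key);
            if (am_.size() >= am_cap_) {
                in_am_.erase(am_.back());
                am_.pop_back();
            }
            am_.push_front(key);
            in_am_.insert(key);
        } else {                          // first touch: FIFO insert
            if (a1_.size() >= a1_cap_) {
                in_a1_.erase(a1_.back());
                a1_.pop_back();
            }
            a1_.push_front(key);
            in_a1_.insert(key);
        }
    }

    bool in_main(int key) const { return in_am_.count(key) > 0; }
    bool cached(int key) const {
        return in_am_.count(key) || in_a1_.count(key);
    }

private:
    size_t a1_cap_, am_cap_;
    std::deque<int> a1_, am_;
    std::unordered_set<int> in_a1_, in_am_;
};
```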
TinyLFU
TinyLFU adds a small admission filter (Doorkeeper + Count‑Min Sketch) in front of an LRU window. The filter estimates global access frequency, allowing the cache to admit only likely‑hot items while keeping memory overhead low.
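The admission filter can be sketched as follows (toy sizes and invented names; real TinyLFU also periodically halves the counters to age out stale frequencies):

```cpp
#include <algorithm>
#include <array>
#include <bitset>
#include <cstdint>
#include <functional>

// Doorkeeper bit-set absorbs one-hit wonders; a small Count-Min sketch
// estimates frequency for keys that pass it.
class TinyLfuFilter {
public:
    void record(uint64_t key) {
        size_t d = hash(key, 0) % kDoorkeeperBits;
        if (!doorkeeper_.test(d)) {      // first sighting: Doorkeeper only
            doorkeeper_.set(d);
            return;
        }
        for (int row = 0; row < kRows; ++row) {
            uint8_t& c = sketch_[row][hash(key, row + 1) % kWidth];
            if (c < 255) ++c;            // saturating 8-bit counters
        }
    }

    uint32_t estimate(uint64_t key) const {
        uint32_t est = UINT32_MAX;
        for (int row = 0; row < kRows; ++row)
            est = std::min<uint32_t>(est,
                                     sketch_[row][hash(key, row + 1) % kWidth]);
        return est + (doorkeeper_.test(hash(key, 0) % kDoorkeeperBits) ? 1 : 0);
    }

    // Admission: the candidate replaces the victim only if it looks hotter.
    bool admit(uint64_t candidate, uint64_t victim) const {
        return estimate(candidate) > estimate(victim);
    }

private:
    static size_t hash(uint64_t key, int seed) {
        return std::hash<uint64_t>{}(key ^ (0x9e3779b97f4a7c15ULL * (seed + 1)));
    }
    static constexpr int kRows = 4;
    static constexpr size_t kWidth = 1024;
    static constexpr size_t kDoorkeeperBits = 4096;
    std::bitset<kDoorkeeperBits> doorkeeper_;
    std::array<std::array<uint8_t, kWidth>, kRows> sketch_{};
};
```

Taking the minimum across rows means the sketch can overestimate a key's frequency (hash collisions) but never underestimate it, which is the property admission relies on.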
Evaluation
Environment: Intel Xeon Gold 5118 (2.30 GHz, 16.5 MB L3 cache), Linux 3.10 kernel, RocksDB built with GCC 12 at ‑O2. SsdEngine is linked into a single‑node RocksDB instance.
Workloads:
10 M keys (~2 pages per key).
Access patterns: Zipfian (s = 0.6 and s = 0.9) and uniform.
Cache sizes: 4 GB, 6 GB, 8 GB, 10 GB, 12 GB.
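Zipfian key streams like those above can be reproduced with a simple generator; `zipfian_keys` is an invented helper and the parameters are illustrative:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Draw `count` key ranks from a Zipfian distribution over `num_keys`
// ranks: rank r (1-based) is chosen with probability proportional to
// 1 / r^s. Higher s concentrates more accesses on the hottest keys.
std::vector<size_t> zipfian_keys(size_t num_keys, double s, size_t count,
                                 unsigned seed = 42) {
    std::vector<double> weights(num_keys);
    for (size_t r = 0; r < num_keys; ++r)
        weights[r] = 1.0 / std::pow(static_cast<double>(r + 1), s);

    std::mt19937 rng(seed);
    std::discrete_distribution<size_t> dist(weights.begin(), weights.end());

    std::vector<size_t> keys(count);
    for (auto& k : keys) k = dist(rng);
    return keys;
}
```

With s = 0.9 the top few percent of ranks receive the majority of accesses, which is what makes caching effective; the uniform workload is the degenerate case s = 0.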
Results
Zipfian s = 0.6: LRU achieves higher hit ratios than DoubleClock, especially with larger caches. Tail latency (99.99th percentile) is also lower for LRU.
Zipfian s = 0.9: same trend – LRU retains a clear advantage in hit rate and tail latency.
Uniform distribution: LRU and DoubleClock perform similarly; the extra complexity of DoubleClock provides no measurable benefit.
TinyLFU vs. LRU: TinyLFU adds a modest hit‑rate improvement across all distributions, most noticeable when the cache is small relative to the key space. It requires a warm‑up period before the filter stabilises.
Key Takeaways
For workloads with strong locality (Zipfian), pure LRU outperforms approximate algorithms.
When access is uniform, simpler policies are sufficient.
Frequency‑aware admission (TinyLFU) can boost performance for limited cache sizes.
Flying pointer‑swizzling and the TLS‑central cache split reduce duplicate I/O and contention in read‑heavy scenarios.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.