
Can Smart Caching Slash SSD Read Latency for High‑Performance Ad Search?

This article examines why caching is essential for ad‑search workloads, evaluates classic eviction policies (LRU, LFU, Redis Approx‑LRU, the OS page cache's Double Clock, ARC, 2Q, and TinyLFU), and presents a layered, TLS‑centered cache design (SsdEngine) with detailed benchmark results showing hit‑rate and latency improvements across Zipfian and uniform workloads.


Cache Algorithms Overview

LRU – Least Recently Used

LRU keeps a linked list of cached entries ordered by recent access. On a get the entry is moved to the front; on a put a new entry is inserted at the front and, if the cache is full, the tail (least‑recently used) entry is evicted.

// LRU sketch: cache_map maps key -> iterator into list; list holds
// (key, value) pairs ordered from most to least recently used.
int get(int key) {
    auto it = cache_map.find(key);
    if (it == cache_map.end()) return -1;            // miss
    list.splice(list.begin(), list, it->second);     // move to front
    return it->second->second;
}

void put(int key, int value) {
    auto it = cache_map.find(key);
    if (it != cache_map.end()) {                     // update value, move to front
        it->second->second = value;
        list.splice(list.begin(), list, it->second);
        return;
    }
    if (cache_map.size() >= capacity) {              // evict least recently used
        cache_map.erase(list.back().first);
        list.pop_back();
    }
    list.emplace_front(key, value);
    cache_map[key] = list.begin();
}

LFU – Least Frequently Used

LFU evicts the entry with the smallest access count. It maintains frequency buckets so that each access increments the entry’s frequency and moves it to the next bucket. Because frequencies can become stale, a periodic decay is usually applied.

// LFU sketch: cache_map maps key -> {value, freq, position in its frequency
// bucket}; freq_buckets maps an access count to the list of keys having it.
int get(int key) {
    auto it = cache_map.find(key);
    if (it == cache_map.end()) return -1;            // miss
    int freq = it->second.freq;
    freq_buckets[freq].erase(it->second.pos);        // leave the current bucket
    it->second.pos =
        freq_buckets[freq + 1].insert(freq_buckets[freq + 1].end(), key);
    it->second.freq = freq + 1;                      // move to the next bucket
    return it->second.value;
}

void put(int key, int value) {
    auto it = cache_map.find(key);
    if (it != cache_map.end()) {                     // update existing entry
        it->second.value = value;
        return;
    }
    if (cache_map.size() >= capacity) evict_least_frequent();
    cache_map[key] = {value, 1,
        freq_buckets[1].insert(freq_buckets[1].end(), key)};  // frequency-1 bucket
}

Redis Approx‑LRU

Redis stores a 24‑bit access timestamp (robj.lru) in each object. When eviction is needed, it samples a small set of keys and removes the one with the oldest timestamp, approximating LRU without maintaining extra linked lists.

typedef struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS; // LRU time or LFU data
    int refcount;
    void *ptr;
} robj;
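The sampling step can be pictured with the following minimal sketch. It is not the actual Redis implementation: the Entry struct, the evict_one helper, and the use of a standard hash map are assumptions made purely for illustration, with the sample size chosen in the spirit of Redis's maxmemory-samples setting.

// Minimal sketch of sampled eviction (illustrative, not Redis source):
// pick a handful of random entries and evict the one idle the longest.
#include <cstdint>
#include <iterator>
#include <random>
#include <unordered_map>

struct Entry { uint32_t lru; int value; };             // lru = last-access clock

void evict_one(std::unordered_map<int, Entry>& dict, uint32_t now, std::mt19937& rng) {
    const int kSampleSize = 5;                         // cf. Redis maxmemory-samples
    int victim_key = 0;
    uint32_t max_idle = 0;
    bool found = false;
    for (int i = 0; i < kSampleSize && !dict.empty(); ++i) {
        auto it = dict.begin();                        // pick a random entry
        std::advance(it, std::uniform_int_distribution<size_t>(0, dict.size() - 1)(rng));
        uint32_t idle = now - it->second.lru;          // approximate idle time
        if (!found || idle > max_idle) { victim_key = it->first; max_idle = idle; found = true; }
    }
    if (found) dict.erase(victim_key);                 // evict the oldest sampled key
}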

OS PageCache – Double Clock

The kernel maintains two FIFO lists: an active list and an inactive list. Pages start in the inactive list; a reference moves them to the active list. Eviction considers both the activity flag and the reference bit, approximating LRU with lower overhead.

Double Clock page cache diagram
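To make the promotion and demotion flow concrete, here is a highly simplified sketch. The CachedPage struct and the shrink_inactive/shrink_active helpers are invented for illustration; the real kernel logic is considerably more involved.

// Simplified two-list sketch: new pages enter the inactive list; a referenced
// page found there is promoted to the active list, and reclaim only evicts
// cold inactive pages.
#include <list>

struct CachedPage {
    unsigned long pfn;
    bool referenced = false;           // set on access, cleared when scanned
};

std::list<CachedPage> active_list, inactive_list;

void on_access(CachedPage& page) { page.referenced = true; }

// Scan the tail of the inactive list: promote referenced pages, evict the rest.
void shrink_inactive(int nr_to_scan) {
    while (nr_to_scan-- > 0 && !inactive_list.empty()) {
        CachedPage page = inactive_list.back();
        inactive_list.pop_back();
        if (page.referenced) {
            page.referenced = false;
            active_list.push_front(page);      // second reference: promote
        } else {
            /* evict: write back if dirty, then free the page */
        }
    }
}

// Periodically demote cold pages from the active list to keep the lists balanced.
void shrink_active(int nr_to_scan) {
    while (nr_to_scan-- > 0 && !active_list.empty()) {
        CachedPage page = active_list.back();
        active_list.pop_back();
        page.referenced = false;
        inactive_list.push_front(page);        // give it another chance
    }
}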

SsdEngine Cache Design

SsdEngine targets read‑intensive ad‑search workloads with strict SLA requirements. It combines a simplified RCU model with a two‑layer cache to minimise contention and maximise read safety.

Layered Cache Architecture

Two layers are used:

Thread‑local storage (TLS) cache: implemented with a radix tree (Adaptive Radix Tree, ART) for fast per‑thread look‑ups.

Central cache: an array of page slots accessed via lock‑free compare‑and‑swap (CAS). The TLS cache handles the common case; a miss falls back to the central cache (a lookup sketch follows the figure below).

Layered cache workflow
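A minimal sketch of the two‑layer lookup path follows. CentralCache, get_page, and read_page_from_ssd are illustrative names, and a standard hash map stands in for the per‑thread ART; this is a reading of the design described above, not the SsdEngine API.

#include <atomic>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Page { uint64_t key; /* payload omitted */ };

Page* read_page_from_ssd(uint64_t key) { return new Page{key}; }  // stand-in for the real SSD read

struct CentralCache {
    explicit CentralCache(size_t n) : slots(n) {}
    std::vector<std::atomic<Page*>> slots;            // fixed-size array of page slots

    Page* lookup(uint64_t key) {
        Page* p = slots[key % slots.size()].load(std::memory_order_acquire);
        return (p && p->key == key) ? p : nullptr;    // verify the slot holds this key
    }
    void publish(uint64_t key, Page* page) {          // lock-free insert via CAS
        Page* expected = nullptr;                     // (replacement policy omitted)
        slots[key % slots.size()].compare_exchange_strong(
            expected, page, std::memory_order_release);
    }
};

Page* get_page(uint64_t key, CentralCache& central) {
    thread_local std::unordered_map<uint64_t, Page*> tls_cache;  // per-thread cache
    auto it = tls_cache.find(key);
    if (it != tls_cache.end()) return it->second;     // common case: TLS hit, no locking
    if (Page* p = central.lookup(key)) {              // TLS miss: try the central cache
        tls_cache.emplace(key, p);                    // promote into the TLS cache
        return p;
    }
    Page* p = read_page_from_ssd(key);                // global miss: read from SSD
    central.publish(key, p);
    tls_cache.emplace(key, p);
    return p;
}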

Flying Pointer‑Swizzling

When multiple threads request the same page, the first thread allocates the page and marks the entry as “in‑flight” by setting the low‑order bit of the stored pointer. Subsequent threads see the marker and skip duplicate reads. After the page is loaded, the marker is cleared and the page is inserted into the central cache.

fetch_page() {
    page = page_manager->alloc_page();
    // Tag the pointer's low-order bit to mark the entry as in-flight.
    local_table->insert(key, reinterpret_cast<const char*>(
        reinterpret_cast<uintptr_t>(page) | 1));
    return {page, true};
}

fetch_page_done() {
    const char* p = local_table->find(key);
    if (p == nullptr) return 0;                                  // no pending entry
    if (!(reinterpret_cast<uintptr_t>(p) & 1)) return 0;         // not an in-flight marker
    local_table->remove(key);
    // Clear the marker bit to recover the real page pointer.
    page = reinterpret_cast<const char*>(reinterpret_cast<uintptr_t>(p) & ~uintptr_t(1));
    // add the loaded page to the central cache
}

Adaptive Replacement Cache (ARC)

ARC maintains two LRU lists (T1 – recent, T2 – frequent) and two ghost lists (B1, B2). Hits in the ghost lists adjust the adaptive target size p, allowing the cache to react to workload changes without manual tuning.
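As a rough illustration of how ghost hits steer the balance, the adaptation step can be sketched as below. This shows only the adjustment rule, not a full ARC implementation; the function name and parameters are made up for the example.

#include <algorithm>
#include <cstddef>

// A hit in ghost list B1 grows the target size p for the recency list T1;
// a hit in B2 shrinks p in favour of the frequency list T2.
void on_ghost_hit(bool hit_in_b1, size_t b1_size, size_t b2_size,
                  size_t cache_size, size_t& p) {
    if (hit_in_b1) {
        size_t delta = (b1_size >= b2_size) ? 1 : b2_size / b1_size;
        p = std::min(cache_size, p + delta);           // favour recency (grow T1)
    } else {
        size_t delta = (b2_size >= b1_size) ? 1 : b1_size / b2_size;
        p = (p > delta) ? p - delta : 0;               // favour frequency (grow T2)
    }
}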

2Q

2Q uses a FIFO queue for first‑time accesses and an LRU queue for items accessed at least twice, capturing much of ARC's hot/cold separation at lower implementation cost; a simplified sketch follows.
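The sketch below assumes a key‑only, in‑memory model and omits 2Q's A1out ghost queue; the TwoQ struct and its field names are invented for illustration.

#include <algorithm>
#include <deque>
#include <list>
#include <unordered_set>

struct TwoQ {
    size_t fifo_cap = 0, lru_cap = 0;
    std::deque<int> fifo;              // A1: keys seen once, evicted in FIFO order
    std::list<int> lru;                // Am: keys seen at least twice, LRU order
    std::unordered_set<int> in_fifo, in_lru;

    void access(int key) {
        if (in_lru.count(key)) {                               // already hot: refresh
            lru.remove(key);
            lru.push_front(key);
        } else if (in_fifo.count(key)) {                       // second access: promote
            fifo.erase(std::find(fifo.begin(), fifo.end(), key));
            in_fifo.erase(key);
            lru.push_front(key);
            in_lru.insert(key);
            if (lru.size() > lru_cap) { in_lru.erase(lru.back()); lru.pop_back(); }
        } else {                                               // first access: enqueue
            fifo.push_back(key);
            in_fifo.insert(key);
            if (fifo.size() > fifo_cap) { in_fifo.erase(fifo.front()); fifo.pop_front(); }
        }
    }
};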

TinyLFU

TinyLFU adds a small admission filter (Doorkeeper + Count‑Min Sketch) in front of an LRU window. The filter estimates global access frequency, allowing the cache to admit only likely‑hot items while keeping memory overhead low.
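The admission decision can be sketched roughly as below, using a toy Count‑Min Sketch. The sketch dimensions, hashing scheme, and the admit helper are illustrative assumptions rather than the actual TinyLFU implementation, and the Doorkeeper stage is omitted.

#include <algorithm>
#include <array>
#include <cstdint>
#include <functional>

struct CountMinSketch {
    static constexpr int kRows = 4, kCols = 1024;
    std::array<std::array<uint8_t, kCols>, kRows> counts{};

    void add(uint64_t key) {
        for (int r = 0; r < kRows; ++r) {
            size_t c = std::hash<uint64_t>{}(key ^ (r * 0x9E3779B97F4A7C15ULL)) % kCols;
            if (counts[r][c] < 255) ++counts[r][c];      // saturating 8-bit counters
        }
    }
    uint8_t estimate(uint64_t key) const {
        uint8_t m = 255;
        for (int r = 0; r < kRows; ++r) {
            size_t c = std::hash<uint64_t>{}(key ^ (r * 0x9E3779B97F4A7C15ULL)) % kCols;
            m = std::min(m, counts[r][c]);               // minimum over rows bounds the count
        }
        return m;
    }
};

// Admit the window candidate only if it looks hotter than the victim it would replace.
bool admit(const CountMinSketch& sketch, uint64_t candidate, uint64_t victim) {
    return sketch.estimate(candidate) > sketch.estimate(victim);
}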

Evaluation

Environment: Intel Xeon Gold 5118 (2.30 GHz, 16.5 MB L3 cache), Linux 3.10 kernel, RocksDB built with GCC 12 at -O2. SsdEngine is linked into a single‑node RocksDB instance.

Workloads:

10 M keys (~2 pages per key).

Access patterns: Zipfian (s = 0.6 and s = 0.9) and uniform; a simple generator sketch follows this list.

Cache sizes: 4 GB, 6 GB, 8 GB, 10 GB, 12 GB.
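For reference, a Zipfian workload of this shape can be generated along these lines. This is a minimal inverse‑CDF sampler; the ZipfGenerator name and the fixed seed are assumptions for illustration, not the benchmark's actual driver.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

struct ZipfGenerator {
    std::vector<double> cdf;                        // cumulative probability per rank
    std::mt19937_64 rng{42};

    ZipfGenerator(size_t n, double s) : cdf(n) {
        double norm = 0.0;
        for (size_t i = 0; i < n; ++i) norm += 1.0 / std::pow(double(i + 1), s);
        double acc = 0.0;
        for (size_t i = 0; i < n; ++i) {            // build the normalized CDF
            acc += 1.0 / std::pow(double(i + 1), s) / norm;
            cdf[i] = acc;
        }
    }

    size_t next() {                                 // inverse-CDF sampling
        double u = std::uniform_real_distribution<>(0.0, 1.0)(rng);
        size_t idx = std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin();
        return std::min(idx, cdf.size() - 1);       // guard against rounding at the tail
    }
};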

Results

Zipfian s = 0.6: LRU achieves higher hit ratios than DoubleClock, especially with larger caches. Tail latency (the 99.99th percentile) is also lower for LRU.

Zipfian s = 0.9: Same trend; LRU retains a clear advantage in hit rate and tail latency.

Uniform distribution: LRU and DoubleClock perform similarly; the extra complexity of DoubleClock provides no measurable benefit.

TinyLFU vs. LRU: TinyLFU adds a modest hit‑rate improvement across all distributions, most noticeable when the cache is small relative to the key space. It requires a warm‑up period before the filter stabilises.

Key Takeaways

For workloads with strong locality (Zipfian), pure LRU outperforms approximate algorithms.

When access is uniform, simpler policies are sufficient.

Frequency‑aware admission (TinyLFU) can boost performance for limited cache sizes.

Flying pointer‑swizzling and the TLS‑central cache split reduce duplicate I/O and contention in read‑heavy scenarios.

Tags: Caching, performance evaluation, SSD, Storage Systems, eviction algorithms, layered cache
Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
