Databases 11 min read

How to Store Billions of IDs in Redis Without Running Out of Memory

This article examines the challenges of storing massive DMP ID mappings in Redis—including memory fragmentation, expansion, and latency constraints—and presents eviction, bucket‑hashing, and fragmentation‑reduction techniques to achieve efficient, real‑time, large‑scale key‑value storage.

ITFLY8 Architecture Home

May 26, 2021

How to Store Billions of IDs in Redis Without Running Out of Memory

1 Requirement Background

The scenario is DMP cache storage: DMP manages massive third‑party IDs, including media cookies and its own “supperid” mapping, plus demographic tags, mobile IDs (IDFA, IMEI), blacklists, IPs, etc.

Offline storage of hundreds of billions of records on HDFS is easy, but DMP must support millisecond‑level real‑time queries. Because cookies are unstable, many new cookies appear constantly; only timely synchronization of mapping data can hit the demographic tags. Pre‑warming cannot achieve high hit rates, posing a huge caching challenge.

Tests show that storing over 5 billion KV records requires more than 1 TB of memory; high‑availability multi‑replica further increases consumption, and variable‑length KV leads to severe memory fragmentation, demanding a massive‑scale storage solution.

2 What Data to Store

Demographic tags are associated with cookie, IMEI, IDFA and include gender, age group, geo, etc. Mapping relationships are media‑cookie to supperid. Example storage:

PC ID: media‑id‑cookie → supperid and supperid → {age, gender, geo}

Device ID: imei or idfa → {age, gender, geo}

Thus PC data needs two key‑value forms (key→value and key→hashmap), while device data needs only one (key→hashmap).

3 Data Characteristics

Short keys and short values (e.g., 21‑digit numeric supperid, MD5‑hashed IMEI, hyphenated MD5 for IDFA).

Media cookies have variable lengths.

Scale: supperid billions, media mappings tens of billions, mobile IDs tens of billions.

Billions of new mapping relations are generated daily.

Some data can be predicted as hot within a larger time window.

Many new mappings are unpredictable, leading to many fresh cookies.

4 Technical Challenges

Variable key lengths cause memory fragmentation.

Large number of pointers leads to memory expansion (≈7×).

Even with behavior‑based hot‑data prediction, daily new IDs are numerous.

Service must respond within 100 ms over public network (<60 ms latency), so today’s new mappings and tags must reside entirely in memory.

Business requirement to retain data for at least 35 days.

Memory is expensive; storing billions of keys requires a viable solution.

5 Solutions

5.1 Eviction Strategy

Because daily influx of new IDs consumes storage, timely eviction of cold data is essential. Identify hot data and discard cold data. IDs have limited lifetimes; many become invalid. By aggregating logs in HBase, deduplicating IDs, and setting a TTL of 35 days, IDs not seen for 35 days are removed. In Redis, set a 35‑day expiration and renew the TTL on each hit, effectively keeping stable cookies/IDs.

5.2 Reducing Expansion

Hash table size and key count determine collision rate. More keys mean larger hash tables and higher memory usage, compounded by many 64‑bit pointers. To reduce key count, store keys as fixed‑length random hashes (MD5) called BucketId, and keep the actual key‑value pairs inside a hashmap under that bucket.

Process: get(key) → hget(md5(key), key). By allowing many keys to collide in the same BucketId (e.g., 10 keys per bucket), we can cut the number of Redis keys by >90 %.

Standard MD5 yields a 128‑bit hex string, too large for billions of keys. We need ~33 bits (2³³ ≈ 8 billion). By using the full ASCII range (0‑127) instead of hex, we can halve the key length.

public static byte[] getBucketId(byte[] key, Integer bit) {
    MessageDigest mdInst = MessageDigest.getInstance("MD5");
    mdInst.update(key);
    byte[] md = mdInst.digest();
    byte[] r = new byte[(bit-1)/7 + 1]; // 7 usable bits per byte (ASCII)
    int a = (int) Math.pow(2, bit%7) - 2;
    md[r.length-1] = (byte) (md[r.length-1] & a);
    System.arraycopy(md, 0, r, 0, r.length);
    for(int i=0;i<r.length;i++) {
        if(r[i] < 0) r[i] &= 127;
    }
    return r;
}

Choosing 2³⁰ buckets (≈1 billion) allows each bucket to hold about 10 keys, matching the target scale.

5.3 Reducing Fragmentation

Fragmentation arises from misaligned memory and orphaned space after expiration. By storing demographic tags and mapping data with equal‑length keys (BucketId) and using the last six characters of cookie/device ID as hashmap keys, memory alignment improves. Values store only three bytes (age, gender, geo) codes.

Additional low‑cost technique: restart slaves and force a failover to let the master compact memory.

Recommend using tcmalloc (Google) or jemalloc (Facebook) for memory allocation; they can reduce fragmentation when values are small, though libc may be better for very large values.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

data engineering Memory Optimization redis large-scale storage Key-value hashing

Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.