Big Data · 13 min read

Scalable In‑Memory Cache Design for DMP Using Hash Bucketing and MD5‑Based Key Compression

The article presents a large‑scale DMP caching solution that stores billions of third‑party IDs and demographic tags in memory by compressing keys with MD5‑derived bucket IDs, applying eviction policies, reducing memory fragmentation, and demonstrating up to 78% memory savings.

Architecture Digest

1. Requirement Background

The DMP needs to manage massive third-party ID data, including mappings between media cookies and internal IDs (supperid), demographic tags for IDs, blacklist IDs, and IPs. Offline storage on HDFS can handle trillions of records, but the DMP also requires millisecond-level real-time queries, which makes caching a critical challenge.

2. Data to Store

The system stores population tags (age, gender, geo) for cookies, IMEI, and IDFA, plus the mapping from media cookie to supperid. Example key-value formats include:

- Media ID: mediaId-mediaCookie → supperid
- Supperid: supperid → {age, gender, geo}
- Device ID: imei or idfa → {age, gender, geo}

3. Data Characteristics

The keys are short (21-digit supperid, 32-character MD5 strings) but highly variable in length; the volume is massive (hundreds of billions of keys, billions of new mappings daily); and the data must be retained for at least 35 days.

4. Technical Challenges

- Variable key lengths cause memory fragmentation.
- Heavy pointer usage leads to a memory inflation factor of about 7×.
- Frequent generation of new IDs prevents effective pre-warming.
- The service-level latency requirement (<100 ms) forces all fresh mappings to stay in memory.
- Storing tens of billions of keys carries a high memory cost.

5. Solution

5.1 Eviction Strategy

Use a 35-day TTL with Redis key expiration, renewing the TTL on each access, so cold data is discarded while hot IDs stay alive.
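The article relies on Redis's native key expiration for this; purely as an illustrative sketch, the same renew-on-access (sliding-expiration) policy can be written in plain Java (all names here are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of TTL-with-renewal: reading a live key renews its TTL,
// so hot IDs never expire, while cold keys are dropped lazily on access.
public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis; // 35 days in production; configurable here

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        long now = System.currentTimeMillis();
        if (e.expiresAt <= now) {           // cold entry: discard on access
            map.remove(key);
            return null;
        }
        map.put(key, new Entry<>(e.value, now + ttlMillis)); // renew on access
        return e.value;
    }
}
```

In production this is simply `EXPIRE key ttl` re-issued on each read; the sketch only shows the policy, not the deployment.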

5.2 Reducing Inflation

Store keys under a fixed-length bucket ID derived from an MD5 hash. The process is:

1) Compute md5(key) → bucket ID (BucketId).
2) Store key → value inside a hashmap under that bucket.
3) To query, compute the same bucket ID, then fetch the key from that bucket's hashmap.

By allowing multiple keys to collide in the same bucket (e.g., ~10 keys per bucket), the total number of Redis keys can be reduced by over 90%.
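The three-step flow can be sketched with a nested HashMap standing in for Redis hashes (HSET/HGET against a bucket key). Class and method names are illustrative, and the bucket count here is tiny compared with the ~1 billion used in production:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Illustrative bucket store: many original keys share one fixed-length
// bucket ID, and each bucket is a hashmap from original key to value.
public class BucketStore {
    private final Map<Integer, Map<String, String>> buckets = new HashMap<>();
    private final int bucketCount;

    public BucketStore(int bucketCount) { this.bucketCount = bucketCount; }

    // 1) md5(key) -> bucket ID
    private int bucketId(String key) {
        try {
            byte[] md = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // fold the first 4 digest bytes into an int, then wrap into range
            int h = ((md[0] & 0xff) << 24) | ((md[1] & 0xff) << 16)
                  | ((md[2] & 0xff) << 8) | (md[3] & 0xff);
            return Math.floorMod(h, bucketCount);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    // 2) store key -> value inside the hashmap under that bucket (HSET)
    public void put(String key, String value) {
        buckets.computeIfAbsent(bucketId(key), b -> new HashMap<>()).put(key, value);
    }

    // 3) recompute the bucket ID and fetch the key from its hashmap (HGET)
    public String get(String key) {
        Map<String, String> b = buckets.get(bucketId(key));
        return b == null ? null : b.get(key);
    }
}
```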

The implementation uses a custom hash that keeps only the necessary bits (e.g., 30‑31 bits for ~1 billion buckets) and encodes them as ASCII characters to halve the key length.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public static byte[] getBucketId(byte[] key, int bit) throws NoSuchAlgorithmException {
    MessageDigest mdInst = MessageDigest.getInstance("MD5");
    mdInst.update(key);
    byte[] md = mdInst.digest();
    // Only 7 bits per byte can be represented as a single ASCII character,
    // so a bit-wide bucket ID needs (bit-1)/7 + 1 bytes.
    byte[] r = new byte[(bit - 1) / 7 + 1];
    // Mask the last byte so only the requested number of bits survives.
    int a = (int) Math.pow(2, bit % 7) - 2;
    md[r.length - 1] = (byte) (md[r.length - 1] & a);
    System.arraycopy(md, 0, r, 0, r.length);
    // Clear the sign bit of each byte to keep the result in the ASCII range.
    for (int i = 0; i < r.length; i++) {
        if (r[i] < 0) {
            r[i] &= 127;
        }
    }
    return r;
}
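The resulting key width follows directly from the `(bit-1)/7 + 1` formula above: a 30- or 31-bit bucket ID fits in 5 ASCII bytes, versus 32 bytes for a full hex MD5 key (helper name is illustrative):

```java
// Bytes needed for a bucket ID when each byte carries 7 ASCII-usable bits.
public class BucketIdWidth {
    static int widthBytes(int bit) {
        return (bit - 1) / 7 + 1;
    }

    public static void main(String[] args) {
        System.out.println(widthBytes(30)); // 5 bytes for ~1 billion buckets
        System.out.println(widthBytes(31)); // still 5 bytes
    }
}
```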

5.3 Reducing Fragmentation

Making Redis keys fixed-length, and truncating the last six characters of cookies/device IDs before using them as hashmap keys, improves memory alignment and reduces fragmentation. Lower-level tactics help further, such as restarting a slave and failing over to it, and using tcmalloc/jemalloc as the allocator.

6. MD5 Bucket Method Considerations

- Plan the bucket count ahead of time (e.g., 30-31 bits for ~1 billion buckets) to leave room for growth.
- The method suits only very small values (a few bytes each).
- It trades CPU time for memory, which is acceptable at the observed query rate (10⁸-10⁹ ops/day).
- Stored keys cannot be randomly scanned or exported in place; bulk exports must come from cold storage.
- Expiration logic is custom: the TTL is stored in the first 32 bits of the value, and on write, buckets that exceed 15 entries are sampled and their expired entries deleted.
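A minimal sketch of that value layout, assuming the first 32 bits hold an absolute expiry timestamp in epoch seconds (the exact field encoding is not specified in the article, so this layout is an assumption):

```java
import java.nio.ByteBuffer;

// Illustrative value layout: a 4-byte expiry timestamp prefixed to the
// payload, replacing Redis per-key expiration with in-value TTLs.
public class TtlValue {
    // prepend the expiry timestamp (epoch seconds) to the payload
    public static byte[] encode(int expiresAtEpochSec, byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(expiresAtEpochSec)
                .put(payload)
                .array();
    }

    public static int expiresAt(byte[] stored) {
        return ByteBuffer.wrap(stored).getInt();
    }

    public static byte[] payload(byte[] stored) {
        byte[] p = new byte[stored.length - 4];
        System.arraycopy(stored, 4, p, 0, p.length);
        return p;
    }

    // On write, buckets over the size threshold are sampled and entries
    // failing this check are deleted.
    public static boolean expired(byte[] stored, int nowEpochSec) {
        return expiresAt(stored) <= nowEpochSec;
    }
}
```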

7. Test Results

For 10 billion records, memory usage dropped from 2.3 TB (fragmentation ratio ~2) to 500 GB (fragmentation ratio ~1.02), a saving of ~1.8 TB (≈78%). Bucket occupancy follows a binomial distribution; most buckets hold about 4 entries, with a few outliers.
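A scaled-down simulation, with uniform random hashing standing in for MD5, shows the same shape: at an average of 4 keys per bucket, occupancy clusters around the mean with a thin tail (all figures here are illustrative, not the article's production numbers):

```java
import java.util.Random;

// Simulates hashing keys into buckets to show the binomial occupancy
// distribution the test results describe.
public class BucketDistribution {
    // Count how many keys land in each bucket; Random stands in for md5(key).
    public static int[] occupancy(int buckets, int keys, long seed) {
        int[] count = new int[buckets];
        Random rnd = new Random(seed);
        for (int i = 0; i < keys; i++) {
            count[rnd.nextInt(buckets)]++;
        }
        return count;
    }

    public static void main(String[] args) {
        int buckets = 100_000, keys = 400_000; // 4 keys per bucket on average
        int[] count = occupancy(buckets, keys, 42L);
        int[] hist = new int[32];
        for (int c : count) hist[Math.min(c, 31)]++;
        int mode = 0;
        for (int i = 1; i < hist.length; i++) {
            if (hist[i] > hist[mode]) mode = i;
        }
        System.out.println("most common bucket occupancy: " + mode);
    }
}
```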

Conclusion

Using MD5-based bucket IDs compresses keys, aligns memory, distributes load evenly, and eliminates the need for per-key expiration, achieving substantial memory savings while meeting strict latency requirements.

Tags: big data, memory optimization, Redis, MD5, DMP, in-memory cache, hash bucketing
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.