Fundamentals 6 min read

Choosing the Right Hash Algorithm: Thomas Wang vs. MurmurHash

This article explains how to select appropriate hash functions for different key types, compares Thomas Wang's integer mix function with MurmurHash for strings, and details rehash strategies, load factor thresholds, and value storage structures in hash tables.

21CTO
21CTO
21CTO
Choosing the Right Hash Algorithm: Thomas Wang vs. MurmurHash

Hash Algorithm Selection

Different key types should use different hash algorithms: integers, case‑sensitive strings, and generic strings each benefit from a specialized hash function.

Integer Hash – Thomas Wang's Mix Function

The integer hash uses Thomas Wang's 32/64‑bit mix function, which relies on bit‑shift operations to avoid multiplication and improve performance.

unsigned int dictIntHashFunction(unsigned int key) {
    key += ~(key << 15);
    key ^= (key >> 10);
    key += (key << 3);
    key ^= (key >> 6);
    key += ~(key << 11);
    key ^= (key >> 16);
    return key;
}

String Hash – MurmurHash

MurmurHash, created by Austin Appleby in 2008, offers high performance and low collision rates and is used in many open‑source projects such as Hadoop, nginx, and libmemcached. Google later derived CityHash from it.

unsigned int dictGenHashFunction(const void *key, int len) {
    uint32_t seed = dict_hash_function_seed;
    const uint32_t m = 0x5bd1e995;
    const int r = 24;
    uint32_t h = seed ^ len;
    const unsigned char *data = (const unsigned char *)key;
    while(len >= 4) {
        uint32_t k = *(uint32_t*)data;
        k *= m;
        k ^= k >> r;
        k *= m;
        h *= m;
        h ^= k;
        data += 4;
        len -= 4;
    }
    switch(len) {
        case 3: h ^= data[2] << 16;
        case 2: h ^= data[1] << 8;
        case 1: h ^= data[0]; h *= m;
    }
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return (unsigned int)h;
}

What Makes a Good Hash Function?

High performance – the computation must be fast.

Good distribution – even for sequential or patterned inputs, the output should appear random.

For example, MurmurHash of "abc" is 1118836419 while "abd" is 413429783, demonstrating better dispersion than a simple Horner method.

Rehashing

The load factor is current_node_count / bucket_count. When it exceeds 1, collisions become inevitable and entries are linked via chaining.

All hash tables start with 4 buckets; they resize based on load factor changes. Expansion occurs when the load factor > 1 (or > 5 during BGSAVE/BGREWRITEEOF), and shrinking when it falls below 0.01.

During expansion, a new table with a bucket count that is the smallest power of two greater than the current size is created, and all entries are migrated.

Progressive Rehash

Large tables cannot be rehashed in a single step; they are migrated incrementally. While rehashing, lookups first check the old table (hash0) and then the new table (hash1). Inserts go directly into hash1, and the process finishes when hash0 becomes empty.

Value Storage

Key‑value pairs use a union to store values of different types (pointer, 64‑bit unsigned, or signed integer).

// key
void *key;
// value
union {
    void *val;
    uint64_t u64;
    int64_t s64;
} v;
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Data Structureshash functionsRehashMurmurHashThomas Wang
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.