Choosing the Right Hash Algorithm: Thomas Wang vs. MurmurHash
This article explains how to select appropriate hash functions for different key types, compares Thomas Wang's integer mix function with MurmurHash for strings, and details rehash strategies, load factor thresholds, and value storage structures in hash tables.
Hash Algorithm Selection
Different key types should use different hash algorithms: integers, case‑sensitive strings, and generic strings each benefit from a specialized hash function.
Integer Hash – Thomas Wang's Mix Function
The integer hash uses Thomas Wang's 32/64‑bit mix function, which relies on bit‑shift operations to avoid multiplication and improve performance.
unsigned int dictIntHashFunction(unsigned int key) {
key += ~(key << 15);
key ^= (key >> 10);
key += (key << 3);
key ^= (key >> 6);
key += ~(key << 11);
key ^= (key >> 16);
return key;
}String Hash – MurmurHash
MurmurHash, created by Austin Appleby in 2008, offers high performance and low collision rates and is used in many open‑source projects such as Hadoop, nginx, and libmemcached. Google later derived CityHash from it.
unsigned int dictGenHashFunction(const void *key, int len) {
uint32_t seed = dict_hash_function_seed;
const uint32_t m = 0x5bd1e995;
const int r = 24;
uint32_t h = seed ^ len;
const unsigned char *data = (const unsigned char *)key;
while(len >= 4) {
uint32_t k = *(uint32_t*)data;
k *= m;
k ^= k >> r;
k *= m;
h *= m;
h ^= k;
data += 4;
len -= 4;
}
switch(len) {
case 3: h ^= data[2] << 16;
case 2: h ^= data[1] << 8;
case 1: h ^= data[0]; h *= m;
}
h ^= h >> 13;
h *= m;
h ^= h >> 15;
return (unsigned int)h;
}What Makes a Good Hash Function?
High performance – the computation must be fast.
Good distribution – even for sequential or patterned inputs, the output should appear random.
For example, MurmurHash of "abc" is 1118836419 while "abd" is 413429783, demonstrating better dispersion than a simple Horner method.
Rehashing
The load factor is current_node_count / bucket_count. When it exceeds 1, collisions become inevitable and entries are linked via chaining.
All hash tables start with 4 buckets; they resize based on load factor changes. Expansion occurs when the load factor > 1 (or > 5 during BGSAVE/BGREWRITEEOF), and shrinking when it falls below 0.01.
During expansion, a new table with a bucket count that is the smallest power of two greater than the current size is created, and all entries are migrated.
Progressive Rehash
Large tables cannot be rehashed in a single step; they are migrated incrementally. While rehashing, lookups first check the old table (hash0) and then the new table (hash1). Inserts go directly into hash1, and the process finishes when hash0 becomes empty.
Value Storage
Key‑value pairs use a union to store values of different types (pointer, 64‑bit unsigned, or signed integer).
// key
void *key;
// value
union {
void *val;
uint64_t u64;
int64_t s64;
} v;Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
