Hash Table Collisions and Resolution Strategies
The article explains how limited hash spaces cause collisions, then compares open hashing (separate chaining) and closed hashing techniques—linear probing, double hashing, and random hashing—detailing their insertion, search, and deletion algorithms along with the performance trade‑offs and load‑factor analysis for each method.
A hash function maps data from a large domain into a small, fixed range of values. Because the hash space is limited while the amount of data keeps growing, hash collisions are inevitable, and handling them well is a central challenge in designing hash-based systems. This article introduces hash collisions, the common resolution methods, and the advantages and disadvantages of each strategy.
In a hash table, the hash function maps a key to an index in the table. Although the set of possible keys can be huge, the set of hash values is limited to the table size. Hash functions are also used in password systems, message‑digest systems, and digital signatures, where a very low collision probability is required.
Common hash algorithms include:
MD5: 2^128 possible values (approximately 2^64 trials to find a collision, by the birthday bound).
SHA‑1: 2^160 possible values (approximately 2^80 trials to find a collision).
Hash Collisions
Consider a simple example with the hash function H(K) = K mod M (M = 7) and the keys 217, 701, 19, 30, 145.
H(217) = 217 % 7 = 0
H(701) = 701 % 7 = 1
H(19) = 19 % 7 = 5
H(30) = 30 % 7 = 2
H(145) = 145 % 7 = 5
The keys 19 and 145 collide at index 5.
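The mapping above can be reproduced with a few lines of Python (the dictionary-of-lists layout is just for illustration):

```python
# H(K) = K mod M with M = 7, applied to the example keys.
M = 7
keys = [217, 701, 19, 30, 145]

slots = {}
for k in keys:
    indx = k % M
    slots.setdefault(indx, []).append(k)   # group keys by hash value

print(slots)   # any slot holding two keys is a collision
```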
Collision‑Resolution Strategies
Two broad families exist:
Open hashing (separate chaining): collided keys are stored in linked lists outside the main table.
Closed hashing (open addressing): collided keys are stored in alternative slots within the table.
Popular closed‑addressing techniques:
Linear probing
Double hashing
Random hashing
Popular open‑hashing technique:
Separate chaining
1. Linear Probing
Insertion algorithm:
Set indx = H(K) .
If slot indx already contains the key, do nothing.
If slot indx is empty, insert the key.
Otherwise, set indx = (indx + 1) mod M and repeat.
If indx == H(K) , the table is full and must be resized.
Advantages: easy to implement, always finds a slot if one exists, good average performance when the table is not heavily loaded.
Disadvantages: creates primary clustering, which degrades performance as the table becomes dense.
Example with M = 7 and keys 701, 145, 217, 19, 13, 749:
H(701) = 701 % 7 = 1 (inserted at 1)
H(145) = 145 % 7 = 5 (inserted at 5)
H(217) = 217 % 7 = 0 (inserted at 0)
H(19) = 19 % 7 = 5 (collision) → 6 (inserted)
H(13) = 13 % 7 = 6 (collision) → 0 (occupied) → 1 (occupied) → 2 (inserted)
H(749) = 749 % 7 = 0 (collision) → 1 (occupied) → 2 (occupied) → 3 (inserted)
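The insertion procedure can be sketched as follows (a minimal illustration; the EMPTY sentinel and the `insert` helper name are assumptions of this sketch, not part of the original text):

```python
M = 7
EMPTY = None
table = [EMPTY] * M

def insert(table, key):
    """Insert key using linear probing; return the slot used, or -1 if full."""
    indx = start = key % M
    while True:
        if table[indx] == key:        # key already present: do nothing
            return indx
        if table[indx] is EMPTY:      # free slot found
            table[indx] = key
            return indx
        indx = (indx + 1) % M         # probe the next slot
        if indx == start:             # wrapped around: table is full
            return -1

for k in (701, 145, 217, 19, 13, 749):
    insert(table, k)
print(table)   # [217, 701, 13, 749, None, 145, 19]
```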
Retrieval algorithm (linear probing):
Set indx = H(K) .
If slot indx contains the key, return FOUND.
If slot indx is empty, return NOT FOUND.
Otherwise, set indx = (indx + 1) mod M and repeat.
If indx wraps back around to H(K), every slot has been probed and the key is not present.
Deletion is problematic because simply marking a slot empty can break the probe sequence for keys that collided later. The usual remedy is lazy deletion: a removed slot is marked with a "tombstone" that search probes skip over but insertion may reuse.
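The tombstone idea can be illustrated with a short sketch (the DELETED marker and the helper names are assumptions of this sketch):

```python
M = 7
EMPTY, DELETED = None, "<deleted>"    # tombstone marker: an assumed convention
table = [EMPTY] * M

def insert(table, key):
    indx = key % M
    for _ in range(M):
        if table[indx] == key:                 # already present
            return indx
        if table[indx] is EMPTY or table[indx] is DELETED:
            table[indx] = key                  # insertion may reuse a tombstone
            return indx
        indx = (indx + 1) % M
    return -1                                  # table full

def search(table, key):
    indx = key % M
    for _ in range(M):
        if table[indx] == key:
            return indx
        if table[indx] is EMPTY:               # a tombstone does NOT stop the probe
            return -1
        indx = (indx + 1) % M
    return -1

def delete(table, key):
    indx = search(table, key)
    if indx >= 0:
        table[indx] = DELETED                  # keep later probe chains intact
    return indx

for k in (701, 145, 19):
    insert(table, k)
delete(table, 145)
print(search(table, 19))   # 6 -- still found: slot 5 holds a tombstone, not EMPTY
```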
2. Double Hashing
Uses a second hash function H₂(K) to compute the probe offset, avoiding the clustering of linear probing.
For a table of size M (preferably prime), a common choice is H₂(K) = 1 + (⌊K / M⌋ mod (M − 1)), where ⌊K / M⌋ denotes integer division; this yields an offset in the range 1 … M − 1.
Insertion algorithm:
Set indx = H(K) ; offset = H₂(K) .
If slot indx already contains the key, do nothing.
If slot indx is empty, insert the key.
Otherwise, set indx = (indx + offset) mod M and repeat.
If indx == H(K) , the table is full.
Double hashing works best when the table size M is prime: every offset 1 … M − 1 is then coprime to M, so the probe sequence visits all M slots before returning to its starting point.
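A minimal double-hashing sketch using the H₂ defined above (the helper names are assumptions):

```python
M = 7                                   # prime table size, as recommended above
table = [None] * M

def h1(k):
    return k % M

def h2(k):
    return 1 + ((k // M) % (M - 1))     # integer division; offset in 1..M-1

def insert(table, key):
    indx = start = h1(key)
    offset = h2(key)
    while True:
        if table[indx] == key:          # already present
            return indx
        if table[indx] is None:         # free slot found
            table[indx] = key
            return indx
        indx = (indx + offset) % M      # jump by the second hash, not by 1
        if indx == start:               # cycled through all M slots: full
            return -1

for k in (701, 145, 19):
    insert(table, k)
print(table)   # 19 collides with 145 at slot 5, then jumps by h2(19) = 3
```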
Reference: https://blog.csdn.net/chenxuegui1234/article/details/103454285
3. Random Hashing
Similar to double hashing, but the probe sequence is generated by a pseudo‑random number generator seeded with the key.
Insertion algorithm:
Create an RNG seeded with K; set indx = RNG.next() mod M .
If slot indx already contains the key, do nothing.
If slot indx is empty, insert the key.
Otherwise, set indx = RNG.next() mod M and repeat.
If all M positions are probed without success, the table must be resized.
Random hashing is easy to analyze but incurs the overhead of RNG calls, so double hashing is more commonly used in practice.
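A sketch of random-hashing insertion using Python's `random.Random` seeded with the key. The probe bound of 10·M is an assumption of this sketch: a heuristic cutoff rather than an exact fullness test, since random probes are not guaranteed to visit every slot.

```python
import random

M = 7
table = [None] * M

def insert(table, key):
    rng = random.Random(key)      # seeded with the key: reproducible probe sequence
    for _ in range(10 * M):       # heuristic bound on the number of probes
        indx = rng.randrange(M)
        if table[indx] == key:    # key already present
            return indx
        if table[indx] is None:   # free slot found
            table[indx] = key
            return indx
    return -1                     # most likely full: resize the table

for k in (701, 145, 19):
    insert(table, k)
print(sorted(x for x in table if x is not None))
```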
4. Separate Chaining
When inserting a key K:
Set indx = H(K) .
Insert the key into the linked list stored at bucket indx (first search the list to avoid duplicates).
Search and delete operations traverse the same linked list.
Advantages: average performance remains good as the number of entries grows (load factor α can exceed 1); deletion is straightforward.
Disadvantages: extra memory for pointers and poorer cache locality.
Java 7’s HashMap is a classic implementation of separate chaining (since Java 8, long chains are additionally converted to red‑black trees).
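A minimal separate-chaining sketch with one Python list per bucket (the helper names are assumptions):

```python
M = 7
buckets = [[] for _ in range(M)]        # one chain (list) per slot

def insert(buckets, key):
    chain = buckets[key % M]
    if key not in chain:                # search the list first to avoid duplicates
        chain.append(key)

def search(buckets, key):
    return key in buckets[key % M]      # traverse the same chain

def delete(buckets, key):
    chain = buckets[key % M]
    if key in chain:
        chain.remove(key)               # deletion is a plain list removal

for k in (217, 701, 19, 30, 145):
    insert(buckets, k)
print(buckets)   # colliding keys simply share a chain
```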
Separate‑Chaining Analysis
Load factor α = N / M (N = number of stored keys, M = table size). With separate chaining α can exceed 1; the expected chain length is α, so the average search cost is O(1 + α), independent of N as long as α is kept bounded.
Successful search cost (average number of comparisons): 1 + α/2
Unsuccessful search cost (average number of comparisons): α
When α < 1, these costs are less than 1.5 and 1 respectively; even for α > 1 they depend only on α, not on N.
Open Addressing (Closed Hashing) Analysis
Using the same load factor α, the average number of probes for an unsuccessful insertion/search with random hashing is 1/(1 − α). For linear probing, clustering increases the average probe count to approximately
(1/2) · (1 + 1/(1 − α)²)
For successful searches, random hashing yields an average probe count of
(1/α) · ln(1/(1 − α))
Linear probing’s average successful probe count is approximately
(1/2) · (1 + 1/(1 − α))
When α approaches 1, both strategies degrade sharply, but for α ≤ 0.75 the average costs remain small and essentially independent of M.
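These are the standard textbook approximations; evaluating them at a few load factors shows how quickly linear probing degrades relative to random hashing (the function names are assumptions of this sketch):

```python
import math

# Average comparisons / probes as a function of load factor a (the classical
# approximations quoted above).
def chaining_hit(a):  return 1 + a / 2                     # chaining, successful
def chaining_miss(a): return a                             # chaining, unsuccessful
def random_hit(a):    return math.log(1 / (1 - a)) / a     # (1/a) * ln(1/(1-a))
def random_miss(a):   return 1 / (1 - a)
def linear_hit(a):    return 0.5 * (1 + 1 / (1 - a))
def linear_miss(a):   return 0.5 * (1 + 1 / (1 - a) ** 2)

for a in (0.5, 0.75, 0.9):
    print(f"a={a}: linear miss={linear_miss(a):.1f}  random miss={random_miss(a):.1f}")
```

At α = 0.75 an unsuccessful linear-probing search already averages 8.5 probes versus 4 for random hashing, which is why load factors are usually kept at or below roughly 0.75.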
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.