Fundamentals 17 min read

Can Embracing Hash Collisions Boost Performance? Inside B16 Hash Table

This article revisits traditional hash table design, then introduces a novel approach that deliberately leverages a controlled probability of hash collisions combined with SIMD parallelism, presenting the B16 and B16Compact hash tables, their structures, algorithms, and experimental results showing superior speed and space efficiency compared to unordered_map and F14.

Baidu Geek Talk

Feb 22, 2021

Can Embracing Hash Collisions Boost Performance? Inside B16 Hash Table

1 Background

Hash tables are a fundamental data structure with O(1) average lookup time, yet real‑world implementations differ widely in performance. Engineers continuously explore designs that improve both speed and memory usage.

2 Rethinking Hash Collisions

Traditional designs try to avoid collisions, often using perfect hash functions that map a fixed key set without conflict. Perfect hashing, however, is limited to static key sets, incurs construction cost, and adds storage overhead.

3 Leveraging Collisions with SIMD

Modern CPUs provide SIMD (Single Instruction Multiple Data) instruction sets (SSE4.2, AVX, NEON) that can compare many values in parallel. By intentionally allowing a higher collision probability, we can use SIMD to filter multiple candidates simultaneously, reducing per‑key comparison cost.

3.1 SIMD Instructions

SIMD enables a single instruction to operate on vectors of data (e.g., 128‑bit registers can compare sixteen 8‑bit values at once). While widely used in graphics and neural‑network workloads, many general‑purpose applications under‑utilize SIMD for hash lookups.

3.2 F14 Hash Table (Facebook Folly)

F14 stores keys in small blocks and uses SIMD to compare an 8‑bit secondary hash (H2) against up to 16 candidates in parallel. The design deliberately creates more collisions, then resolves them efficiently inside each block.

Two hash codes are computed: H1 selects the block, H2 (8 bits) is used for SIMD filtering.

Each block holds up to 14 elements; the remaining bytes store control information.

Insertion inserts directly if space exists; otherwise it overflows to the next block.

Lookup loads the block header into a SIMD register and compares H2 against all stored H2 values in parallel.

The 8‑bit H2 yields a theoretical collision probability of 1/256, but SIMD parallelism eliminates most per‑key comparisons, allowing the table to tolerate higher collision rates.

4 B16 Hash Table

Inspired by F14, B16 replaces the control bytes with a simple chaining approach, allocating 16 slots per block (hence the name B16). This removes per‑block control overhead and enables a straightforward linked‑list of blocks.

4.1 B16 Data Structure

Each key is hashed to two codes, H1 and H2. H1 determines the bucket, and H2 (8 bits) is stored in the block. A block contains:

16 bytes header: 16 × 8‑bit H2 values.

16 slots for key/value pairs.

Insertion places the key in the first empty slot of the target block; if the block is full, a new block is linked. Lookup computes H1 and H2, loads the block header into a SIMD register, and performs a 16‑way parallel comparison of H2 values. Matching H2 triggers a full key comparison; otherwise the next block is examined.

4.2 B16Compact Read‑Only Hash Table

B16Compact compresses the structure for read‑only use:

All blocks are concatenated into a single array, eliminating per‑block next pointers.

Bucket entries store only the index of the first block for that bucket.

A tail bucket records the index of the final block.

Lookup proceeds similarly, but the contiguous block layout improves cache locality. Theoretical extra storage is 9.23 bits/key for 1 M entries at Load Factor 13, comparable to minimal perfect hashing while avoiding construction failures.

5 Experimental Evaluation

5.1 Setup

Keys and values are uint64_t. Random generators produce input arrays. All tables are pre‑sized to avoid rehashing. Measurements:

Insertion latency: total time for N sequential inserts divided by N (ns/key).

Lookup latency: 200 k successful key lookups + 200 k value lookups (may miss) divided by 400 k (ns/key).

Memory usage: total allocated bytes divided by number of entries (bytes/key).

Compilation flags:

Folly (F14) and B16: -mavx -O2, default Load Factor.

B16: Load Factor set to 13.

std::unordered_map: default Ubuntu version, default Load Factor.

Hardware: 4‑core, 8 GB RAM CentOS 7 VM on an Intel Xeon Gold 6148 (2.40 GHz), containerized on Ubuntu 20.04 Docker.

5.2 Results

Insertion performance shows B16 achieving lower latency than unordered_map while using significantly less memory. For key counts below 1 M, B16 outperforms F14; above 10 M, F14 becomes faster due to better memory locality.

Lookup benchmarks reveal B16 and B16Compact maintaining superior speed despite their smaller memory footprint. B16Compact stores roughly 17.31 bytes/key (≈1.31 bytes extra per key) and offers competitive query latency, making it attractive for memory‑constrained environments such as embedded devices.

6 Conclusion

The study demonstrates that shifting from collision avoidance to controlled collision exploitation, combined with SIMD parallelism, can dramatically improve hash table speed and compress storage. B16 matches or exceeds F14 in many scenarios, and B16Compact approaches the space efficiency of minimal perfect hashing while retaining fast lookups, offering a practical solution for high‑performance, memory‑tight applications.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance C#Data Structures SIMD hash table B16

Written by

Baidu Geek Talk

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.