When std::map Beats std::unordered_map: The Cache‑Level Secret
On paper std::unordered_map wins: O(1) average lookup versus O(log N) for std::map. In practice, runtime is dominated by memory layout and cache behaviour, so at certain data sizes the tree‑based map can be faster simply because it incurs fewer cache misses.
Memory layout of the containers
std::map is implemented as a red‑black tree. Each node is allocated with new and contains the key, value, left/right/parent pointers and a colour bit. On a 64‑bit platform a node occupies roughly 32‑48 bytes, which fits into a 64‑byte cache line, but the nodes are placed at essentially random addresses because each allocation is independent.
std::unordered_map uses a hash table with separate chaining (the implementation used by libstdc++ and libc++). It consists of two parts:
A contiguous bucket array that holds the heads of the chains.
Linked‑list nodes attached to each bucket; each node is also allocated separately.
Thus the bucket array is cache‑friendly, but traversing a chain re‑introduces pointer chasing.
std::map (red‑black tree):
Node@0x1000 → Node@0xa800 → Node@0x3400 → Node@0xf200
// each step jumps to a non‑contiguous address → new cache line
std::unordered_map (hash table):
Bucket[0] Bucket[1] Bucket[2] … Bucket[N] ← contiguous array
    |                   |
Node@0x5100     Node@0x8700 → Node@0xc300 ← linked‑list nodes, non‑contiguous

Cache‑hierarchy latencies
L1 cache ≈ 1 ns (1×)
L2 cache ≈ 4 ns (4×)
L3 cache ≈ 12 ns (12×)
DRAM ≈ 60‑100 ns (60‑100×)
A DRAM access therefore costs the equivalent of 60‑100 L1 loads, making cache misses the dominant factor in lookup latency.
std::map::find()
Finding a key walks a path of about log₂ N nodes: for N = 1 000 that is roughly 10 nodes, for N = 1 000 000 roughly 20. Because each node is allocated independently, a child is almost never on the same cache line as its parent, so in the worst case every level of the walk incurs a cache miss.
If the tree no longer fits in L3, each miss costs roughly 80 ns (DRAM latency). Ten misses therefore take ≈ 800 ns, far exceeding the few nanoseconds spent on the key comparisons. Hardware prefetchers help only for linear or stride‑based accesses and cannot predict the branch‑dependent tree walk.
std::unordered_map::find()
The operation proceeds in two stages:
Compute the hash of the key and take the modulo to obtain a bucket index. The bucket array is contiguous, so accessing a bucket is usually a single cache hit (or at most one miss).
Traverse the linked list stored in that bucket. If there is no collision, the key is found after reading one node. Each additional node adds another pointer chase and a potential cache miss.
In the ideal case (load factor ≈ 1, few collisions) a find() incurs only 1‑2 cache misses: one for the bucket array and one for the first list node.
For N = 1 000 a std::map may cause ~10 cache misses, whereas a std::unordered_map typically causes only 1‑2, making the hash‑based container noticeably faster. The speed advantage comes from fewer cache misses, not from a lower algorithmic step count.
Performance crossover point
The break‑even size depends on three factors:
Key type: hashing a std::string is expensive, pushing the crossover to larger N; hashing an int is cheap.
Cache size: larger L3 caches can hold more tree nodes, extending the range where std::map remains competitive.
Access pattern: temporal locality (repeated lookups of the same keys) keeps hot nodes in the upper cache levels, benefiting the tree.
Rough guidelines for typical modern CPUs:
N below ≈ 50‑100: the whole tree often fits in L1/L2; std::map can be as fast or faster.
N ≈ 100‑1 000: the tree starts spilling out of the upper cache levels; performance depends on key type and CPU.
N above ≈ 1 000: the tree's cache‑miss count exceeds the fixed cost of a hash lookup, so std::unordered_map is almost always faster.
Cache‑friendly alternatives
Sorted std::vector, boost::flat_map or C++23 std::flat_map: all elements reside in a single contiguous array. Binary search provides O(log N) lookups with excellent cache locality. Benchmarks show they often outperform both maps for N up to ~10 000.
Open‑addressing hash tables such as absl::flat_hash_map or Boost.Unordered's flat variants: elements are stored directly in a single array, eliminating pointer chasing. They retain average O(1) lookup time while being much more cache‑friendly.
Practical container selection
Very small data sets (N < 100) with lookup‑intensive code: prefer a sorted std::vector or std::flat_map for contiguous storage and fast binary search.
Small data sets that require ordered iteration: use std::map or std::flat_map; the latter adds cache benefits.
Large data sets (N > 1 000) where lookups dominate: choose an open‑addressing hash table (absl::flat_hash_map or similar) for O(1) average time and contiguous memory.
Large data sets that also need ordering: std::map guarantees ordered iteration with stable O(log N) inserts and lookups; std::flat_map keeps order too, but pays O(N) per insertion.
Uncertain workload: start with std::unordered_map, then profile on the target hardware.
Measuring cache behaviour
Use Linux perf to collect cache‑miss statistics for your actual workload:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    ./your_benchmark

Compare L1-dcache-load-misses and LLC-load-misses between the candidate containers; the one with fewer misses will usually give the best real‑world performance.