The basic idea
A hash table maps keys to values using a hash function. The hash function converts the key into an array index. If two keys hash to the same index (a collision), you need a strategy to handle it.
The two main collision strategies are chaining, where each slot holds a linked list of entries, and open addressing, where the table searches for another slot in the same array. Modern high-performance hash tables tend to use open addressing because the probe sequence stays close in memory.
Open addressing with linear probing
The simplest open addressing scheme. When you get a collision, check the next slot. Then the next. Keep going until you find an empty slot.
struct HashMap<K, V> {
buckets: Vec<Option<(K, V)>>,
len: usize,
}
impl<K: Hash + Eq, V> HashMap<K, V> {
fn insert(&mut self, key: K, value: V) {
let hash = self.hash(&key);
let mut idx = hash % self.buckets.len();
loop {
match &self.buckets[idx] {
None => {
self.buckets[idx] = Some((key, value));
self.len += 1;
return;
}
Some((k, _)) if k == &key => {
self.buckets[idx] = Some((key, value));
return;
}
_ => {
idx = (idx + 1) % self.buckets.len();
}
}
}
}
}Linear probing has a useful property: the probe sequence is sequential in memory. When you check slot N and it is occupied, slot N+1 is likely already in the same cache line. Chaining usually turns the same lookup into pointer chasing.
The downside: clustering. Once a run of occupied slots forms, it attracts more entries (they all probe into the same cluster). This makes lookup times degrade as the table fills up.
Load factor
The load factor is entries / capacity. As it increases, collisions become more frequent and performance degrades.
| Load factor | Avg probes (linear) | Avg probes (Robin Hood) |
|---|---|---|
| 50% | 1.5 | 1.5 |
| 70% | 2.2 | 1.8 |
| 80% | 3.0 | 2.0 |
| 90% | 5.5 | 2.6 |
| 95% | 10.5 | 3.2 |
Most hash tables resize when the load factor exceeds 70-80%. Resizing allocates a larger array and re-inserts every entry. The resize itself is O(n). Spread across the insertions that led to it, the average insertion cost remains O(1).
Robin Hood hashing
Robin Hood hashing is a refinement of linear probing. During insertion, if the new entry has traveled farther from its ideal slot than the entry currently occupying the slot, they swap places. The goal is to keep probe distances more even.
fn insert(&mut self, key: K, value: V) {
let hash = self.hash(&key);
let mut idx = hash % self.capacity;
let mut entry = (key, value);
let mut dist = 0;
loop {
match &self.buckets[idx] {
None => {
self.buckets[idx] = Some(entry);
return;
}
Some(existing) => {
let existing_dist = self.probe_distance(idx, existing);
if dist > existing_dist {
let displaced = self.buckets[idx].take().unwrap();
self.buckets[idx] = Some(entry);
entry = displaced;
dist = existing_dist;
}
}
}
idx = (idx + 1) % self.capacity;
dist += 1;
}
}This keeps the maximum probe distance low. In a standard linear probing table, one unlucky entry might be 50 slots away from its ideal position. With Robin Hood hashing, long-distance entries move closer and the overall variance drops.
Rust's standard HashMap used Robin Hood hashing until it switched to a Swiss Table implementation in 2019.
Swiss Tables (hashbrown)
Google's Swiss Table design, used in absl::flat_hash_map and Rust's hashbrown/std::HashMap, groups slots into batches of 16 and uses SIMD instructions to check the group at once.
Each group has a 16-byte metadata array where each byte stores the top 7 bits of the hash (called H2) for the entry in that slot, or a special marker for empty/deleted slots.
// Simplified Swiss Table probe
Group group = load_group(ctrl + bucket);
BitMask match = group.match(h2); // SIMD comparison
for (int i : match) {
if (keys[bucket + i] == key) {
return &values[bucket + i];
}
}
if (group.has_empty()) {
return nullptr; // key not in table
}
// continue to next groupThe group.match(h2) operation uses SSE2 _mm_cmpeq_epi8 to compare the H2 byte against all 16 control bytes in a single instruction. This means checking 16 slots costs about the same as checking one slot in a traditional hash table.
Why this matters in practice
For small hash tables (under 100 entries), the choice of hash table implementation rarely matters. The total memory fits in L1 cache regardless, and any reasonable implementation will be fast.
For large tables (millions of entries) or high-throughput scenarios (millions of lookups per second), the implementation matters a lot:
- Cache-friendly layout (open addressing) beats pointer-chasing (chaining) by 2-5x
- SIMD-accelerated probing (Swiss Tables) beats scalar probing by 1.5-3x on modern CPUs
- A good hash function (wyhash, ahash) can be 5x faster than a mediocre one (SipHash for non-adversarial inputs)
Rust's HashMap already gives you a Swiss Table with a solid default hash function. In C++, absl::flat_hash_map is often a better starting point than std::unordered_map.
