Hacker News

Something similar happens in my company too. In one particular place, we use the hash of a string as the key in a hashmap instead of the string itself, because the hash is smaller and easier to compare once the initial map has been built. It is only a 64-bit hash, too. I have been crying about this every time it comes up, and the response is always that a collision will never happen. My problem is that if it ever does happen, we will never know.
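The pattern being described can be sketched in Python; the SHA-256 truncation and the names here are illustrative, not the actual code from that codebase:

```python
import hashlib

def key64(s: str) -> int:
    """Stand-in for the string key: a SHA-256 digest truncated to 64 bits."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

index = {}
for name in ["alice", "bob", "carol"]:
    # A colliding pair of strings would silently overwrite each other here,
    # and nothing in the program would ever notice.
    index[key64(name)] = name

assert index[key64("bob")] == "bob"
```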


Isn't a hash map supposed to already be internally hashing the string for you and correctly handling collisions?


Yikes.

It's valid to assume "it will never happen" for 128 bits or more (if the hash function isn't broken), since the chance of a random collision is astronomically small, but a collision in 64 bits is within the realm of possibility (roughly a 50% chance of a duplicate once you have on the order of 2^32 items).
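Those odds can be checked with the standard birthday-bound approximation, 1 - exp(-n^2 / 2^(b+1)) for n items and b-bit hashes:

```python
import math

def collision_prob(n: int, bits: int) -> float:
    """Approximate probability of at least one collision among n
    uniformly random `bits`-bit hashes (birthday approximation)."""
    return 1.0 - math.exp(-(n * n) / (2.0 ** (bits + 1)))

# At exactly 2^32 items with a 64-bit hash this gives about 39%;
# the 50% mark lands near 1.18 * 2^32 items.
p = collision_prob(2**32, 64)
```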


> valid, 128 bits

The birthday paradox is a thing. If you have 128 bits of entropy, you expect to hit the 50% collision mark after roughly 2^64 items, not 2^128. 2^64 is a lot, but in my current $WORK project, if I only had 128 bits of entropy, the chance of failure in any given year would be 0.16%. That's not a lot, but it isn't negligible either.

Bigger companies care more. Google has a paper floating around about how "64 bits isn't as big as it used to be" or something to that effect, complaining about how they're running out of 64-bit keys and can't blindly use 128-bit random keys to prevent duplication.

> bits of entropy

Consumer-grade hash functions are often the wrong place to look for best-case collision odds. Take, e.g., Python's default integer hash, which maps each integer to itself (modulo 2^61 - 1 on 64-bit CPython). The collision chance for truly random data is sufficiently low, but every big dictionary I've seen in a real-world Python project has had a few collisions. Other languages make similar tradeoffs (almost nobody uses cryptographic hashes by default, since they're too slow). I wouldn't, by default, trust a generic 1-million-bit hash not to have collisions in a program of any size and complexity. 128 bits, even with execution counts low enough to otherwise make sense, is also unlikely to pan out in the real world.
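CPython's integer hashing really is patterned rather than random, which makes collisions easy to construct; the modulus is exposed via `sys.hash_info` (the Mersenne prime 2^61 - 1 on 64-bit builds):

```python
import sys

# CPython hashes small ints to themselves, and larger ints modulo
# sys.hash_info.modulus, so collisions follow a trivial pattern.
m = sys.hash_info.modulus
assert hash(1) == 1
assert hash(m + 1) == hash(1)  # two distinct ints, identical hashes
```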


I agree that 128 bits is on the lower end of "never", but you still need to store trillions of hashes to have even a one-in-a-trillion chance of seeing a collision (and that's already the overall probability; you don't multiply it by the number of inserts :) I don't think anybody in the world has ever seen a collision of a cryptographically strong 128-bit hash that wasn't a bug or an attack.

The birthday paradox applies when you store the items together (it's the chance of a collision against any existing item in the set), so overall annual hashing churn isn't what matters: hashing many items against a small stored set doesn't increase your collision probability nearly as quickly.


Based on currently available public estimates, Google stores around 2^75 bytes, most of that backed by a small number of very general-purpose object stores. A lot of that is from larger files, but you're still approaching birthday-paradox numbers for in-the-wild 128-bit hash collisions.


Hash tables have collisions because they don't use all the bits of the hash; they compute index = hash % capacity. It doesn't matter how you calculate the hash: if there are only a few slots to put items in, they will collide.
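A minimal chained table makes the point: with index = hash % capacity, distinct hashes land in the same bucket whenever they agree modulo the capacity (toy example, not a real implementation):

```python
capacity = 8
table = [[] for _ in range(capacity)]

def insert(key, value):
    # Only the low bits of the hash pick the bucket.
    table[hash(key) % capacity].append((key, value))

insert(1, "a")
insert(9, "b")  # hash(1) % 8 == hash(9) % 8 == 1, so both share bucket 1
```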


Right, but the problem they were describing was storing the "hash" itself as the key in a hash table, not storing the item under its key. For that, the hash absolutely matters, and IMO even a 128-bit hash wouldn't be good enough here, because the hash function itself is likely weak.


This makes no sense - it’s a hash map because it hashes things for you…


A hash map still stores the entire key, which may be undesirable if the key datum can be large.
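For example, a Python dict retains a reference to every full key, while the hash-as-key workaround keeps only the digest; the 64-bit SHA-256 truncation below is just for illustration:

```python
import hashlib

doc = "some very large key string " * 1000

d_full = {doc: 1}  # the dict keeps a reference to the whole string
d_small = {int.from_bytes(hashlib.sha256(doc.encode()).digest()[:8], "big"): 1}

assert next(iter(d_full)) == doc             # original key is recoverable
assert isinstance(next(iter(d_small)), int)  # only the 64-bit digest remains
```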



