Bloom Filters by Example

1. verytrivial ◴[29 Jun 25 15:26 UTC] No.44413866[source]▶

The overlap in concepts required to understand Bloom filters, sets and hash tables is about 95% IMHO. A set is a hash table used for membership tests where you only care about the key, not the value. And a Bloom filter is just a set that exploits the fact that many-to-one hashing 'compresses' the key-space with collisions. It deliberately uses a very collide-y hash function. If a specific key was ever hashed, you WILL get a hit, but there might be other keys that produced the same hash. It's a feature, not a bug.

replies(3): >>44414062 #>>44414176 #>>44416360 #

2. marginalia_nu ◴[29 Jun 25 15:52 UTC] No.44414062[source]▶

>>44413866 (TP) #

If you've grokked bloom filters, you're very close to also understanding both random projection and certain implementations of locality-sensitive hashes.

3. cherrycherry98 ◴[29 Jun 25 16:09 UTC] No.44414176[source]▶

>>44413866 (TP) #

Glad to know I'm not alone in my mental modeling of Bloom filters as just hash tables that only track the buckets which have data but not the actual data itself.

4. cortesoft ◴[29 Jun 25 20:57 UTC] No.44416360[source]▶

>>44413866 (TP) #

I think the main bit your explanation is missing is how a bloom filter uses multiple hash functions to reduce collisions. For example, a bloom filter might have 3 hashes, and all of them have to hit for a key to be known to be in the set. This reduces the likelihood of false positive collisions while keeping the no false negatives guarantee.