The Beautiful Math of Bloom Filters

less_less 2 days ago

When the data is read-only, sparse linear filters get up to an O(ln 2)-factor smaller storage space at the cost of slower construction and inability to add items on the fly. These include xor-sat filters, xor/xor+ filters, smashed/bumped ribbon filters, frayed ribbon filters, binary fuse filters, probably a few other options.

The basic idea is that instead of table[hash1(x)] & table[hash2(x)] & ..., you calculate table[hash1(x)] ^ table[hash2(x)] ^ ... basically substituting XOR instead of AND. To construct the table, you need to solve a big system of linear equations. The various filter types change the parameters (mostly load factor of the filter) and indexing function (turning the hash into something where the bits you look up have correlated positions) in order to make structured equations that are easier to solve.

markcjeffrey 2 days ago

In addition to set membership queries, Bloom filters also support set intersection. The structure of the filter plays an important role in whether the false positive rate blows up: https://www.eecg.utoronto.ca/~mcj/papers/2011.bf-intersectio...

fmajid 2 days ago

Those are very basic math. If you want serious graduate level math, look at Flajolet’s use of the Mellon transform in HyperLogLog proofs.

redbell 2 days ago

I don't know why, but every time I see "Bloom Filters", this hits my mind: https://news.ycombinator.com/item?id=32186203

taeric 2 days ago

I have a small idea of a rant on how we seem to purposely make naming things hard. A common way we do that, in computers, is to drop the possessive. I wouldn't go into Bloom filters expecting something to justify the verb/action "bloom" being part of the name, if we always referred to it as Bloom's filter. (Similar for Shell's sort...)

kibwen 2 days ago

The practice of naming things after their inventors, while well-intentioned, introduces so much friction that it ought to just be abolished. Imagine how worse things would be if a directed acyclic graph were instead a "Euler path", or if a hash table were instead a "Luhn collection". The former terms may be jargon, but at least they're consistent, identifiable, and reasonably self-describing. I'm already frustrated that we're stuck with "boolean" as the most fundamental of all data types.
- swatcoder 2 days ago
  
  In reality, those kinds of names are generally not intended as some kind of trophy in the first place.
  It's more organic. They just grow out of colleagues, speaking amongst themselves, referring to a fellow colleague's particular idea or elaboration. They all share a common base understanding of their field, many know of the colleague directly, and all know how to look something up if they know the colleagues name and gist of the idea. And so that's how they refer to it.
  Once in a while, these so-named insights prove really important or lasting -- after the fact -- and the name continues to stick because it's the one everybody was using. Meanwhile, most of the time, the insights just kind of fade back into the baseline body of knowledge and either don't break out at all or evolve through some collaborative work that earns a more formal name.
- __MatrixMan__ 2 days ago
  
  I had a math teacher who warned us:
  > If you don't give your creations good names, they might name them after you.
  From his tone it was clear that this was something to be avoided. I don't know whether too late to retcon existing names but let's try to do better going forward.
- harperlee 2 days ago
  
  On the other hand, trying too hard to shoehorn semantic descriptions on names ends up with pathological cases (yes, chemistry, I’m looking at you!).
  Jokes apart, words are symbols that even if they have some semantics through etymology, in general they are quite arbitrary. I’d rather go with outlandish names that help mnemonics, if I were to choose. Names from people can serve that purpose; I still remember what a Kohonen map is, back from Uni, because of the childish resemblance with “cohone” (Andalusian for cojones), and a silly joke from a close friend.
  - __MatrixMan__ 2 days ago
    
    There's no winning with chemistry. It's either 2-ethyl-cis-alpha-nonsenium, or it's "the sonic hedgehog domain".
    It's like you get a choice between math hell or cartoon hell.
  - threatofrain 2 days ago
    
    But if we choose words that are famous names, they are far less likely to be systematically used as a building block. And woe be to us if that individual invents too many things, because then the meaning of their name will be too ambiguous.
- idle_zealot 2 days ago
  
  > I'm already frustrated that we're stuck with "boolean" as the most fundamental of all data types.
  We do have another name for it. "Bit." You could probably roll out a new programming language today that uses something like `let shouldUpdate: bit = true;` or without blowing too much of your novelty budget. Or `u1`, if you wanted to allow arbitrary integer sizes.
  - xen0 21 hours ago
    
    I think 'bit' risks confusing storage with the type (a Boolean is often stored using at least a byte, not a single bit).
- taeric 2 days ago
  
  My favorite is when we do have two words for the same thing. People out there really think there is a major distinction between stochastic and random.
  Or, worse, when people let the names of things keep them from learning them. Imaginary numbers being high on that list.
  - hinkley 2 days ago
    
    Imaginary numbers sound more approachable than complex numbers.
    
    jazzyjackson 2 days ago
    
    A personal matter I suppose, of how comfortable you are with imagining things :^)
    The real/imaginary dichotomy is the confusing part, having a rotational component doesn't make them any less real. O that reminds me, I'll just drop this link for my favorite video lecture on the matter full of visualizations, 13 parts, "Imaginary numbers are real" https://youtube.com/playlist?list=PLiaHhY2iBX9g6KIvZ_703G3KJ...
    If I had it my way the distinction would be straight and twisted.
TacticalCoder 2 days ago

I think two of the absolute worst names of them all are "dynamic programming" and "memoization". Don't get me wrong: I'll write you a 0-1 knapsack using DP in a hurry without any problem but it's just crazy that it's called "dynamic programming".
I don't care about the justification for the term "dynamic programming" nor for the term "memoization". They are just plain wrong.
Honestly compared to these two, "Bloom filter" sounds reasonable.
Heck, if "dynamic programming" had been called "Bellman programming" and "memoization" had been called "Mitchie memo" (by the name of their respective inventors), it would be less confusing.
So maybe, after all, that "Bloom filter" isn't that bad. Had we let the author pick a name, maybe he'd have picked "ephemeral spectrum" or something like that.
Yeah. I think naming after the name of the inventor is far from the worse actually. And it kinda gives credit where it's due too.
- taeric 2 days ago
  
  To be clear, my complaint here is that we stopped calling it Bloom's filter, and shortened that to Bloom Filter. At least Hoare's sort was shortened to "quick sort". :D
  We can leave it as named after someone, but that is a lot easier to understand if you keep the possessive. Plank's constant is a good one, in that vein.
- antonjs 2 days ago
  
  The story of why it is called dynamic programming is pretty great: https://conversableeconomist.com/2022/08/24/why-is-it-called...