
I recently discovered "compact approximators", which can be seen as a generalisation of Bloom filters. Instead of telling you "element X is probably in the set", they tell you "a lower bound on the value associated with key X".

If there are no collisions, you get the true value. The data structure also doesn't "fill up" the way a Bloom filter does — the older data just gets probabilistically worse, so in some use cases you can keep using the one structure continuously.

My particular use case (which led me to "invent" the structure and then discover that of course it's already been described in the literature, haha) is implicit cache invalidation. The approximator structure(s) store keys like "latest time member X was removed from group Y". I can then use that to lazily decide which cached values I can still use, instead of doing a potentially expensive search to see what I need to invalidate at the time the group member was removed.
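In case a concrete sketch helps: here is a minimal Python version of the idea as I understand it, for the cache-invalidation case. All names and parameters are illustrative, not from any particular paper. Update takes a max into each hashed cell and query takes the min over them, so the answer is always at or after the true removal time; the dual (min on update, max on query) gives the lower-bound form.

```python
import hashlib

class ApproxUpperBound:
    """Bloom-filter-like map from keys to timestamps. query() returns a value
    >= the largest value ever recorded for the key; collisions can only make
    the answer larger (later), never smaller. Illustrative sketch only."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _idx(self, row, key):
        h = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.width

    def record(self, key, value):
        # cell = max(cell, value): cells only ever grow
        for r in range(self.depth):
            j = self._idx(r, key)
            if value > self.rows[r][j]:
                self.rows[r][j] = value

    def query(self, key):
        # min over the key's cells: the tightest available upper bound
        return min(self.rows[r][self._idx(r, key)] for r in range(self.depth))
```

A cached value for (member X, group Y) written at time t can then be reused iff t >= query((X, Y)): collisions cause at most some unnecessary invalidation, never a stale read.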

I'm new to using them, so I keep getting excited about new use cases — much like when I was first introduced to Bloom filters.



This is a bit similar to a count-min sketch, which can be used to estimate frequencies in a stream.
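For comparison, a toy count-min sketch (widths and depths here are illustrative). The estimate is an upper bound on the true count, since collisions only inflate counters:

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: estimate(key) >= true count of key, because
    hash collisions can only add to a counter, never subtract."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _idx(self, row, key):
        h = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.width

    def add(self, key, n=1):
        for r in range(self.depth):
            self.rows[r][self._idx(r, key)] += n

    def estimate(self, key):
        # min over rows: the least-contaminated counter wins
        return min(self.rows[r][self._idx(r, key)] for r in range(self.depth))
```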

Also interesting are "Invertible Bloom Lookup Tables" (IBLT), https://arxiv.org/pdf/1101.2245 , which support adding and removing data, and can list the remaining entries as long as "little" data remains. That means they can be used for error correction: add all the _correct_ data (let's say 1 GB of correct data) into a 1 MB IBLT. Say a downloader finds that the checksum doesn't match: they can download that 1 MB IBLT and subtract their (invalid) download from it. What remains is the delta: the error-correction data. I know there are other ways you could do error correction, but this is very fast, and quite interesting technically.
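For the curious, a toy IBLT in Python (parameters are illustrative, integer keys only, and the k hash positions are placed in disjoint subtables for simplicity; this is not the paper's exact construction). Each key is XORed into one cell per subtable along with a checksum; deletion decrements instead of incrementing, and listing peels "pure" cells until nothing is left:

```python
import hashlib

class IBLT:
    """Toy Invertible Bloom Lookup Table over nonzero integer keys."""

    def __init__(self, m=120, k=3):
        assert m % k == 0
        self.m, self.k = m, k
        self.sub = m // k          # k disjoint subtables -> distinct cells per key
        self.count = [0] * m
        self.key_sum = [0] * m
        self.chk_sum = [0] * m

    def _h(self, tag, key):
        d = hashlib.sha256(f"{tag}:{key}".encode()).digest()
        return int.from_bytes(d[:4], "big")

    def _cells(self, key):
        return [i * self.sub + self._h(i, key) % self.sub for i in range(self.k)]

    def _chk(self, key):
        return self._h("chk", key)

    def _apply(self, key, sign):
        c = self._chk(key)
        for j in self._cells(key):
            self.count[j] += sign
            self.key_sum[j] ^= key
            self.chk_sum[j] ^= c

    def insert(self, key):
        self._apply(key, +1)

    def delete(self, key):
        self._apply(key, -1)

    def list_entries(self):
        """Peel 'pure' cells (count +-1 with a matching checksum) until stuck.
        Returns (key, sign) pairs; succeeds w.h.p. if few entries remain."""
        out = []
        progress = True
        while progress:
            progress = False
            for j in range(self.m):
                if self.count[j] in (1, -1) and \
                        self.chk_sum[j] == self._chk(self.key_sum[j]):
                    key, sign = self.key_sum[j], self.count[j]
                    out.append((key, sign))
                    self._apply(key, -sign)   # remove it and keep peeling
                    progress = True
        return out
```

Inserting the correct set and deleting the received set leaves exactly the symmetric difference, which is what the download example above exploits.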


IBLT operates over sets, though, and your downloading example is not over a set (the data has a position). You can make it work over positional data by adding a position number to every encoding unit, but that's pure overhead. Other codes are more efficient.

IBLT is also particularly bandwidth-inefficient when the amount of correction data is not very large. Ordinary codes can recover a missing block (known error location) with just the amount of data missing, or with twice the amount if the location(s) are unknown.

There are codes with quasi-linear performance, though for your example application the number of errors is few, so it shouldn't really matter if the code is cubic in the number of errors to decode (which is about the worst any should be).


Yes, I know it is not designed to be an error-correction code, and other codes (turbo codes, fountain codes) are a lot more efficient. But I wanted to mention it because it's related to the topic.

> You can make it work over data with positions by adding a position number

Yes, that's what I did in my experiments. I personally found it really simple to implement; I'm not sure how easy it is to implement a "real" error correction code.


Fountain code is decoded by essentially the same mechanism, it's called peeling. It's quite fun to implement. Though a plain fountain code has similar bandwidth inefficiencies... also the implementations that exist only correct erasures (though you can turn errors into erasures using a checksum/mac).

You can see the fountain code as a very sparse linear system that is usually solvable through a fairly simple greedy algorithm. You can augment the sparse linear system with a dense one like an RS code, which is solved by Gaussian elimination, which is also fun if less braindead simple.
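A toy peeling decoder, in case anyone wants to see how simple the greedy part is. This is a hand-rolled sketch, not any library's API: each symbol is (set of source indices, XOR of those source blocks), and a real LT code would draw the symbol degrees from a robust soliton distribution rather than by hand.

```python
def lt_peel(symbols, k):
    """Greedy peeling decoder for XOR-based (LT/fountain-style) symbols.
    Repeatedly back-substitutes recovered blocks out of every symbol; any
    symbol left covering a single unknown block reveals that block directly."""
    recovered = {}
    pending = [[set(idxs), val] for idxs, val in symbols]
    progress = True
    while progress and len(recovered) < k:
        progress = False
        for sym in pending:
            idxs, val = sym[0], sym[1]
            for i in list(idxs):
                if i in recovered:          # substitute known blocks out
                    val ^= recovered[i]
                    idxs.discard(i)
            sym[1] = val
            if len(idxs) == 1:              # degree-1 symbol: block revealed
                (i,) = idxs
                recovered[i] = val
                idxs.clear()
                progress = True
    return recovered
```

For example, with source blocks a, b, c, the symbols {0}: a, {0,1}: a^b, {1,2}: b^c peel out in one pass: a directly, then b, then c.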

This GitHub user has a whole host of competently implemented, very high performance erasure codes:

https://github.com/catid/wirehair
https://github.com/catid/leopard
https://github.com/catid/fecal
https://github.com/catid/siamese

As far as direct error codes go: there is a competent BCH code implementation in the Linux kernel, but it only works over small fields, so it doesn't support many independent blocks/packets.

Then there is https://github.com/bitcoin-core/minisketch , which I am a coauthor of: a set corrector like IBLT that achieves perfect communication efficiency (at the cost of cubic decode performance instead of linear, though with good constant factors). I mention it mostly because it contains fast implementations of the relevant and tricky number-theory algorithms needed for a fast RS error-correcting code over big fields (any size up to 2^64 at least)... but I'm not actually aware of a performant large-field RS error-correcting implementation. (IIRC there is an RS error corrector in GNU Radio, but again small-field.)

(Mostly the small-field stuff doesn't work for bigger fields because it uses a Chien search to find the roots of a polynomial, which is linear in the field size. Minisketch finds roots by factoring, in log(|field|) time. Root finding is needed in algebraic error codes because it's how you find where the errors are. IIRC the Linux kernel BCH implementation uses factoring like we do, but is constructed to use a log table to turn multiplications into xors.)


That's an interesting reference. I came up with what may be a simpler case of that: a structure that estimates the last time a key was seen. It's an upper bound, as collisions make the time more recent, but never the other way. Instead of a partial order it's a total order, but it may be similar to the compact approximator.

Rate limiters apparently use count-min over a fixed time interval, which is bursty, but I took the same hash structure and came up with "timestamp-min" that allows even pacing (also without "fill up"). The one-sided error is also useful for checking if a cache entry is too old.

It doesn't fix everything (e.g. DDoS), but it can prevent any single client from over-requesting, or from stealing another client's requests (collisions can be made as hard as necessary).

https://github.com/KWillets/RecencySketch
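A minimal sketch of that timestamp-min idea (my own illustrative reconstruction, not necessarily what the repo does): update writes max(cell, now) and query takes min over cells, so a key's reported last-allowed time can only err later. A collision can over-throttle, never under-throttle.

```python
import hashlib

class TimestampMin:
    """Pacing rate limiter over a count-min-style hash grid. The reported
    last-allowed time is an upper bound on the truth, so the one-sided error
    is on the safe (throttling) side. Parameters are illustrative."""

    def __init__(self, width=1024, depth=4, min_interval=1.0):
        self.width, self.depth = width, depth
        self.min_interval = min_interval
        self.rows = [[0.0] * width for _ in range(depth)]

    def _idx(self, row, key):
        h = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.width

    def _last_allowed(self, key):
        # min over cells: tightest upper bound on the key's last allowed time
        return min(self.rows[r][self._idx(r, key)] for r in range(self.depth))

    def allow(self, key, now):
        """Admit the request iff the key's pacing interval has elapsed.
        Pass e.g. time.time() as `now`."""
        if now - self._last_allowed(key) < self.min_interval:
            return False
        for r in range(self.depth):
            j = self._idx(r, key)
            if now > self.rows[r][j]:
                self.rows[r][j] = now
        return True
```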


While there are real use cases for structures like these, if you are looking at Bloom filters, make sure the semantics actually fit your need.

First-order logic with a least fixed point operator, FO(LFP), captures P. Since P = co-P but we think NP != co-NP, Bloom filters often have value as a small and fast path to co-NP-style questions.

In that case the collisions don't matter, because excluding membership is often more valuable than proving membership, and you have to choose one.

That is why Outlook used one to cut down address-completion lookups: even if a hit requires the expensive call to the server, the set of email addresses not in your address book is far larger.

But it is all horses for courses.

If your need does fit in P you have a lot of options, while the options for co-NP are far more rarefied.
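For concreteness, a minimal Bloom filter where the negative answer is the definitive one (sizes here are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: False means 'definitely not in the set';
    True only means 'maybe'. The cheap, exact path is the negative one."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _idx(self, i, key):
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.m

    def add(self, key):
        for i in range(self.k):
            self.bits[self._idx(i, key)] = 1

    def might_contain(self, key):
        # all bits set: maybe present; any bit clear: certainly absent
        return all(self.bits[self._idx(i, key)] for i in range(self.k))
```

In the Outlook-style use, a False here skips the server call entirely; only the (rarer) True needs the expensive lookup.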


Can you provide a link to a paper or reference?



Ooh, this seems like it could be useful for collision avoidance (where you need to know a lower bound on your time to impact).



