
I recently discovered "compact approximators", which can be seen as a generalisation of Bloom filters. Instead of telling you "element X is probably in the set", they tell you "a lower bound on the value associated with key X".

If there are no collisions, you get the true value. The data structure also doesn't "fill up" the way a Bloom filter does — the older data just gets probabilistically worse, so in some use cases you can keep using the one structure continuously.

My particular use case (which led me to "invent" the structure and then discover that of course it's already been described in the literature, haha) is implicit cache invalidation. The approximator structure(s) store keys like "latest time member X was removed from group Y". I can then use that to lazily decide which cached values I can still use, instead of doing a potentially expensive search to see what I need to invalidate at the time the group member was removed.
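In case a concrete sketch helps: here is a minimal Python version of the idea as I understand it, for the cache-invalidation case. All names and parameters are illustrative, not from any particular paper. Update takes a max into each hashed cell and query takes the min over them, so the answer is always at or after the true removal time; the dual (min on update, max on query) gives the lower-bound form.

```python
import hashlib

class ApproxUpperBound:
    """Bloom-filter-like map from keys to timestamps. query() returns a value
    >= the largest value ever recorded for the key; collisions can only make
    the answer larger (later), never smaller. Illustrative sketch only."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _idx(self, row, key):
        h = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.width

    def record(self, key, value):
        # cell = max(cell, value): cells only ever grow
        for r in range(self.depth):
            j = self._idx(r, key)
            if value > self.rows[r][j]:
                self.rows[r][j] = value

    def query(self, key):
        # min over the key's cells: the tightest available upper bound
        return min(self.rows[r][self._idx(r, key)] for r in range(self.depth))
```

A cached value for (member X, group Y) written at time t can then be reused iff t >= query((X, Y)): collisions cause at most some unnecessary invalidation, never a stale read.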

I'm new to using them, so I keep getting excited about new use cases — much like when I was first introduced to Bloom filters.



This is a bit similar to a count-min sketch, which can be used to estimate frequencies in a stream.
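For comparison, a toy count-min sketch (widths and depths here are illustrative). The estimate is an upper bound on the true count, since collisions only inflate counters:

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: estimate(key) >= true count of key, because
    hash collisions can only add to a counter, never subtract."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _idx(self, row, key):
        h = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.width

    def add(self, key, n=1):
        for r in range(self.depth):
            self.rows[r][self._idx(r, key)] += n

    def estimate(self, key):
        # min over rows: the least-contaminated counter wins
        return min(self.rows[r][self._idx(r, key)] for r in range(self.depth))
```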

Also interesting are "Invertible Bloom Lookup Tables" (IBLT), https://arxiv.org/pdf/1101.2245 , which support adding and removing data, and can list the remaining entries as long as "little" data remains. That means they can be used for error correction: add all the _correct_ data (let's say 1 GB of correct data) into a 1 MB IBLT. Say a downloader finds that the checksum doesn't match: they can download that 1 MB IBLT and subtract their (invalid) download from it. What remains is the delta: the error-correction data. I know there are other ways you could do error correction, but this is very fast, and quite interesting technically.
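For the curious, a toy IBLT in Python (parameters are illustrative, integer keys only, and the k hash positions are placed in disjoint subtables for simplicity; this is not the paper's exact construction). Each key is XORed into one cell per subtable along with a checksum; deletion decrements instead of incrementing, and listing peels "pure" cells until nothing is left:

```python
import hashlib

class IBLT:
    """Toy Invertible Bloom Lookup Table over nonzero integer keys."""

    def __init__(self, m=120, k=3):
        assert m % k == 0
        self.m, self.k = m, k
        self.sub = m // k          # k disjoint subtables -> distinct cells per key
        self.count = [0] * m
        self.key_sum = [0] * m
        self.chk_sum = [0] * m

    def _h(self, tag, key):
        d = hashlib.sha256(f"{tag}:{key}".encode()).digest()
        return int.from_bytes(d[:4], "big")

    def _cells(self, key):
        return [i * self.sub + self._h(i, key) % self.sub for i in range(self.k)]

    def _chk(self, key):
        return self._h("chk", key)

    def _apply(self, key, sign):
        c = self._chk(key)
        for j in self._cells(key):
            self.count[j] += sign
            self.key_sum[j] ^= key
            self.chk_sum[j] ^= c

    def insert(self, key):
        self._apply(key, +1)

    def delete(self, key):
        self._apply(key, -1)

    def list_entries(self):
        """Peel 'pure' cells (count +-1 with a matching checksum) until stuck.
        Returns (key, sign) pairs; succeeds w.h.p. if few entries remain."""
        out = []
        progress = True
        while progress:
            progress = False
            for j in range(self.m):
                if self.count[j] in (1, -1) and \
                        self.chk_sum[j] == self._chk(self.key_sum[j]):
                    key, sign = self.key_sum[j], self.count[j]
                    out.append((key, sign))
                    self._apply(key, -sign)   # remove it and keep peeling
                    progress = True
        return out
```

Inserting the correct set and deleting the received set leaves exactly the symmetric difference, which is what the download example above exploits.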


IBLT operates over sets, though, and your downloading example is not over a set (the data has a position). You can make it work over positional data by adding a position number to every encoding unit, but that's pure overhead. Other codes are more efficient.

IBLT is also particularly bandwidth-inefficient when the amount of correction data is not very large. Ordinary codes can recover a missing block (known error location) with just the amount of data missing, or with twice the amount if the location(s) are unknown.

There are codes with quasi-linear performance, though for your example application the number of errors is few, so it shouldn't really matter if the code is cubic in the number of errors to decode (which is about the worst any should be).


Yes, I know it is not designed to be an error-correction code, and other codes (turbo codes, fountain codes) are a lot more efficient. But I wanted to mention it because it's related to the topic.

> You can make it work over data with positions by adding a position number

Yes, that's what I did in my experiments. I personally found it really simple to implement; I'm not sure how easy it is to implement a "real" error correction code.


Fountain code is decoded by essentially the same mechanism, it's called peeling. It's quite fun to implement. Though a plain fountain code has similar bandwidth inefficiencies... also the implementations that exist only correct erasures (though you can turn errors into erasures using a checksum/mac).

You can see the fountain code as a very sparse linear system that is usually solvable through a fairly simple greedy algorithm. You can augment the sparse linear system with a dense one like an RS code, which is solved by Gaussian elimination, which is also fun if less braindead simple.
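A toy peeling decoder, in case anyone wants to see how simple the greedy part is. This is a hand-rolled sketch, not any library's API: each symbol is (set of source indices, XOR of those source blocks), and a real LT code would draw the symbol degrees from a robust soliton distribution rather than by hand.

```python
def lt_peel(symbols, k):
    """Greedy peeling decoder for XOR-based (LT/fountain-style) symbols.
    Repeatedly back-substitutes recovered blocks out of every symbol; any
    symbol left covering a single unknown block reveals that block directly."""
    recovered = {}
    pending = [[set(idxs), val] for idxs, val in symbols]
    progress = True
    while progress and len(recovered) < k:
        progress = False
        for sym in pending:
            idxs, val = sym[0], sym[1]
            for i in list(idxs):
                if i in recovered:          # substitute known blocks out
                    val ^= recovered[i]
                    idxs.discard(i)
            sym[1] = val
            if len(idxs) == 1:              # degree-1 symbol: block revealed
                (i,) = idxs
                recovered[i] = val
                idxs.clear()
                progress = True
    return recovered
```

For example, with source blocks a, b, c, the symbols {0}: a, {0,1}: a^b, {1,2}: b^c peel out in one pass: a directly, then b, then c.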

This GitHub user has a whole host of competently implemented, very high performance erasure codes:

https://github.com/catid/wirehair
https://github.com/catid/leopard
https://github.com/catid/fecal
https://github.com/catid/siamese

As far as direct error codes go: there is a competent BCH code implementation in the Linux kernel, but it only works over small fields, so it doesn't support many independent blocks/packets.

Then there is https://github.com/bitcoin-core/minisketch , which I am a coauthor of: a set corrector like IBLT that achieves perfect communication efficiency (at the cost of cubic decode performance instead of linear, though with good constant factors). I mention it mostly because it contains fast implementations of the relevant and tricky number-theory algorithms needed for a fast RS error-correcting code over big fields (any size up to 2^64 at least)... but I'm not actually aware of a performant large-field RS error-correcting implementation. (IIRC there is an RS error corrector in GNU Radio, but again small-field.)

(Mostly the small-field stuff doesn't work for bigger fields because it uses a Chien search to find the roots of a polynomial, which is linear in the field size. Minisketch finds roots by factoring, in log(|field|) time. Root finding is needed in algebraic error codes because it's how you find where the errors are. IIRC the Linux kernel BCH implementation uses factoring like we do, but is constructed to use a log table to turn multiplications into xors.)


That's an interesting reference. I came up with what may be a simpler case of that: a structure that estimates the last time a key was seen. It's an upper bound, as collisions make the time more recent, but never the other way. Instead of a partial order it's a total order, but it may be similar to the compact approximator.

Rate limiters apparently use count-min over a fixed time interval, which is bursty, but I took the same hash structure and came up with "timestamp-min" that allows even pacing (also without "fill up"). The one-sided error is also useful for checking if a cache entry is too old.

It doesn't fix everything (e.g. DDoS), but it can prevent any single client from over-requesting, or from stealing another client's requests (collisions can be made as hard as necessary).

https://github.com/KWillets/RecencySketch
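A minimal sketch of that timestamp-min idea (my own illustrative reconstruction, not necessarily what the repo does): update writes max(cell, now) and query takes min over cells, so a key's reported last-allowed time can only err later. A collision can over-throttle, never under-throttle.

```python
import hashlib

class TimestampMin:
    """Pacing rate limiter over a count-min-style hash grid. The reported
    last-allowed time is an upper bound on the truth, so the one-sided error
    is on the safe (throttling) side. Parameters are illustrative."""

    def __init__(self, width=1024, depth=4, min_interval=1.0):
        self.width, self.depth = width, depth
        self.min_interval = min_interval
        self.rows = [[0.0] * width for _ in range(depth)]

    def _idx(self, row, key):
        h = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.width

    def _last_allowed(self, key):
        # min over cells: tightest upper bound on the key's last allowed time
        return min(self.rows[r][self._idx(r, key)] for r in range(self.depth))

    def allow(self, key, now):
        """Admit the request iff the key's pacing interval has elapsed.
        Pass e.g. time.time() as `now`."""
        if now - self._last_allowed(key) < self.min_interval:
            return False
        for r in range(self.depth):
            j = self._idx(r, key)
            if now > self.rows[r][j]:
                self.rows[r][j] = now
        return True
```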


While there are real use cases for structures like these, if you are looking at Bloom filters, make sure the semantics actually fit your need.

First-order logic with a least fixed point operator, FO(LFP), captures P. Since P = co-P but we think NP != co-NP, Bloom filters often have value as a small and fast path to co-NP-style questions.

In that case the collisions don't matter, because excluding membership is often more valuable than proving membership, and you have to choose one.

That is why Outlook used one to cut down address-completion lookups: even if a hit requires the expensive call to the server, the set of email addresses not in your address book is far larger.

But it is all horses for courses.

If your need does fit in P you have a lot of options, while the options for co-NP are far more rarefied.
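For concreteness, a minimal Bloom filter where the negative answer is the definitive one (sizes here are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: False means 'definitely not in the set';
    True only means 'maybe'. The cheap, exact path is the negative one."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _idx(self, i, key):
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.m

    def add(self, key):
        for i in range(self.k):
            self.bits[self._idx(i, key)] = 1

    def might_contain(self, key):
        # all bits set: maybe present; any bit clear: certainly absent
        return all(self.bits[self._idx(i, key)] for i in range(self.k))
```

In the Outlook-style use, a False here skips the server call entirely; only the (rarer) True needs the expensive lookup.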


Can you provide a link to a paper or reference?



Ooh, this seems like it could be useful for collision avoidance (where you need to know a lower bound on your time to impact).



