
I wonder if you could do something with Bloom filters similar to:

  filters = {}      # one Bloom filter per 4-digit prefix (assumes a BloomFilter class with add() and `in`)
  candidates = []
  for i in range(len(pi) - 9):             # pi is the digit string; stop before the blocks run short
      block = pi[i:i+10]
      hash_prefix = block[:4]
      try:
          bloom = filters[hash_prefix]
      except KeyError:
          bloom = filters[hash_prefix] = BloomFilter()
          bloom.add(block)
          candidates.append((i, block))
          continue
      if block in bloom:                   # "might have seen it before" -> keep as a candidate
          candidates.append((i, block))
      else:
          bloom.add(block)                 # "definitely new" -> just remember it
Basically, make one pass through to build a list of statistically likely candidates to evaluate later. Then use full string matching on those candidates to find the real matching sequences.

That might be helpful here because Bloom filters can say "we might have seen this string before" or "we definitely have not seen it before". That may greatly reduce the amount of storage you need.



Exactly this. Use a Bloom filter to find candidate solutions, write those out and review them later. Since you can write the entire candidate and its location, the review consists of a simple sort.
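A minimal sketch of that review step, assuming the candidates list of (position, block) pairs from the snippet above: sort by block so identical blocks become adjacent, then keep any block seen at more than one position.

  from itertools import groupby

  candidates.sort(key=lambda c: c[1])              # sort by the digit block
  matches = []
  for block, group in groupby(candidates, key=lambda c: c[1]):
      positions = [pos for pos, _ in group]
      if len(positions) > 1:                       # same block at two or more positions
          matches.append((block, positions))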

Digging in to estimate the speed, if your Bloom filter has a 0.1% chance of a collision then a trillion digits of pi will result in a billion candidates. This requires ~15 bits per entry or about 2TB of memory for the full Bloom filter. Using multiple passes on subsets of the data allows you to trade speed for space.
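As a sanity check on those numbers, here is a back-of-the-envelope sizing sketch using the standard Bloom filter formulas (the trillion entries and 0.1% false-positive rate are just the figures from above):

  import math

  n = 10**12                                       # one trillion 10-digit blocks
  p = 0.001                                        # target false-positive rate
  bits_per_entry = -math.log(p) / math.log(2)**2   # ~14.4 bits per entry
  total_bytes = n * bits_per_entry / 8             # ~1.8 TB for the full filter
  k = round(bits_per_entry * math.log(2))          # ~10 hash functions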

Note also that a 128-bit integer has room for 38 decimal digits. This means that you can do all of the shifting using modulo, multiplication and addition operations. I would be very surprised if the sieving of candidates isn't faster than the disk I/O needed to read the raw digits. As such, a single pass should cost roughly as much as reading ~500GB of data (a trillion digits packed at 4 bits per digit), which means that multi-threading will be of limited use. This should be less than an hour on most machines. That means that one of my ancient Intel NUCs with 32GB should be able to scan the entire range in about a day, no matter what size sequence we are looking for.
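A sketch of that digit-window idea (Python ints are arbitrary precision, but the same window fits comfortably in a 128-bit integer in C; assumes the digits arrive as ints 0-9):

  WINDOW = 10**18                          # 18 digits, well under the 38-digit limit of 128 bits

  def windows(digit_stream):
      w = 0
      for count, d in enumerate(digit_stream, 1):
          w = (w * 10 + d) % WINDOW        # shift in the new digit, drop the oldest
          if count >= 18:
              yield w                      # the current 18-digit window as one integer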


This is very nice, but I can see a way to optimize it even further.

Consider pi[i:i+10] and pi[i+1:i+11]: they are highly similar. Intuitively, storing both of them seems wasteful. But how can we avoid storing them both? This is where we have to get clever.

We are looking for a length-18 match. If positions a and b are a length-18 match, then (a, b) is also a length-10 match, and so are (a+1, b+1) and (a+2, b+2). Wouldn't it be nice if we could store only one of a, a+1, a+2, etc.?

The naive attempt is to do the loop with a stride of, say, 3, but this will not work: it may happen that, e.g., a = 1 mod 3 while b = 2 mod 3.

However, there is another approach that does work. We need a position-independent way of deciding between a, a+1, etc. We can do it like this: loop over all the positions; when we encounter position a, score the length-10 sequences starting at a, a+1, ..., a+8. Any scoring function will do as long as it doesn't produce ties; for simplicity we can use the 10-digit number itself as the score. The highest-scoring block is the one we store. As an implementation detail, we can use a "sliding window maximum" algorithm to avoid recomputing and re-comparing the scores.

Note that in this scheme we may actually end up storing more than one of a, ..., a+8. But it will cut down the storage quite a bit at least.
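A naive sketch of that selection rule (the function name is mine; assumes digits is pi as a string of decimal digits). Since an 18-digit match at two positions means the two windows of nine 10-digit blocks contain the same values, both copies select the same highest value:

  K, W = 10, 9                             # block length, window of consecutive starting positions

  def representative_positions(digits):
      picked = set()
      for a in range(len(digits) - K - W + 2):
          window = [(int(digits[i:i+K]), i) for i in range(a, a + W)]
          picked.add(max(window)[1])       # score = the number itself; position breaks ties
      return sorted(picked)                # positions to feed into the Bloom filter pass

The naive loop rescans each window; the "sliding window maximum" trick mentioned above brings this back to a single O(n) pass.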


That would be O(n^2) though, where n is the max length of the strings to compare, right?


No, it's possible to do much better https://www.geeksforgeeks.org/sliding-window-maximum-maximum...
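For completeness, here is a sketch of the O(n) deque-based sliding window maximum (my own illustration of the technique behind that link), applied to a sequence of per-position scores:

  from collections import deque

  def sliding_window_max(scores, w):
      dq = deque()                         # indices of candidate maxima, scores decreasing
      for i, s in enumerate(scores):
          while dq and scores[dq[-1]] <= s:
              dq.pop()                     # drop anything dominated by the new score
          dq.append(i)
          if dq[0] <= i - w:
              dq.popleft()                 # drop indices that have left the window
          if i >= w - 1:
              yield dq[0], scores[dq[0]]   # (position of the max, max score)

Each index enters and leaves the deque at most once, so the scan is linear regardless of the window size.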

Edit: It should also be mentioned that, due to the conjectured normality of pi, comparing two sequences will on average terminate very quickly: looking at just one or two digits is enough most of the time, so a comparison is expected O(1).


Indeed: roughly 90% of comparisons stop after one digit, 99% after two, and so on.


My current method is O((n/h)^2), where h is the hash size, which is really O(n^2) for n >> h. I'll have to think about this to see if it would be better.


I need to noodle on this. Hmm.


Oh, that is interesting! And indeed I did consider doing a 'hash compression' where I would generate offsets to numbers that might be similar, then do direct comparison on the ones that are similar.

I'll take a crack at doing this - it could certainly be an improvement, and it would just be fun to try out.


Please let us know!


I read the original post and thought: "someone knows a more efficient way to do this."

Thanks for being that someone.


Full disclosure: that’s a wild guess. It may be helpful, or it might turn out to be a terrible idea for reasons I hadn’t considered. I’m definitely not claiming it is a better way.

That said, I’m curious now. I also refuse to allow myself to dig further into it. I’ve got enough spinning plates at the moment. :-)


Yep! I suspect there are lots of different ways to optimize this search, especially with some heuristic approaches. One of the great things about HN: there is always someone with a better way!



