
I wonder if you could do something with Bloom filters similar to:

  filters = {}      # one Bloom filter per 4-digit prefix (assumes a BloomFilter class with add() and `in`)
  candidates = []
  for i in range(len(pi) - 9):             # pi is the digit string; stop before the blocks run short
      block = pi[i:i+10]
      hash_prefix = block[:4]
      try:
          bloom = filters[hash_prefix]
      except KeyError:
          bloom = filters[hash_prefix] = BloomFilter()
          bloom.add(block)
          candidates.append((i, block))
          continue
      if block in bloom:                   # "might have seen it before" -> keep as a candidate
          candidates.append((i, block))
      else:
          bloom.add(block)                 # "definitely new" -> just remember it
Basically, make one pass through to build a list of statistically likely candidates to evaluate later. Then use full string matching on those candidates to find the real matching sequences.

That might be helpful here because Bloom filters can say "we might have seen this string before" or "we definitely have not seen it before". That may greatly reduce the amount of storage you need.



Exactly this. Use a Bloom filter to find candidate solutions, write those out and review them later. Since you can write the entire candidate and its location, the review consists of a simple sort.
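A minimal sketch of that review step, assuming the candidates list of (position, block) pairs from the snippet above: sort by block so identical blocks become adjacent, then keep any block seen at more than one position.

  from itertools import groupby

  candidates.sort(key=lambda c: c[1])              # sort by the digit block
  matches = []
  for block, group in groupby(candidates, key=lambda c: c[1]):
      positions = [pos for pos, _ in group]
      if len(positions) > 1:                       # same block at two or more positions
          matches.append((block, positions))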

Digging in to estimate the speed, if your Bloom filter has a 0.1% chance of a collision then a trillion digits of pi will result in a billion candidates. This requires ~15 bits per entry or about 2TB of memory for the full Bloom filter. Using multiple passes on subsets of the data allows you to trade speed for space.
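As a sanity check on those numbers, here is a back-of-the-envelope sizing sketch using the standard Bloom filter formulas (the trillion entries and 0.1% false-positive rate are just the figures from above):

  import math

  n = 10**12                                       # one trillion 10-digit blocks
  p = 0.001                                        # target false-positive rate
  bits_per_entry = -math.log(p) / math.log(2)**2   # ~14.4 bits per entry
  total_bytes = n * bits_per_entry / 8             # ~1.8 TB for the full filter
  k = round(bits_per_entry * math.log(2))          # ~10 hash functions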

Note also that a 128-bit integer has room for 38 decimal digits. This means that you can do all of the shifting using modulo, multiplication and addition operations. I would be very surprised if the sieving of candidates isn't faster than the disk I/O needed to read the raw digits. As such, a single pass should cost roughly as much as reading ~500GB of data (a trillion digits packed at 4 bits per digit), which means that multi-threading will be of limited use. This should be less than an hour on most machines. That means that one of my ancient Intel NUCs with 32GB should be able to scan the entire range in about a day, no matter what size sequence we are looking for.
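A sketch of that digit-window idea (Python ints are arbitrary precision, but the same window fits comfortably in a 128-bit integer in C; assumes the digits arrive as ints 0-9):

  WINDOW = 10**18                          # 18 digits, well under the 38-digit limit of 128 bits

  def windows(digit_stream):
      w = 0
      for count, d in enumerate(digit_stream, 1):
          w = (w * 10 + d) % WINDOW        # shift in the new digit, drop the oldest
          if count >= 18:
              yield w                      # the current 18-digit window as one integer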


This is very nice, but I can see a way to optimize it even further.

Consider pi[i:i+10] and pi[i+1:i+11]: they are highly similar. Intuitively, storing both of them seems wasteful. But how can we avoid storing them both? This is where we have to get clever.

We are looking for a length-18 match. If positions a and b are a length-18 match, then (a, b) is also a length-10 match, and so are (a+1, b+1) and (a+2, b+2). Wouldn't it be nice if we could store only one of a, a+1, a+2, etc.?

The naive attempt is to do the loop with a stride of, say, 3, but this will not work: it may happen that, e.g., a = 1 mod 3 while b = 2 mod 3.

However, there is another approach that does work. We need a position-independent way of deciding between a, a+1, etc. We can do it like this: loop over all the positions; when we encounter position a, score the length-10 sequences starting at a, a+1, ..., a+8. Any scoring function will do as long as it doesn't produce ties; for simplicity we can use the 10-digit number itself as the score. The highest-scoring block is the one we store. As an implementation detail, we can use a "sliding window maximum" algorithm to avoid recomputing and re-comparing the scores.

Note that in this scheme we may actually end up storing more than one of a, ..., a+8. But it will cut down the storage quite a bit at least.
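A naive sketch of that selection rule (the function name is mine; assumes digits is pi as a string of decimal digits). Since an 18-digit match at two positions means the two windows of nine 10-digit blocks contain the same values, both copies select the same highest value:

  K, W = 10, 9                             # block length, window of consecutive starting positions

  def representative_positions(digits):
      picked = set()
      for a in range(len(digits) - K - W + 2):
          window = [(int(digits[i:i+K]), i) for i in range(a, a + W)]
          picked.add(max(window)[1])       # score = the number itself; position breaks ties
      return sorted(picked)                # positions to feed into the Bloom filter pass

The naive loop rescans each window; the "sliding window maximum" trick mentioned above brings this back to a single O(n) pass.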


That would be O(n^2) though, where n is the max length of the strings to compare, right?


No, it's possible to do much better https://www.geeksforgeeks.org/sliding-window-maximum-maximum...
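For completeness, here is a sketch of the O(n) deque-based sliding window maximum (my own illustration of the technique behind that link), applied to a sequence of per-position scores:

  from collections import deque

  def sliding_window_max(scores, w):
      dq = deque()                         # indices of candidate maxima, scores decreasing
      for i, s in enumerate(scores):
          while dq and scores[dq[-1]] <= s:
              dq.pop()                     # drop anything dominated by the new score
          dq.append(i)
          if dq[0] <= i - w:
              dq.popleft()                 # drop indices that have left the window
          if i >= w - 1:
              yield dq[0], scores[dq[0]]   # (position of the max, max score)

Each index enters and leaves the deque at most once, so the scan is linear regardless of the window size.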

Edit: It should also be mentioned that, due to the conjectured normality of pi, comparing two sequences will on average terminate very quickly: looking at just one or two digits is enough most of the time, so a comparison is expected O(1).


Indeed: roughly 90% of comparisons stop after one digit, 99% after two, and so on.


My current method is O((n/h)^2), where h is the hash size, which is really O(n^2) for n >> h. I'll have to think about this to see if it would be better.


I need to noodle on this. Hmm.


Oh, that is interesting! And indeed I did consider doing a 'hash compression' where I would generate offsets to numbers that might be similar, then do direct comparison on the ones that are similar.

I'll take a crack at doing this - it could certainly be an improvement, and it would just be fun to try out.


Please let us know!


I read the original post and thought: "someone knows a more efficient way to do this."

Thanks for being that someone.


Full disclosure: that’s a wild guess. It may be helpful, or it might turn out to be a terrible idea for reasons I hadn’t considered. I’m definitely not claiming it is a better way.

That said, I’m curious now. I also refuse to allow myself to dig further into it. I’ve got enough spinning plates at the moment. :-)


Yep! I suspect there are lots of different ways to optimize this search, especially with some heuristic approaches. One of the great things about HN: there is always someone with a better way!



