
Sure,

I implemented the paper "How to squeeze a lexicon" http://www.n3labs.com/pdf/lexicon-squeeze.pdf with a few tweaks a while back. It uses a simple fixed format for all nodes, which works very well for datasets that contain many common prefixes and suffixes. For example, a Polish dictionary containing 1.3 million strings is compressed down to 726KB. One of the things I'm using it for is a database for geoip lookups. It uses less than 1 byte per IP and can answer queries such as "find all values for the given IP range" in O(k).
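To give a rough feel for what a lexicon automaton does at lookup time, here is a toy membership test over a transition table. This is purely illustrative: the paper packs each node into a fixed binary format rather than anything like the Python dicts below.

    # Toy lexicon automaton for {"cat", "car", "dog"}.
    # State -> {char: next_state}; shared final state illustrates
    # the suffix sharing that makes these structures compress well.
    transitions = {
        0: {'c': 1, 'd': 4},
        1: {'a': 2},
        2: {'t': 3, 'r': 3},
        3: {},
        4: {'o': 5},
        5: {'g': 3},
    }
    accepting = {3}

    def contains(word):
        state = 0
        for ch in word:
            state = transitions.get(state, {}).get(ch)
            if state is None:
                return False
        return state in accepting

    assert contains("cat") and contains("car") and contains("dog")
    assert not contains("ca") and not contains("do")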

Dawid Weiss has a talk about how finite state machines are used in Lucene at http://vimeo.com/26517310

The two approaches I've seen used to store key-value pairs in lexicons are:

1. Embed the value using a magic separator. E.g. (key0, value0), (key1, value1) becomes "key0.value0" and "key1.value1". The lexicon supports fast prefix lookups, which is what you use when looking up a key (sketched after this list). As an added bonus, it supports multiple values per key out of the box, and you can easily find all key/values for a given prefix.

2. Add support for minimal perfect hashing to the lexicon. It will give you a unique ID for every key, which can be used to do a direct lookup in another array containing the values. How to do this is not covered by the paper, but is an interesting exercise for the reader :-)
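Here is a minimal sketch of approach 1. A sorted Python list plus bisect stands in for the compressed lexicon (the real structure is what makes the prefix lookup cheap), and it assumes '\x00' never appears inside a key:

    import bisect

    SEP = '\x00'   # assumed separator, never present in keys
    entries = sorted(["key0" + SEP + "value0",
                      "key1" + SEP + "value1a",
                      "key1" + SEP + "value1b"])

    def values_for(key):
        # All entries for a key share the prefix "key\x00",
        # so a prefix scan from bisect_left finds them all.
        prefix = key + SEP
        i = bisect.bisect_left(entries, prefix)
        out = []
        while i < len(entries) and entries[i].startswith(prefix):
            out.append(entries[i][len(prefix):])
            i += 1
        return out

    print(values_for("key1"))   # ['value1a', 'value1b']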

Now, back to the article. As cperciva mentioned, the dataset in the article isn't completely static, and the usefulness of finite state machines drops fast if they need to be recompiled often. However, the key-value pairs are immutable, so it would be straightforward to implement a level-based scheme: keep new items in a simple structure and move them into compressed lexicons periodically.
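A rough sketch of that level idea, with a plain dict as the mutable level and a hypothetical build_lexicon hook standing in for the periodic compression step:

    class LeveledStore:
        def __init__(self, build_lexicon=dict):
            # build_lexicon is a stand-in: given key/value pairs,
            # it returns the (immutable) compressed lexicon for them.
            self.build_lexicon = build_lexicon
            self.hot = {}                       # small, mutable level
            self.frozen = build_lexicon({})     # compressed, immutable level

        def put(self, key, value):
            self.hot[key] = value               # pairs are immutable, so no update path

        def get(self, key):
            if key in self.hot:
                return self.hot[key]
            return self.frozen.get(key)

        def compact(self):
            # Periodic merge: rebuild the compressed level from scratch
            # (assumes the lexicon can be iterated), then clear the hot level.
            merged = dict(self.frozen.items())
            merged.update(self.hot)
            self.frozen = self.build_lexicon(merged)
            self.hot = {}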



Isn't this basically a trie?


Thanks for your answer, I do appreciate that.



