Hmm. Without any thought given to computability: Rearrange the words of every do...

Hmm. Without any thought given to computability: Rearrange the words of every document into alphabetical order, then sort the whole list of documents into alphabetical order. Now, start reading the documents, and for each one make it a tree with each word a leaf, BUT if there is already a tree with a fragment of the same phrase, then that tree is linked to as a node of the new tree.

So for the document 'christmas day' the tree

______d1______.

__/_________\__.

christmas___day

is formed. But for the document 'chistmas day special', the tree

_____d2______.

___/____\____.

d1_______special

is formed. Note: word order doesn't matter. For any given document, how will it know if there's already a tree that uses some of the same words? When a tree is created from a document, give it a hash for the full set of words, plus hashes for sub-strings. These can be stored in a big lookup table, so new documents know where to look for relevant trees (or -- use whatever solution is well established for this sort of problem). Might also need to algorithmically redefine older trees as we go along, ie. divide a tree into smaller phrases

For each leaf let it keep track of what trees its part of (including grandparents and above). Then to find the top-k, for each leaf object (if it's in the vocabulary), print out all the other leaves of all the trees its part of. Like

if leaf.__str__ == 'day':

___ for tree in leaf.trees

______ print tree.leaves

Then do a simple frequency count to find the top-k associated words.

(sorry for the n00b answer... IANACS)