Hmm. Without any thought given to computability:
Rearrange the words of every document into alphabetical order, then sort the whole list of documents into alphabetical order. Now, start reading the documents, and for each one make it a tree with each word a leaf, BUT if there is already a tree with a fragment of the same phrase, then that tree is linked to as a node of the new tree.
So for the document 'christmas day' the tree
______d1______.
__/_________\__.
christmas___day
is formed. But for the document 'chistmas day special', the tree
_____d2______.
___/____\____.
d1_______special
is formed. Note: word order doesn't matter. For any given document, how will it know if there's already a tree that uses some of the same words? When a tree is created from a document, give it a hash for the full set of words, plus hashes for sub-strings. These can be stored in a big lookup table, so new documents know where to look for relevant trees (or -- use whatever solution is well established for this sort of problem). Might also need to algorithmically redefine older trees as we go along, ie. divide a tree into smaller phrases
For each leaf let it keep track of what trees its part of (including grandparents and above). Then to find the top-k, for each leaf object (if it's in the vocabulary), print out all the other leaves of all the trees its part of. Like
if leaf.__str__ == 'day':
___ for tree in leaf.trees
______ print tree.leaves
Then do a simple frequency count to find the top-k associated words.
So for the document 'christmas day' the tree
______d1______.
__/_________\__.
christmas___day
is formed. But for the document 'chistmas day special', the tree
_____d2______.
___/____\____.
d1_______special
is formed. Note: word order doesn't matter. For any given document, how will it know if there's already a tree that uses some of the same words? When a tree is created from a document, give it a hash for the full set of words, plus hashes for sub-strings. These can be stored in a big lookup table, so new documents know where to look for relevant trees (or -- use whatever solution is well established for this sort of problem). Might also need to algorithmically redefine older trees as we go along, ie. divide a tree into smaller phrases
For each leaf let it keep track of what trees its part of (including grandparents and above). Then to find the top-k, for each leaf object (if it's in the vocabulary), print out all the other leaves of all the trees its part of. Like
if leaf.__str__ == 'day':
___ for tree in leaf.trees
______ print tree.leaves
Then do a simple frequency count to find the top-k associated words.
(sorry for the n00b answer... IANACS)