This is poorly worded. Detecting "hallucinations" in the common sense of the term, a model inventing answers not actually in its source text, or answers that are simply untrue, is fundamentally impossible. Verifying the truth of a statement requires empirical investigation; it isn't a feature of language itself. This is just the basic analytic/synthetic distinction Kant identified centuries ago. It's why we have science in the first place and don't generate new knowledge by reading and learning to make convincing-sounding arguments.
Your far more scaled-down claim, however, that you can detect answers that don't address a prompt at all, or summaries that make claims not actually present in the original text, is definitely doable, but it raises a maybe naive or stupid question. If you can do this, why not sell an LLM that simply doesn't make these mistakes in the first place? Or why don't the people currently selling LLMs automatically detect obvious errors and avoid making them? Doesn't your business as constituted depend on LLM vendors never figuring out how to do this themselves?
And what's the false positive rate? It's all well and good that you catch most answers that are hallucinations, but do you also flag a significant percentage of answers that are not really hallucinations? For instance, if a summary doesn't use any sentences, or even words, from the original text, that doesn't necessarily mean it's a hallucination. It could simply be a fully paraphrased summary.
Could you not detect likely hallucinations by running the same prompt through multiple different models and measuring the divergence between the output vectors? Kind of like a vote among, say, GPT, Llama, and other models: if the outputs diverge, that's a signal the answer is likely a hallucination.
It's not 100%, but enough to basically say to the human: "hey, look at this".
You can do it, and it's a good way of doing it; in our experiments it can catch most errors. You don't even need different models: re-running the same workflow with the same model (I don't mean asking "are you sure?") gives good results too. The only problem is that it's super expensive to run on all your traces, so I wouldn't recommend it as a monitoring tool.
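The cross-run agreement idea above can be sketched in a few lines. This is a minimal illustration, not anyone's actual product: it uses token-set Jaccard similarity as a cheap stand-in for the embedding-based "vector divergence" the comment suggests, and the threshold of 0.5 is an arbitrary assumption you would tune on real data.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    # Token-set Jaccard similarity; a cheap stand-in for cosine
    # similarity between embedding vectors of two answers.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def flag_for_review(answers: list[str], threshold: float = 0.5) -> bool:
    # Flag when average pairwise agreement across re-runs (same
    # model, or several different models) falls below the threshold.
    sims = [jaccard(a, b) for a, b in combinations(answers, 2)]
    return sum(sims) / len(sims) < threshold

# Three consistent runs agree, so nothing is flagged; one wildly
# divergent run drags the average agreement down and gets flagged.
consistent = ["paris is the capital of france"] * 3
divergent = ["the cat sat on the mat",
             "the cat sat on a mat",
             "unrelated nonsense entirely"]
```

Note the cost the commenter mentions: each check multiplies your inference bill by the number of re-runs, which is why this works better as a spot check than as always-on monitoring.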