Whether hallucination "happens often" depends heavily on the task domain and how you define correctness. For a simple conversational question about general knowledge, an LLM might be right more often than not, but in complex domains like cloud config, compliance, law, or system design, even a single confidently wrong answer can be catastrophic.
The real risk isn't frequency averaged across all use cases; it's the impact when a hallucination does occur. That's also why confidence alone isn't a good proxy for correctness: models generate fluent, assured-sounding text whether or not they actually know the right answer.
A better way to think about it is: Does this output satisfy the contract you intended for your use case? If not, it’s unfit for production regardless of overall accuracy rates.
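To make the "contract" framing concrete, here is a minimal sketch, assuming Python and a pydantic-style schema. The names DeploymentConfig, ALLOWED_REGIONS, and accept_output are hypothetical, not from any specific library or product; the point is that the output is only accepted if it parses and passes the checks you actually care about, independent of how confident the text sounded.

```python
# Minimal sketch of a "contract" gate for LLM output, assuming pydantic v2.
# All names here are illustrative, not from any specific product.
from pydantic import BaseModel, ValidationError, field_validator

ALLOWED_REGIONS = {"us-east-1", "eu-west-1"}  # hypothetical allow-list for the example

class DeploymentConfig(BaseModel):
    region: str
    instance_type: str
    max_replicas: int

    @field_validator("region")
    @classmethod
    def region_must_be_allowed(cls, value: str) -> str:
        # Enforce a semantic constraint, not just the shape of the output.
        if value not in ALLOWED_REGIONS:
            raise ValueError(f"region {value!r} is not on the allow-list")
        return value

def accept_output(raw_llm_output: str) -> DeploymentConfig | None:
    """Accept the model's output only if it satisfies the contract; reject it otherwise."""
    try:
        return DeploymentConfig.model_validate_json(raw_llm_output)
    except ValidationError:
        # A fluent, confident answer still fails here if it violates the contract.
        return None
```

Checks like this won't catch every hallucination (a wrong but well-formed value can still slip through), but they turn "does it satisfy the contract?" into an explicit gate instead of a judgment about how confident the output sounds.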