Or we could just ask the same question of three different LLMs, ideally a large LLM, a RAG-backed LLM, and a small one, then use an LLM again to rewrite the final answer. When the models contradict each other, there is likely hallucination going on; correct answers tend to converge.
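A minimal sketch of that idea, assuming each model is wrapped in a callable that takes a prompt and returns text (how those callables are implemented, OpenAI SDK, local inference, a RAG pipeline, is left open):

```python
# Ask several models the same question, then have a "judge" model reconcile.
# `models` maps a label ("large", "rag", "small", ...) to a callable that
# sends a prompt to that model and returns its answer as a string.
from typing import Callable, Dict


def cross_check(question: str,
                models: Dict[str, Callable[[str], str]],
                judge: Callable[[str], str]) -> str:
    # Ask every model the same question independently.
    answers = {name: ask(question) for name, ask in models.items()}

    # Hand all the answers to the judge model and ask it to rewrite a single
    # final answer, flagging claims the models disagree on as suspect.
    review_prompt = (
        f"Question: {question}\n\n"
        + "\n\n".join(f"Answer from {name}:\n{text}"
                      for name, text in answers.items())
        + "\n\nWrite one final answer. Keep only claims the answers agree on "
          "and explicitly flag any point where they contradict each other."
    )
    return judge(review_prompt)
```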
Why use an LLM to check the work of a different LLM?
You could use the same technique this paper describes to compare the answers each LLM gave. LLMs don't have to be in opposition to traditional NLP techniques.
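For example, a non-LLM comparison could be as simple as pairwise cosine similarity over sentence embeddings. This is just an illustrative stand-in, not necessarily the paper's own scoring, which could be swapped into `pairwise_agreement` instead:

```python
# Score how much a set of answers agree with each other using sentence
# embeddings (sentence-transformers) rather than another LLM call.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer


def pairwise_agreement(answers: list[str]) -> float:
    """Mean cosine similarity over all answer pairs (closer to 1.0 = closer in meaning)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embs = model.encode(answers, normalize_embeddings=True)
    sims = [float(np.dot(embs[i], embs[j]))
            for i, j in combinations(range(len(answers)), 2)]
    return float(np.mean(sims))


# Low agreement across the models is a signal to distrust (or re-ask) the answer.
if __name__ == "__main__":
    score = pairwise_agreement([
        "The Eiffel Tower is 330 metres tall.",
        "It stands about 330 m high.",
        "The tower is 1,083 feet (330 m) tall.",
    ])
    print(f"agreement: {score:.2f}")
```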