> If the summaries of the test documents are good, future summaries will probably be OK too

But that is exactly what is problematic with hallucinations. They are a rare, exceptional behaviour that triggers an extreme departure from reality, so you can't estimate the extremes by extrapolating from common, moderate observations. You would have to test a lot of previous documents to be confident, and even then there would be a residual risk.
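
To put a rough number on "a lot of documents": a minimal sketch, assuming each summary hallucinates independently with the same unknown probability `p` (a simplification, since real documents vary). If `n` test summaries come back clean, the largest `p` still consistent with that result at a given confidence solves `(1 - p)^n = 1 - confidence`. The function names here are made up for illustration.

```python
import math


def upper_bound_rate(n_clean: int, confidence: float = 0.95) -> float:
    """Largest hallucination rate still consistent with n_clean
    clean summaries in a row, at the given one-sided confidence.

    Assumes independent, identically distributed documents.
    Solves (1 - p)^n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_clean)


def docs_needed(max_rate: float, confidence: float = 0.95) -> int:
    """Number of consecutive clean summaries needed before the rate
    can be bounded below max_rate at the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_rate))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"{n} clean summaries -> rate could still be up to "
              f"{upper_bound_rate(n):.4f}")
    print("clean summaries needed to bound the rate below 0.1%:",
          docs_needed(0.001))
```

Under these assumptions the bound is roughly `3 / n` at 95% confidence (the "rule of three"), so 100 clean test summaries only rule out rates above about 3%, and the bound never reaches zero. It also says nothing about future documents that differ from the tested ones.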