Is this true though? I haven't done the experiment, but I can envision the LLM critiquing its own output (if it was created in a different session), iteratively correcting it, and always finding flaws in it. Are LLMs even primed to say "this is perfect and it needs no further improvements"?
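Roughly the experiment I have in mind, as a minimal sketch using the OpenAI Python SDK (the model name, prompts, and round limit are just placeholders I made up, not anything tested):

```python
# Sketch: have a model draft text, then repeatedly critique and revise its own
# output in fresh calls, and see whether it ever declares the result "perfect".
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Single-turn call so each critique sees the text "cold", as if from another session
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whatever you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

draft = ask("Write a short paragraph explaining recursion.")
for i in range(5):
    critique = ask(
        "Critique this paragraph. If it needs no changes, reply only 'PERFECT'.\n\n" + draft
    )
    if critique.strip() == "PERFECT":
        print(f"Declared perfect after {i} critique rounds.")
        break
    draft = ask(
        "Revise the paragraph to address this critique.\n\n"
        f"Paragraph:\n{draft}\n\nCritique:\n{critique}"
    )
else:
    print("Never declared its own output perfect within 5 rounds.")
```

My hunch is the loop rarely hits the break: the critique prompt itself invites the model to keep finding something to fix.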
What I have seen is ChatGPT and Claude battling it out when given the same problem, each always correcting and finding fault with the other's output. It's hilarious.
If you ask an AI to grade a set of essays, it will give the highest grade to the essay it wrote itself.