This is similar to the problem observed in diffusion models going "MAD" when trained on synthetic data: https://arxiv.org/abs/2307.01850. Going forward, AI companies will therefore find it increasingly difficult to get their data by scraping the web, because the web will be full of synthetically generated data.