Don’t forget machine-translated texts: until ~2017 the translation was likely done by something much dumber and more semantically lossy than an LLM, and after 2017 it was basically done by an early form of LLM (the Transformer architecture was originally developed at Google for machine translation).
Many historical English-language news reports published on the English-language websites of foreign news media from non-English-speaking countries, from 1998 (Babelfish era) to ~a few months ago, may be unreliable training data for this reason.