Ah wow, thanks! Not sure how I missed that. For other interested parties, here's the key section:
> Unfortunately, the majority of [the text in Common Crawl] is not natural language. Instead, it largely comprises gibberish or boiler-plate text like menus, error messages, or duplicate text. Furthermore, a good deal of the scraped text contains content that is unlikely to be helpful for any of the tasks we consider (offensive language, placeholder text, source code, etc.). To address these issues, we used the following heuristics for cleaning up Common Crawl’s web extracted text:
> • We only retained lines that ended in a terminal punctuation mark (i.e. a period, exclamation mark, question mark, or end quotation mark).
> • Many of the scraped pages contained warnings stating that Javascript should be enabled so we removed any line with the word Javascript.
> • Some pages had placeholder “lorem ipsum” text; we removed any page where the phrase “lorem ipsum” appeared.
> • Some pages inadvertently contained code. Since the curly bracket “{” appears in many programming languages (such as Javascript, widely used on the web) but not in natural text, we removed any pages that contained a curly bracket.
> • To deduplicate the dataset, we discarded all but one of any three-sentence span occurring more than once in the dataset.
> Additionally, since most of our downstream tasks are focused on English-language text, we used langdetect [https://pypi.org/project/langdetect/] to filter out any pages that were not classified as English with a probability of at least 0.99.
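For anyone curious what those heuristics look like in practice, here's a minimal Python sketch of my own (not the actual T5/C4 code, and I've left out the langdetect language filter); all function names are mine:

```python
import re

# Line must end in ". ! ?" or a closing quote to be kept (heuristic 1).
TERMINAL_PUNCT = ('.', '!', '?', '"', '\u201d', '\u2019')

def clean_page(text):
    """Apply the page/line heuristics; return cleaned text, or None to drop the page."""
    # Drop the whole page on "lorem ipsum" or any curly bracket (heuristics 3 and 4).
    if 'lorem ipsum' in text.lower() or '{' in text:
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines ending in terminal punctuation (heuristic 1)...
        if not line.endswith(TERMINAL_PUNCT):
            continue
        # ...that don't mention Javascript (heuristic 2).
        if 'javascript' in line.lower():
            continue
        kept.append(line)
    return '\n'.join(kept) if kept else None

def dedup_three_sentence_spans(pages):
    """Discard all but the first occurrence of any three-sentence span (heuristic 5)."""
    seen = set()
    result = []
    for page in pages:
        # Crude sentence split; the real pipeline is presumably more careful.
        sents = re.split(r'(?<=[.!?])\s+', page)
        drop = [False] * len(sents)
        for i in range(len(sents) - 2):
            span = tuple(sents[i:i + 3])
            if span in seen:
                drop[i] = drop[i + 1] = drop[i + 2] = True
            else:
                seen.add(span)
        result.append(' '.join(s for i, s in enumerate(sents) if not drop[i]))
    return result
```

Note how blunt the instruments are: `'{' in text` nukes any page quoting a code snippet, and the Javascript rule deletes every line that so much as mentions the language.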
Looking at that list, I wonder what the unintended consequences of decisions like these are. If you want to create something related to sentiment analysis, the swear words you discarded are a useful signal, not noise, right? If you wanted to use the dataset somehow for your tour guide business in Austria, how does it handle the village called Fucking? Does T5 understand the British colloquialism for cigarettes? Can ornithologists talk to it about penguins and eagles, but not about yellow-bellied tits and blue-footed boobies?
That made me think of something along the lines of "backdooring a dataset": introducing hard-to-find but easy-to-trigger failure modes, or fingerprinting any application built on top of it.
I reckon you'd need more than 9 "\n" characters to get it done.
But in seriousness, the 10 lines would just use local storage or whatnot to store a tag, then call tracker.com?tag=... on each page load.
"Rest is done on the server (TM)"