Ah wow, thanks! Not sure how I missed that. For other interested parties, here's the key section:
> Unfortunately, the majority of [the text in Common Crawl] is not natural language. Instead, it largely comprises gibberish or boiler-plate text like menus, error messages, or duplicate text. Furthermore, a good deal of the scraped text contains content that is unlikely to be helpful for any of the tasks we consider (offensive language, placeholder text, source code, etc.). To address these issues, we used the following heuristics for cleaning up Common Crawl’s web extracted text:
> • We only retained lines that ended in a terminal punctuation mark (i.e. a period, exclamation mark, question mark, or end quotation mark).
> • Many of the scraped pages contained warnings stating that Javascript should be enabled so we removed any line with the word Javascript.
> • Some pages had placeholder “lorem ipsum” text; we removed any page where the phrase “lorem ipsum” appeared.
> • Some pages inadvertently contained code. Since the curly bracket “{” appears in many programming languages (such as Javascript, widely used on the web) but not in natural text, we removed any pages that contained a curly bracket.
> • To deduplicate the dataset, we discarded all but one of any three-sentence span occurring more than once in the dataset.
> Additionally, since most of our downstream tasks are focused on English-language text, we used langdetect [https://pypi.org/project/langdetect/] to filter out any pages that were not classified as English with a probability of at least 0.99.
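For anyone curious what those heuristics look like in practice, here's a minimal Python sketch of my own (not the actual T5/C4 code, and I've left out the langdetect language filter); all function names are mine:

```python
import re

# Line must end in ". ! ?" or a closing quote to be kept (heuristic 1).
TERMINAL_PUNCT = ('.', '!', '?', '"', '\u201d', '\u2019')

def clean_page(text):
    """Apply the page/line heuristics; return cleaned text, or None to drop the page."""
    # Drop the whole page on "lorem ipsum" or any curly bracket (heuristics 3 and 4).
    if 'lorem ipsum' in text.lower() or '{' in text:
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines ending in terminal punctuation (heuristic 1)...
        if not line.endswith(TERMINAL_PUNCT):
            continue
        # ...that don't mention Javascript (heuristic 2).
        if 'javascript' in line.lower():
            continue
        kept.append(line)
    return '\n'.join(kept) if kept else None

def dedup_three_sentence_spans(pages):
    """Discard all but the first occurrence of any three-sentence span (heuristic 5)."""
    seen = set()
    result = []
    for page in pages:
        # Crude sentence split; the real pipeline is presumably more careful.
        sents = re.split(r'(?<=[.!?])\s+', page)
        drop = [False] * len(sents)
        for i in range(len(sents) - 2):
            span = tuple(sents[i:i + 3])
            if span in seen:
                drop[i] = drop[i + 1] = drop[i + 2] = True
            else:
                seen.add(span)
        result.append(' '.join(s for i, s in enumerate(sents) if not drop[i]))
    return result
```

Note how blunt the instruments are: `'{' in text` nukes any page quoting a code snippet, and the Javascript rule deletes every line that so much as mentions the language.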
Looking at that list, I wonder what the unintended consequences of decisions like these are. If you want to create something related to sentiment analysis, the swear words you discarded are a useful signal, not noise, right? If you wanted to use the dataset somehow for your tour guide business in Austria, how does it handle the village called Fucking? Does T5 understand the British colloquialism for cigarettes? Can ornithologists talk to it about penguins and eagles, but not about yellow-bellied tits and blue-footed boobies?
That made me think of something along the lines of "backdooring a dataset": introducing hard-to-find but easy-to-trigger failure modes, or fingerprinting any application built on top of it.
I reckon you'd need more than 9 "\n" characters to get it done.
But in seriousness, the 10 lines would just use local storage or whatnot to store a tag, then call tracker.com?tag=... on each page load.
"Rest is done on the server (TM)"