The 'fill in the N blanks' results at the end are fascinating! N<64 are all pretty normal, but then for N=64 and N=512, it starts going on about the old 1930s cookbook it has and its grad school experiences! Wild. I think I would not be able to distinguish this from a selection of real Amazon reviews or similar informal text.
Question: How did they obtain the Colossal Clean Crawled Corpus (C4) they mention in the article?
Options:
1. "Mechanical Turk" style, a massive undertaking to manually clean up Common Crawl, perhaps using underpaid labor in third world countries (such as samasource.com does)
2. By means of somehow getting the internet to do it for them with something like reCAPTCHA
3. With the help of machine learning / traditional text processing
Ah wow, thanks! Not sure how I missed that. For other interested parties, here's the key section:
> Unfortunately, the majority of [the text in Common Crawl] is not natural language. Instead, it largely comprises gibberish or boiler-plate text like menus, error messages, or duplicate text. Furthermore, a good deal of the scraped text contains content that is unlikely to be helpful for any of the tasks we consider (offensive language, placeholder text, source code, etc.). To address these issues, we used the following heuristics for cleaning up Common Crawl’s web extracted text:
> • We only retained lines that ended in a terminal punctuation mark (i.e. a period, exclamation mark, question mark, or end quotation mark).
> • Many of the scraped pages contained warnings stating that Javascript should be enabled so we removed any line with the word Javascript.
> • Some pages had placeholder “lorem ipsum” text; we removed any page where the phrase “lorem ipsum” appeared.
> • Some pages inadvertently contained code. Since the curly bracket “{” appears in many programming languages (such as Javascript, widely used on the web) but not in natural text, we removed any pages that contained a curly bracket.
> • To deduplicate the dataset, we discarded all but one of any three-sentence span occurring more than once in the dataset.
> Additionally, since most of our downstream tasks are focused on English-language text, we used langdetect [https://pypi.org/project/langdetect/] to filter out any pages that were not classified as English with a probability of at least 0.99.
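In code, those heuristics amount to roughly the following (my own sketch, not the actual C4 pipeline, which had to run at Common Crawl scale):

```python
import re
from langdetect import detect_langs  # pip install langdetect

TERMINAL = ('.', '!', '?', '"', '\u201d')  # sentence-ending punctuation

def keep_line(line):
    """Line-level filters: terminal punctuation, no Javascript warnings."""
    line = line.strip()
    return line.endswith(TERMINAL) and 'javascript' not in line.lower()

def keep_page(text):
    """Page-level filters: no code, no placeholder text, English only."""
    if '{' in text or 'lorem ipsum' in text.lower():
        return False
    try:
        langs = detect_langs(text)
    except Exception:  # langdetect raises on e.g. empty/whitespace input
        return False
    return any(l.lang == 'en' and l.prob >= 0.99 for l in langs)

def dedupe(pages, n=3):
    """Naive approximation of the three-sentence-span dedup (the real thing
    drops whole spans and can't keep everything in one in-memory set)."""
    seen = set()
    for page in pages:
        sentences = re.split(r'(?<=[.!?])\s+', page)
        kept = []
        for i, s in enumerate(sentences):
            if i >= n - 1:
                span = ' '.join(sentences[i - n + 1:i + 1])
                if span in seen:
                    continue
                seen.add(span)
            kept.append(s)
        yield ' '.join(kept)

def clean(pages):
    """Full pipeline: line filters, then page filters, then dedup."""
    line_filtered = ('\n'.join(filter(keep_line, p.splitlines())) for p in pages)
    return dedupe(p for p in line_filtered if keep_page(p))
```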
Looking at that list, I wonder what the unintended consequences of decisions like these are. If you want to create something related to sentiment analysis, the swear words you discarded are a useful signal, not noise, right? If you wanted to use the dataset somehow for your tour guide business in Austria, how does it handle the village called Fucking? Does T5 understand the British colloquialism for cigarettes? Can ornithologists talk to it about penguins and eagles, but not about yellow-bellied tits and blue-footed boobies?
That made me think of something along the lines of "backdooring a dataset" by introducing some hard to find but easy to trigger failure modes or fingerprinting for any application built on top of it.
Agreed! The interesting thing is that basic unsupervised pre-training seems to produce a model which functions not only as a knowledge base but also an NLU system which can effectively query the knowledge base using natural text questions. This is exactly what our follow-up paper is on.
Yes, unfortunately we have to rely on the very brittle "exact match" method of evaluating whether an answer is correct. FWIW and perhaps surprisingly, this is the primary way question-answering systems are evaluated in common benchmarks. I totally agree that fine-tuning T5 for answer grading would be super interesting!
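To be concrete, "exact match" here is essentially string equality after light normalization, roughly what the standard SQuAD evaluation script does (a sketch, not our exact code):

```python
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace
    (the normalization used by the standard SQuAD eval script)."""
    s = s.lower()
    s = ''.join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return ' '.join(s.split())

def exact_match(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

# exact_match("The Eiffel Tower!", "eiffel tower") -> True
# exact_match("Paris, France", "Paris") -> False, even though a
# human grader would likely accept it.
```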
I think it makes some sense to evaluate models like this, as you want to be conservative with the answers you accept (though my second example shows that it isn't always conservative), and models don't have feelings to hurt if they are docked points for not being precise enough. Humans, of course, are more sensitive.
I'm sorry for being blunt, but is it possible that the `very brittle "exact match" method of evaluating whether an answer is correct` means value equality? Is `==` the secret sauce?
“Loss of consciousness was due to cerebral hypoxia due to cardiac arrest resulting from myocardial hypoxia. Factors of temperature, pressure and environmental concentrations of carbon monoxide, carbon dioxide, oxygen and pulmonary irritants were changing extremely rapidly. It is impossible to integrate these variables on the basis of available information with the dynamic physiological and metabolic conditions they produced, in order to arrive at a precise statement of time when consciousness was lost and when death supervened. The combined effect of these environmental factors dramatically increased the lethal effect of any factor by itself. It is estimated that consciousness was lost between 15 and 30 seconds after the first suit failed. Chances of resuscitation decreased rapidly thereafter and were irrevocably lost within 4 minutes.”
Yeah, it asked who was the director of the CIA from 1976-1981. I answered "George Bush" (lazy, typing on phone); it answered the more exact "George H. W. Bush". I was marked correct, but its answer was marked wrong. It also asked me what country the island of "Honsh" (should be Honshu) was in. I answered Japan and was correct, though. It guessed Iceland.
The blogpost has a summary of our paper from October (a bit late, sorry!) but also has some (fun?) new results on closed-book question answering and fill-in-the-blank text generation.
No mention of MASS by Microsoft? It was afaik one of the first pretraining schemes for a full transformer outside of XLM.
Imho a bit unfortunate, as is calling just the decoder or just the encoder of a transformer "a transformer", as happened with GPT and BERT, which now forces people to say "full transformer" or use phrases like the title of the blog post.
We include MASS in our empirical survey (see e.g. section 3.3.2 of our paper, https://arxiv.org/pdf/1910.10683.pdf). FWIW, people were pre-training Transformers before MASS, e.g. "Improving Language Understanding by Generative Pre-Training" by Radford et al. from 2018. Even further back, "Semi-Supervised Sequence Learning" by Dai et al. describes pre-training an RNN encoder-decoder model for subsequent transfer.
But Radford is just pretraining the decoder, which is qualitatively different from a seq2seq approach such as MASS. If we just look at the original paper from Vaswani, then "pretraining a transformer" imho should only ever have meant pretraining the encoder and decoder. Obviously that ship has sailed.
(I've done work in QA and have played at building Jeopardy style QA models)
Watson (Jeopardy Watson, not the IBM branding exercise Watson is now) has much weaker text understanding models, but has much much better optimisations for the incremental style of data release that you see in Jeopardy (ie, you get more and more data the longer you listen). IBM did a lot of work optimising when to answer as well as trying to get the correct answer.
The closest analogy that is regularly studied in modern QA research is "Quizbowl"-style datasets, but these tend to be much smaller than the SQuAD datasets that most modern neural network QA systems are built against.
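The "when to answer" part is essentially a sequential decision problem; a toy version of the buzzing policy (the `model` interface here is hypothetical):

```python
def play_question(model, words, threshold=0.8):
    """Toy buzzer: re-query the QA model as each word of the clue is
    revealed, and buzz once confidence clears a threshold."""
    answer, confidence = None, 0.0
    for t in range(1, len(words) + 1):
        prefix = ' '.join(words[:t])
        answer, confidence = model.predict(prefix)  # hypothetical interface
        if confidence >= threshold:
            return answer, t          # buzz early
    return answer, len(words)         # forced to answer at the end
```

The real system was much fancier about it (weighing game state, not just a fixed threshold), but that's the shape of the problem.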
The example figure has a weird entry for the summary task. The input is:
"summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississippi..."
And the output is:
"six people hospitalized after a storm in attala county."
That's quite a bad summary: there's no mention of "six" people in the original text, and no mention of hospitalization. And "attala" county is too specific, a precision not present in the original text.
If that's the result of their model, that's not good. If it's coming from the training set, it's an even bigger problem. I guess it's the result of the model, because some issues can be explained by correlations ("emergency" correlates with "hospital", "mississippi" correlates with "attala").
I'm wondering why they chose this example for the flagship figure of their paper.
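(If anyone wants to poke at this themselves: the HuggingFace port of T5 runs the same prefixed text-to-text interface; a minimal sketch using the public t5-small checkpoint, which won't reproduce the 11B model's output:)

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = ("summarize: state authorities dispatched emergency crews tuesday to "
        "survey the damage after an onslaught of severe weather in mississippi...")
input_ids = tokenizer.encode(text, return_tensors="pt")
output_ids = model.generate(input_ids, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```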
I missed that ellipsis cue, thanks for pointing it out. The complete excerpt isn't in the paper itself, though; it's probably in their released data.
Interesting thought! I think you'd have to provide more than the image XML though. The locality of XML elements in text form doesn't necessarily correspond with their locality in rendered image form, that would be tricky as there wouldn't be a lot of "context" to go off of.
Would pre-training on non-English text (using a joint dictionary) improve translation performance?
I'm not sure if the information on http://nlpprogress.com/english/machine_translation.html
is accurate, but it appears that the top translation results rely on backtranslation, boosting, and other data augmentation techniques with a vanilla transformer model. It would be interesting to see the BLEU scores for a T5 optimized specifically for translation.
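For context, back-translation manufactures extra parallel data by training a reverse model and translating monolingual target-side text back into the source language; a rough sketch (the `train`/`translate` arguments are placeholders, not any particular library's API):

```python
def backtranslate_augment(parallel, mono_tgt, train, translate):
    """parallel: list of (src, tgt) pairs; mono_tgt: target-language text."""
    # 1. Train a reverse (target -> source) model on the real pairs.
    reverse_model = train([(tgt, src) for src, tgt in parallel])
    # 2. Use it to generate synthetic source sentences for monolingual text.
    synthetic = [(translate(reverse_model, tgt), tgt) for tgt in mono_tgt]
    # 3. Train the forward model on real + synthetic data.
    return train(parallel + synthetic)
```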
It'd be really interesting to see if T5 could be made to perform a reading comprehension or summarization task where the source article is in a different language from the question and answer. Seems like a potentially interesting application for a model as flexible as this one.
> the full 11 billion parameter model achieves the exact text of the answer 50%[ to 30% of the time]
If you need to tweak 11 billion parameters to get a particular result, I don’t see how you can call the thing a model; it's more like a component of a model.
Will it be able to learn the universal Turing machine? We could "train" a compiler & runtime into existence. Probably too much of a correctness constraint (since the output has to be the exact result of computing the encoded program with the encoded machine).
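A toy version of that idea: generate text-to-text pairs mapping tiny programs to their outputs and demand exact-match correctness. Something like (a sketch; `eval` stands in for the "runtime" being distilled):

```python
import random

def make_examples(n=1000):
    """Toy 'train an interpreter into existence' data: arithmetic
    expressions paired with their exact results."""
    examples = []
    for _ in range(n):
        expr = str(random.randint(0, 9))
        for _ in range(random.randint(1, 3)):
            expr += random.choice('+-*') + str(random.randint(0, 9))
        examples.append((f'compute: {expr}', str(eval(expr))))
    return examples

# e.g. ('compute: 3+4*2', '11') -- any deviation from the exact output
# counts as a failure, which is the hard correctness constraint above.
```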