The 'fill in the N blanks' results at the end are fascinating! N<64 are all pretty normal, but then for N=64 and N=512, it starts going on about the old 1930s cookbook it has and its grad school experiences! Wild. I think I would not be able to distinguish this from a selection of real Amazon reviews or similar informal text.
Question: How did they obtain the Colossal Clean Crawled Corpus (C4) they mention in the article?
Options:
1. "Mechanical Turk" style, a massive undertaking to manually clean up Common Crawl, perhaps using underpaid labor in third world countries (such as samasource.com does)
2. By means of somehow getting the internet to do it for them with something like reCAPTCHA
3. With the help of machine learning / traditional text processing
Ah wow, thanks! Not sure how I missed that. For other interested parties, here's the key section:
> Unfortunately, the majority of [the text in Common Crawl] is not natural language. Instead, it largely comprises gibberish or boiler-plate text like menus, error messages, or duplicate text. Furthermore, a good deal of the scraped text contains content that is unlikely to be helpful for any of the tasks we consider (offensive language, placeholder text, source code, etc.). To address these issues, we used the following heuristics for cleaning up Common Crawl’s web extracted text:
> • We only retained lines that ended in a terminal punctuation mark (i.e. a period, exclamation mark, question mark, or end quotation mark).
> • Many of the scraped pages contained warnings stating that Javascript should be enabled so we removed any line with the word Javascript.
> • Some pages had placeholder “lorem ipsum” text; we removed any page where the phrase “lorem ipsum” appeared.
> • Some pages inadvertently contained code. Since the curly bracket “{” appears in many programming languages (such as Javascript, widely used on the web) but not in natural text, we removed any pages that contained a curly bracket.
> • To deduplicate the dataset, we discarded all but one of any three-sentence span occurring more than once in the dataset.
> Additionally, since most of our downstream tasks are focused on English-language text, we used langdetect [https://pypi.org/project/langdetect/] to filter out any pages that were not classified as English with a probability of at least 0.99.
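In code, those heuristics amount to roughly the following (my own sketch, not the actual C4 pipeline, which had to run at Common Crawl scale):

```python
import re
from langdetect import detect_langs  # pip install langdetect

TERMINAL = ('.', '!', '?', '"', '\u201d')  # sentence-ending punctuation

def keep_line(line):
    """Line-level filters: terminal punctuation, no Javascript warnings."""
    line = line.strip()
    return line.endswith(TERMINAL) and 'javascript' not in line.lower()

def keep_page(text):
    """Page-level filters: no code, no placeholder text, English only."""
    if '{' in text or 'lorem ipsum' in text.lower():
        return False
    try:
        langs = detect_langs(text)
    except Exception:  # langdetect raises on e.g. empty/whitespace input
        return False
    return any(l.lang == 'en' and l.prob >= 0.99 for l in langs)

def dedupe(pages, n=3):
    """Naive approximation of the three-sentence-span dedup (the real thing
    drops whole spans and can't keep everything in one in-memory set)."""
    seen = set()
    for page in pages:
        sentences = re.split(r'(?<=[.!?])\s+', page)
        kept = []
        for i, s in enumerate(sentences):
            if i >= n - 1:
                span = ' '.join(sentences[i - n + 1:i + 1])
                if span in seen:
                    continue
                seen.add(span)
            kept.append(s)
        yield ' '.join(kept)

def clean(pages):
    """Full pipeline: line filters, then page filters, then dedup."""
    line_filtered = ('\n'.join(filter(keep_line, p.splitlines())) for p in pages)
    return dedupe(p for p in line_filtered if keep_page(p))
```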
Looking at that list, I wonder what the unintended consequences of decisions like these are. If you want to create something related to sentiment analysis, the swear words you discarded are a useful signal, not noise, right? If you wanted to use the dataset somehow for your tour guide business in Austria, how does it handle the village called Fucking? Does T5 understand the British colloquialism for cigarettes? Can ornithologists talk to it about penguins and eagles, but not about yellow-bellied tits and blue-footed boobies?
That made me think of something along the lines of "backdooring a dataset" by introducing some hard to find but easy to trigger failure modes or fingerprinting for any application built on top of it.
Agreed! The interesting thing is that basic unsupervised pre-training seems to produce a model which functions not only as a knowledge base but also an NLU system which can effectively query the knowledge base using natural text questions. This is exactly what our follow-up paper is on.
Yes, unfortunately we have to rely on the very brittle "exact match" method of evaluating whether an answer is correct. FWIW and perhaps surprisingly, this is the primary way question-answering systems are evaluated in common benchmarks. I totally agree that fine-tuning T5 for answer grading would be super interesting!
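To be concrete, "exact match" here is essentially string equality after light normalization, roughly what the standard SQuAD evaluation script does (a sketch, not our exact code):

```python
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace
    (the normalization used by the standard SQuAD eval script)."""
    s = s.lower()
    s = ''.join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return ' '.join(s.split())

def exact_match(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

# exact_match("The Eiffel Tower!", "eiffel tower") -> True
# exact_match("Paris, France", "Paris") -> False, even though a
# human grader would likely accept it.
```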
I think it makes some sense to evaluate models like this, as you want to be conservative with the answers you accept (though my second example shows that it isn't always conservative), and models don't have feelings to hurt if they are docked points for not being precise enough. Humans, of course, are more sensitive.
I'm sorry for being blunt, but is it possible that the `very brittle "exact match" method of evaluating whether an answer is correct` means value equality? Is `==` the secret sauce?
“Loss of consciousness was due to cerebral hypoxia due to cardiac arrest resulting from myocardial hypoxia. Factors of temperature, pressure and environmental concentrations of carbon monoxide, carbon dioxide, oxygen and pulmonary irritants were changing extremely rapidly. It is impossible to integrate these variables on the basis of available information with the dynamic physiological and metabolic conditions they produced, in order to arrive at a precise statement of time when consciousness was lost and when death supervened. The combined effect of these environmental factors dramatically increased the lethal effect of any factor by itself. It is estimated that consciousness was lost between 15 and 30 seconds after the first suit failed. Chances of resuscitation decreased rapidly thereafter and were irrevocably lost within 4 minutes.”
Yeah, it asked who was the director of the CIA from 1976-1981. I answered "George Bush" (lazy, typing on phone); it answered the more exact "George H. W. Bush". I was marked correct, but its answer was marked wrong. It also asked me what country the island of "Honsh" (should be Honshu) was in. I answered Japan and was correct, though. It guessed Iceland.
The blogpost has a summary of our paper from October (a bit late, sorry!) but also has some (fun?) new results on closed-book question answering and fill-in-the-blank text generation.
No mention of MASS by Microsoft? It was afaik one of the first pretraining schemes for a full transformer outside of XLM.
Imho a bit unfortunate, as is calling just the decoder or just the encoder of a transformer "a transformer", as happened with GPT and BERT, which now forces people to say "full transformer" or use phrases like the title of the blog post.
We include MASS in our empirical survey (see e.g. section 3.3.2 of our paper, https://arxiv.org/pdf/1910.10683.pdf). FWIW, people were pre-training Transformers before MASS, e.g. "Improving Language Understanding by Generative Pre-Training" by Radford et al. from 2018. Even further back, "Semi-Supervised Sequence Learning" by Dai et al. describes pre-training an RNN encoder-decoder model for subsequent transfer.
But Radford is just pretraining the decoder, which is qualitatively different from a seq2seq approach such as MASS. If we just look at the original paper from Vaswani, then "pretraining a transformer" imho should only ever have meant pretraining the encoder and decoder. Obviously that ship has sailed.
(I've done work in QA and have played at building Jeopardy style QA models)
Watson (Jeopardy Watson, not the IBM branding exercise Watson is now) has much weaker text understanding models, but has much much better optimisations for the incremental style of data release that you see in Jeopardy (ie, you get more and more data the longer you listen). IBM did a lot of work optimising when to answer as well as trying to get the correct answer.
The closest analogy that is regularly studied in modern QA research is "Quizbowl"-style datasets, but these tend to be much smaller than the SQuAD datasets that most modern neural network QA systems are built against.
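The "when to answer" part is essentially a sequential decision problem; a toy version of the buzzing policy (the `model` interface here is hypothetical):

```python
def play_question(model, words, threshold=0.8):
    """Toy buzzer: re-query the QA model as each word of the clue is
    revealed, and buzz once confidence clears a threshold."""
    answer, confidence = None, 0.0
    for t in range(1, len(words) + 1):
        prefix = ' '.join(words[:t])
        answer, confidence = model.predict(prefix)  # hypothetical interface
        if confidence >= threshold:
            return answer, t          # buzz early
    return answer, len(words)         # forced to answer at the end
```

The real system was much fancier about it (weighing game state, not just a fixed threshold), but that's the shape of the problem.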
The example figure has a weird entry for the summary task. The input is:
"summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississippi..."
And the output is:
"six people hospitalized after a storm in attala county."
That's quite a bad summary: there's no mention of "six" people in the original text, and no mention of hospitalization. And "attala" county is too specific, a precision not present in the original text.
If that's the result of their model, that's not good. If it's coming from the training set, it's an even bigger problem. I guess it's the result of the model, because some issues can be explained by correlations ("emergency" correlates with "hospital", "mississippi" correlates with "attala").
I'm wondering why they chose this example for the flagship figure of their paper.
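(If anyone wants to poke at this themselves: the HuggingFace port of T5 runs the same prefixed text-to-text interface; a minimal sketch using the public t5-small checkpoint, which won't reproduce the 11B model's output:)

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = ("summarize: state authorities dispatched emergency crews tuesday to "
        "survey the damage after an onslaught of severe weather in mississippi...")
input_ids = tokenizer.encode(text, return_tensors="pt")
output_ids = model.generate(input_ids, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```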
I missed that ellipsis cue, thanks for pointing it out. The complete excerpt isn't in the paper itself, though; it's probably in their released data.
Interesting thought! I think you'd have to provide more than the image XML though. The locality of XML elements in text form doesn't necessarily correspond with their locality in rendered image form, that would be tricky as there wouldn't be a lot of "context" to go off of.
Would pre-training on non-English text (using a joint dictionary) improve translation performance?
I'm not sure if the information on http://nlpprogress.com/english/machine_translation.html
is accurate, but it appears that the top translation results rely on backtranslation, boosting, and other data augmentation techniques with a vanilla transformer model. It would be interesting to see the BLEU scores for a T5 optimized specifically for translation.
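For context, back-translation manufactures extra parallel data by training a reverse model and translating monolingual target-side text back into the source language; a rough sketch (the `train`/`translate` arguments are placeholders, not any particular library's API):

```python
def backtranslate_augment(parallel, mono_tgt, train, translate):
    """parallel: list of (src, tgt) pairs; mono_tgt: target-language text."""
    # 1. Train a reverse (target -> source) model on the real pairs.
    reverse_model = train([(tgt, src) for src, tgt in parallel])
    # 2. Use it to generate synthetic source sentences for monolingual text.
    synthetic = [(translate(reverse_model, tgt), tgt) for tgt in mono_tgt]
    # 3. Train the forward model on real + synthetic data.
    return train(parallel + synthetic)
```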
It'd be really interesting to see if T5 could be made to perform a reading comprehension or summarization task where the source article is in a different language from the question and answer. Seems like a potentially interesting application for a model as flexible as this one.
> the full 11 billion parameter model achieves the exact text of the answer 50%[ to 30% of the time]
If you need to tweak 11 billion parameters to get a particular result, I don’t see how you can call the thing a model; it's more like a component of a model.
Will it be able to learn the universal Turing machine? We could "train" a compiler & runtime into existence. Probably too much of a correctness constraint (since the output has to be the exact result of computing the encoded program with the encoded machine).
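A toy version of that idea: generate text-to-text pairs mapping tiny programs to their outputs and demand exact-match correctness. Something like (a sketch; `eval` stands in for the "runtime" being distilled):

```python
import random

def make_examples(n=1000):
    """Toy 'train an interpreter into existence' data: arithmetic
    expressions paired with their exact results."""
    examples = []
    for _ in range(n):
        expr = str(random.randint(0, 9))
        for _ in range(random.randint(1, 3)):
            expr += random.choice('+-*') + str(random.randint(0, 9))
        examples.append((f'compute: {expr}', str(eval(expr))))
    return examples

# e.g. ('compute: 3+4*2', '11') -- any deviation from the exact output
# counts as a failure, which is the hard correctness constraint above.
```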