T5: The Text-to-Text Transfer Transformer (googleblog.com)
338 points by theafh on Feb 24, 2020 | hide | past | favorite | 66 comments


The 'fill in the N blanks' results at the end are fascinating! N<64 are all pretty normal, but then for N=64 and N=512, it starts going on about the old 1930s cookbook it has and its grad school experiences! Wild. I think I would not be able to distinguish this from a selection of real Amazon reviews or similar informal text.


And that is because a good chunk of Amazon reviews are written with various machine-generation systems...


It reads like a typical blog recipe introduction.


Which makes sense, given the corpus they trained T5 on.


Question: How did they obtain the Colossal Clean Crawled Corpus (C4) they mention in the article?

Options:

1. "Mechanical Turk" style, a massive undertaking to manually clean up Common Crawl, perhaps using underpaid labor in third world countries (such as samasource.com does)

2. By means of somehow getting the internet to do it for them with something like reCAPTCHA

3. With the help of machine learning / traditional text processing

4. Some other way

Anyone have any ideas? I'm intrigued. The paper [https://arxiv.org/pdf/1910.10683.pdf] and the website [https://www.tensorflow.org/datasets/catalog/c4] mention almost nothing, except for an option to switch off the cleaning & deduplication, which hints at option number 3.


In section 2.2 of the paper they describe the process they used: applying a series of heuristic rules to the text. (Also, the dataset is 750 GB...)


Ah wow, thanks! Not sure how I missed that. For other interested parties, here's the key section:

> Unfortunately, the majority of [the text in Common Crawl] is not natural language. Instead, it largely comprises gibberish or boiler-plate text like menus, error messages, or duplicate text. Furthermore, a good deal of the scraped text contains content that is unlikely to be helpful for any of the tasks we consider (offensive language, placeholder text, source code, etc.). To address these issues, we used the following heuristics for cleaning up Common Crawl’s web extracted text:

• We only retained lines that ended in a terminal punctuation mark (i.e. a period, exclamation mark, question mark, or end quotation mark).

• We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”. [https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...]

• Many of the scraped pages contained warnings stating that Javascript should be enabled so we removed any line with the word Javascript.

• Some pages had placeholder “lorem ipsum” text; we removed any page where the phrase “lorem ipsum” appeared.

• Some pages inadvertently contained code. Since the curly bracket “{” appears in many programming languages (such as Javascript, widely used on the web) but not in natural text, we removed any pages that contained a curly bracket.

• To deduplicate the dataset, we discarded all but one of any three-sentence span occurring more than once in the dataset.

Additionally, since most of our downstream tasks are focused on English-language text, we used langdetect [https://pypi.org/project/langdetect/] to filter out any pages that were not classified as English with a probability of at least 0.99.
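Those heuristics are simple enough to sketch in a few lines. This is a rough approximation, not Google's actual pipeline; in particular, the dedup step here drops whole pages containing a previously seen three-sentence span rather than excising just the span, and BAD_WORDS is a stand-in for the LDNOOBW list:

```python
# Rough sketch of the C4 cleaning heuristics quoted above (approximation,
# not Google's actual code).

import re

BAD_WORDS = {"badword1", "badword2"}  # stand-in for the LDNOOBW list

def clean_page(text):
    """Return cleaned page text, or None if the whole page is dropped."""
    lowered = text.lower()
    if "lorem ipsum" in lowered or "{" in text:   # placeholder / code pages
        return None
    if any(w in lowered.split() for w in BAD_WORDS):  # bad-word pages
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if "javascript" in line.lower():          # "enable Javascript" warnings
            continue
        if line and line[-1] in '.!?"':           # terminal punctuation filter
            kept.append(line)
    return "\n".join(kept) if kept else None

def dedup(pages):
    """Drop pages repeating any previously seen three-sentence span."""
    seen, out = set(), []
    for page in pages:
        sentences = re.split(r"(?<=[.!?])\s+", page)
        spans = [tuple(sentences[i:i + 3]) for i in range(len(sentences) - 2)]
        if any(s in seen for s in spans):
            continue
        seen.update(spans)
        out.append(page)
    return out
```

The langdetect filter they mention would sit on top of this, keeping only pages classified as English with probability at least 0.99.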


> We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”. [https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...]

Looking at that list, I wonder what the unintended consequences of a decision like this are. If you want to create something related to sentiment analysis, those swear words you discarded are a useful signal, not noise, right? If you wanted to use the dataset somehow for your tour guide business in Austria, how does it handle the village called Fucking? Does T5 understand the British colloquialism for cigarettes? Can ornithologists talk to it about penguins and eagles, but not about yellow-bellied tits and blue-footed boobies?


That made me think of something along the lines of "backdooring a dataset" by introducing some hard to find but easy to trigger failure modes or fingerprinting for any application built on top of it.


Sounds like an awesome idea, put some easter eggs in the common crawl to compromise the future of NLP.


...not to mention Rehnquist, Brownmiller, Potter Stewart, etc.


Code to reproduce can be found here: https://www.tensorflow.org/datasets/catalog/c4

We are also talking to Common Crawl to see if they will host a prepared copy since we do not have redistribution rights.


To put these results in perspective, the T5 team went head-to-head with the model in a pub trivia challenge and lost!

Trivia fact recall and NLP seem like two quite different tasks even though both are required to do well in a quiz.


Agreed! The interesting thing is that basic unsupervised pre-training seems to produce a model which functions not only as a knowledge base but also an NLU system which can effectively query the knowledge base using natural text questions. This is exactly what our follow-up paper is on.


The trivia game (https://t5-trivia.glitch.me/) needs a little work.

> Q: How did Gus Grissom, Ed White and Roger B. Chaffee die in 1967?

> You: "Apollo 1" WRONG

> T5: "They were killed when their Apollo 1 spacecraft exploded" WRONG

> Correct answer: burned to death

> Q: Which Alpine peak is known in Italy as Monte Cervino?

> You: "Monte Cervino" CORRECT

I wonder how many of the problems with this game could be fixed by applying T5 itself to the answer grading.


Yes, unfortunately we have to rely on the very brittle "exact match" method of evaluating whether an answer is correct. FWIW and perhaps surprisingly, this is the primary way question-answering systems are evaluated in common benchmarks. I totally agree that fine-tuning T5 for answer grading would be super interesting!


I think it makes some sense to evaluate models like this, as you want to be conservative with the answers you accept (though my second example shows that it isn't always conservative), and models don't have feelings to hurt if they are docked points for not being precise enough. Humans, of course, are more sensitive.


Does that mean that answer grading would become like comparing summaries of a given text?


I'm sorry for being blunt, but is it possible that the `very brittle "exact match" method of evaluating whether an answer is correct` means value equality? Is `==` the secret sauce?


It's slightly more than that -- it also involves lowercasing and removing articles before testing for string equality.
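For the curious, that normalization is easy to sketch. This mirrors the SQuAD-style exact-match normalization (lowercase, strip punctuation and articles, collapse whitespace), which may differ in detail from the T5 team's exact code:

```python
# Sketch of "exact match" QA scoring: normalize both strings, then compare.
# Approximates the normalization common in QA benchmarks; details assumed.

import re
import string

def normalize(answer):
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)  # drop articles
    return " ".join(answer.split())                   # collapse whitespace

def exact_match(prediction, target):
    return normalize(prediction) == normalize(target)
```

This is why "Monte Cervino" scores as correct against "Matterhorn"-free gold answers only when the gold answer happens to match after normalization, and why "Apollo 1" gets no credit against "burned to death".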


Why are you replying to every single comment?


I think craffel (probably "Colin Raffel, Senior Research Scientist, Google Research") was directly involved in this research!


Yes, that's me! Sorry if I'm being overeager, I like talking about my research!


I think it's amazing how frequently people involved in various CS and IT things are directly participating in threads about their work here on HN.


”Correct answer: burned to death”

More likely: suffocated (as with most fire deaths) https://history.nasa.gov/Apollo204/invest.html:

”d. MEDICAL ANALYSIS

Loss of consciousness was due to cerebral hypoxia due to cardiac arrest resulting from myocardial hypoxia. Factors of temperature, pressure and environmental concentrations of carbon monoxide, carbon dioxide, oxygen and pulmonary irritants were changing extremely rapidly. It is impossible to integrate these variables on the basis of available information with the dynamic physiological and metabolic conditions they produced, in order to arrive at a precise statement of time when consciousness was lost and when death supervened. The combined effect of these environmental factors dramatically increased the lethal effect of any factor by itself. It is estimated that consciousness was lost between 15 and 30 seconds after the first suit failed. Chances of resuscitation decreased rapidly thereafter and were irrevocably lost within 4 minutes.”


Yeah, it asked who was the director of the CIA from 1976-1981. I answered "George Bush" (lazy, typing on phone), it answered the more exact "George H. W. Bush". It marked me correct, but it was marked wrong. It also asked me what country the island of "Honsh" was in (should be Honshu). I answered Japan and was correct, though. It guessed Iceland.


TIL: Leonardo Di Caprio, not Robert Redford, played Jay Gatsby in a 1974 adaptation of The Great Gatsby.


Reminds me of the early Unix ‘quiz’ game.


Anyone know what's new in the blogpost? T5 has been out for a few months now.


The blogpost has a summary of our paper from October (a bit late, sorry!) but also has some (fun?) new results on closed-book question answering and fill-in-the-blank text generation.


No mention of MASS by Microsoft? It was afaik one of the first pretraining schemes for a full transformer outside of XLM.

Imho a bit unfortunate, as is calling just the decoder or just the encoder of a transformer "a transformer", as has happened with GPT and BERT, which now forces people to say "full transformer" or to use phrases like the title of the blog post.


We include MASS in our empirical survey (see e.g. section 3.3.2 of our paper, https://arxiv.org/pdf/1910.10683.pdf). FWIW, people were pre-training Transformers before MASS, e.g. "Improving Language Understanding by Generative Pre-Training" by Radford et al. from 2018. Even further back, "Semi-Supervised Sequence Learning" by Dai et al. describe pre-training an RNN encoder-decoder model for subsequent transfer.


But Radford is just pretraining the decoder, which is qualitatively different from a seq2seq approach such as MASS. If we just look at the original paper from Vaswani, then "pretraining a transformer" imho should only ever have meant pretraining the encoder and decoder. Obviously that ship has sailed.


Wonder how this would compare with Watson at playing Jeopardy...


I would assume most QA (question answering) models blow Watson out of the water. A lot has been done since then. See: https://aclweb.org/aclwiki/Question_Answering_(State_of_the_...


(I've done work in QA and have played at building Jeopardy style QA models)

Watson (Jeopardy Watson, not the IBM branding exercise Watson is now) has much weaker text understanding models, but has much much better optimisations for the incremental style of data release that you see in Jeopardy (ie, you get more and more data the longer you listen). IBM did a lot of work optimising when to answer as well as trying to get the correct answer.

The closest analogy that is regularly studied in modern QA research is "Quizbowl"-style datasets, but these tend to be much smaller than the SQuAD datasets that most modern neural network QA systems are built against.


Sure, but Jeopardy is all about AQ (Answer-Questioning) :)


Probably better than this: https://twitter.com/jeopardygoat


How would this perform on the reading comprehension part of the SAT?


The example figure has a weird entry for the summary task. The input is:

"summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississippi..."

And the output is:

"six people hospitalized after a storm in attala county."

That's quite a bad summary: there's no mention of "six" people in the original text, and no mention of hospitalization. And "attala" county is too specific, a precision not present in the original text.

If that's the result of their model, that's not good. If it's coming from the training set, it's an even bigger problem. I guess it's the result of the model, because some issues can be explained by correlations ("emergency" correlates with "hospital", "mississippi" correlates with "attala").

I'm wondering why they chose this example for the flagship figure of their paper.


The “...” at the end of the phrase is the give away that they couldn’t fit the whole article in the picture. Checkout the complete paper.


I missed that ellipsis cue, thanks for pointing it out. The complete excerpt is not in the complete paper though, but it's probably in their released data.


I agree that it's a bad example. I suppose the ellipsis at the end indicates an omission of a whole section.


If you want to train or run T5 for pure text generation (GPT-2-style), Nax developed a Colab notebook for that: https://twitter.com/NaxAlpha/status/1224912629967310848 https://colab.research.google.com/drive/1-ROO7L09EupLFLQM-TW...



You can listen to the paper here: https://youtu.be/gyBdnNY1WPI


Now I'm curious what it'd give as output for the missing-tokens task if you specialized it on understanding SVG vector image data...


Interesting thought! I think you'd have to provide more than the image XML, though. The locality of XML elements in text form doesn't necessarily correspond with their locality in rendered image form, so that would be tricky: there wouldn't be a lot of "context" to go off of.


would pre-training on non-English text (using a joint dictionary) improve translation performance?

I'm not sure if the information on http://nlpprogress.com/english/machine_translation.html is accurate, but it appears that the top translation results rely on backtranslation, boosting, and other data augmentation techniques with a vanilla transformer model. It would be interesting to see the BLEU scores for a T5 that's optimized specifically for translation.
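Back-translation itself is a simple loop; here's a toy sketch where translate_target_to_source is a hypothetical stand-in for a real reverse-direction translation model, just to illustrate the idea:

```python
# Toy sketch of back-translation data augmentation: translate monolingual
# target-side text back into the source language with a reverse model,
# then train the forward model on the synthetic (source, target) pairs.
# translate_target_to_source is a hypothetical stand-in for a real model.

def backtranslate_corpus(target_sentences, translate_target_to_source):
    """Create synthetic parallel data from monolingual target text."""
    pairs = []
    for tgt in target_sentences:
        synthetic_src = translate_target_to_source(tgt)  # reverse model
        pairs.append((synthetic_src, tgt))  # forward model trains on these
    return pairs
```

The appeal is that monolingual target-language text is far more plentiful than parallel data, so this cheaply expands the training set.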


It'd be really interesting to see if T5 could be made to perform a reading comprehension or summarization task where the source article is in a different language from the question and answer. Seems like a potentially interesting application for a model as flexible as this one.


In case anyone from the team is watching, the colab link at the bottom is broken.


Thanks, fixed!


it is working for me


Isn’t that the name of Microsoft’s code generation templating language?


That's T4. The naming is pretty unfortunate.


Not to be confused with the m4 language.


Of course every equestrian (or parent of one) would know that a sentence like "the course is jumping well" is totally legit.


> Text-To-Text Transfer Transformer (T5)

> Colossal Clean Crawled Corpus (C4)

pun detector at 3.6 punits. not great, not terrible.


This is epic...

> Q: What is the opposite of an acid?

> You: alkaline

> T5: Alkali

> Correct answer: a Base

:-)


Did you consider incorporating convolution into the model ala the evolved transformer?


> the full 11 billion parameter model achieves the exact text of the answer 50%[ to 30% of the time]

If you need to tweak 11 billion parameters to get a particular result, I don't see how you can call it a model; it's more like a component of a model.


Social-media political bots about to get harder to detect.


Did you consider incorporating convolution a la 'the evolved transformer'?


Ironically this can be used to create search engine spam!


Will it be able to learn the universal Turing machine? We could "train" a compiler & runtime into existence. Probably too much of a correctness constraint (since the output has to be the exact result of computing the encoded program with the encoded machine).


> "With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings..."

So exactly like the "unix pipe" philosophy invented 47 years ago?

I guess ideas are cyclical...
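The analogy does hold at the interface level: every task is serialized to a plain string. A minimal sketch of the task-prefix convention ("translate English to German:", "summarize:", and "cola sentence:" are prefixes used in the T5 paper; the helper function here is just for illustration):

```python
# Sketch of T5's text-to-text framing: every NLP task becomes
# "prefix: input" -> "output text". The prefixes are from the T5 paper;
# the helper function is an illustrative assumption, not their API.

def to_text_task(prefix, text):
    """Serialize an NLP task as a single input string for the model."""
    return f"{prefix} {text}"

examples = [
    to_text_task("translate English to German:", "That is good."),
    to_text_task("summarize:", "state authorities dispatched emergency crews ..."),
    to_text_task("cola sentence:", "The course is jumping well."),
]
```

Like a Unix pipe, the uniform string interface is what lets one model (and one loss) cover translation, summarization, and classification alike.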



