
The visualizations seem to show non-recurrent networks, whereas my understanding is that one of the important differences between GPT-1 and GPT-2/3 is the use of recurrent networks.

This allows the output to feed back into the network, providing a rudimentary form of memory/context beyond just the input vector.
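
(For reference, a minimal NumPy sketch of what that kind of recurrence looks like in a classic RNN cell: toy names and sizes, nothing GPT-specific. The hidden state h is fed back in at every step, which is the "memory" being described.)

    import numpy as np

    def rnn_step(x, h, W_xh, W_hh, b):
        # The new hidden state depends on the current input AND the previous
        # state, so information persists beyond the current input vector.
        return np.tanh(x @ W_xh + h @ W_hh + b)

    d_in, d_h = 8, 16
    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(d_in, d_h))
    W_hh = rng.normal(size=(d_h, d_h))
    b = np.zeros(d_h)

    h = np.zeros(d_h)                      # initial state
    for x in rng.normal(size=(5, d_in)):   # a sequence of 5 input vectors
        h = rnn_step(x, h, W_xh, W_hh, b)  # h loops back, carrying context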



While models such as XLNet incorporate recurrence, GPT-2 and GPT-3 are mostly just plain decoder-only transformer models. [1]

[1] https://arxiv.org/abs/2005.14165
[2] https://d4mucfpksywv.cloudfront.net/better-language-models/l...
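
To make the distinction concrete, here is a toy greedy decoding loop for a decoder-only model (the logits_fn below is a hypothetical stand-in for a trained network; real GPT-2 implementations cache attention keys/values for speed, but the principle is the same). Nothing loops back as a hidden state: each step re-runs the model on the full token prefix, and the "memory" is just causal self-attention over the prior tokens.

    import numpy as np

    def generate(logits_fn, prompt_ids, n_new):
        ids = list(prompt_ids)
        for _ in range(n_new):
            logits = logits_fn(ids)             # one forward pass over the whole prefix
            ids.append(int(np.argmax(logits)))  # append next token; no state carried over
        return ids

    # Hypothetical stub standing in for a trained model: maps a token
    # prefix to next-token logits over a GPT-2-sized vocabulary.
    def dummy_logits_fn(ids, vocab=50257):
        rng = np.random.default_rng(sum(ids))
        return rng.normal(size=vocab)

    print(generate(dummy_logits_fn, [464, 2068], n_new=3))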



