The visualizations seem to show non-recurrent networks, whereas my understanding is that one of the important differences between GPT-1 and GPT-2/GPT-3 is the use of recurrent networks.
This allows the output to loop back into the network, providing a rudimentary form of memory/context beyond just the current input vector.
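For concreteness, here is a minimal sketch of the kind of recurrence described above: a generic Elman-style RNN cell in NumPy, where the hidden state feeds back into the next step. All names here are illustrative, not taken from any GPT implementation.

```python
import numpy as np

# Illustrative recurrent cell: the hidden state h is fed back in at
# each step, so the output at time t depends on more than the current
# input vector alone.
def rnn_step(x, h, W_xh, W_hh, b_h):
    return np.tanh(x @ W_xh + h @ W_hh + b_h)

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W_xh = rng.normal(size=(d_in, d_hid)) * 0.1
W_hh = rng.normal(size=(d_hid, d_hid)) * 0.1
b_h = np.zeros(d_hid)

h = np.zeros(d_hid)  # memory starts empty
for t, x in enumerate(rng.normal(size=(5, d_in))):  # a short input sequence
    h = rnn_step(x, h, W_xh, W_hh, b_h)  # h loops back into the next step
    print(t, h[:3])  # the state accumulates context across steps
```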