Time series prediction is always about exploiting the particular features of your distribution of time series. In standard time series prediction, those features are mostly things like "periodic patterns are continued" or "growth patterns are continued". A transformer trained on language data essentially learns time series prediction in a setting where a large variety of complex features influence the continuation. Language data is so complex and diverse that continuing a text necessitates in-context learning: being able to find common features in any kind of string of symbols, and using those to continue the text. Just consider that language data could contain huge Excel tables of all kinds of data, like stock market prices or weather recordings. It is therefore plausible that in-context learning can be very powerful, powerful enough to perform zero-shot time series continuation. Moreover, I believe that, thanks to in-context learning, language data plus the transformer architecture has the potential to really achieve general-intelligence-like behaviour, in the sense of general pattern recognition. Language data is complex enough that SGD must lead to general pattern recognition and continuation. We are only at the beginning, and right now we are focused on fine-tuning, which destroys in-context learning. But we will soon train giant transformers on every modality, on every string of symbols we can find.
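To make the contrast concrete, here is a minimal Python sketch of the two views. `seasonal_naive` hard-codes the "periodic patterns are continued" feature, and `serialize_for_prompt` turns the same series into the kind of symbol string a language model would see as context. Both function names and the toy data are my own illustrations, not taken from any library, and no actual model is called here.

```python
def seasonal_naive(history, period, horizon):
    """Continue a series by repeating its most recent full period."""
    if period <= 0 or len(history) < period:
        raise ValueError("need at least one full period of history")
    last_cycle = history[-period:]
    # The next value is the one observed exactly one period earlier.
    return [last_cycle[i % period] for i in range(horizon)]


def serialize_for_prompt(history, decimals=1):
    """Render the series as comma-separated text, ending with a trailing
    comma so that a text model's natural continuation is the next value."""
    return ", ".join(f"{x:.{decimals}f}" for x in history) + ","


# Toy series with period 3 (hypothetical data, for illustration only).
history = [10.0, 12.0, 15.0, 10.0, 12.0, 15.0, 10.0, 12.0]

forecast = seasonal_naive(history, period=3, horizon=4)
prompt = serialize_for_prompt(history)

print(forecast)  # [15.0, 10.0, 12.0, 15.0]
print(prompt)    # 10.0, 12.0, 15.0, 10.0, 12.0, 15.0, 10.0, 12.0,
```

The first function only works because someone told it the period. In the in-context view, the model gets nothing but the string in `prompt` and has to discover the period (or whatever other regularity is there) from the context alone.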