The post mentions not getting great results with the OpenAI Transformer. I haven't tried that, but using a similar framework, ULM-FiT, I narrowly beat the fastText benchmark on a 250-class dataset we use internally. I will follow up with how it does on this dataset.
ULM-FiT and OpenAI's Transformer* are quite different. Both are pretrained language models, but ULM-FiT is a standard stack of LSTMs with a particular recipe for fine-tuning (gradual unfreezing, discriminative per-layer learning rates), whereas OpenAI's Transformer uses the much newer Transformer architecture with no particularly fancy tricks in the fine-tuning itself. I suspect the difficulty is with the Transformer model itself - this is not the first time I've heard that it is difficult to train.
* = To be clear, this refers to OpenAI's pretrained Transformer model. The Transformer architecture was from work at Google.
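In case it helps to see what that recipe looks like in practice, here is a rough sketch using the fastai library. This assumes fastai v2 and a pandas DataFrame `df` with 'text' and 'label' columns - the column names, epoch counts, and learning rates are illustrative, not exactly what I ran:

    from fastai.text.all import *

    # Stage 1: fine-tune the pretrained AWD-LSTM language model on the target corpus.
    dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
    lm = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
    lm.fit_one_cycle(1, 1e-2)      # train the new head first
    lm.unfreeze()
    lm.fit_one_cycle(4, 1e-3)      # then fine-tune the whole LM
    lm.save_encoder('ft_encoder')

    # Stage 2: train the classifier on top of the fine-tuned encoder,
    # gradually unfreezing the LSTM stack with discriminative learning rates.
    dls_clf = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                      text_vocab=dls_lm.vocab)
    clf = text_classifier_learner(dls_clf, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
    clf.load_encoder('ft_encoder')
    clf.fit_one_cycle(1, 2e-2)                            # classifier head only
    clf.freeze_to(-2)
    clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))  # top two layer groups
    clf.unfreeze()
    clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))  # everything

The `2.6 ** 4` spread is the per-layer learning-rate decay suggested in the ULM-FiT paper; lower layers get smaller updates so the pretrained features aren't wiped out.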