The post mentions not getting great results with the OpenAI Transformer. I haven't tried that, but using a similar framework, ULM-FiT, I narrowly beat the fasttext benchmark on a 250-class dataset we use internally. I will follow up with how it does on this dataset.
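For reference, the fasttext side of a comparison like this is only a few lines with the official Python bindings. A minimal sketch; the file names and hyperparameters here are hypothetical, and the input files use fasttext's usual one-"__label__<class> <text>"-line-per-document format:

```python
import fasttext

# Hypothetical files: one "__label__<subreddit> <post text>" line per document.
model = fasttext.train_supervised(input='reddit.train', epoch=10, wordNgrams=2)

# test() returns (number of examples, precision@1, recall@1).
n, p_at_1, r_at_1 = model.test('reddit.valid')
print(n, p_at_1, r_at_1)
```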
ULM-FiT and OpenAI's Transformer* are quite different. Both are pretrained language models, but ULM-FiT is a standard stack of LSTMs with a particular recipe for fine-tuning, whereas OpenAI's Transformer uses the much newer Transformer architecture with no especially fancy tricks in the actual fine-tuning. I suspect the difficulty is with the Transformer model itself - this is not the first time I've heard that it is difficult to train.
* = To be clear, this refers to OpenAI's pretrained Transformer model; the Transformer architecture itself came from work at Google ("Attention Is All You Need").
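For concreteness, the ULM-FiT recipe is: fine-tune the pretrained AWD-LSTM language model on the target corpus, then train a classifier on top with gradual unfreezing and discriminative learning rates. A minimal sketch, assuming fastai v1 and a hypothetical texts.csv of labelled posts:

```python
from fastai.text import *

path = Path('data')  # hypothetical data directory containing texts.csv

# 1. Fine-tune the pretrained AWD-LSTM language model on the target corpus.
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)
learn.save_encoder('ft_enc')

# 2. Train the classifier on top, unfreezing gradually and using
#    discriminative learning rates across layers (the ULM-FiT recipe).
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```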
This looks fantastic. In particular, the focus on many-class classification is important; it's a common real-world task that is often overlooked. I have some suggestions:
More types of baseline metric would be useful, e.g. accuracy, plus micro- and macro-F1 given the unbalanced classes.
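All three are one-liners in scikit-learn; a minimal sketch, with toy stand-in labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy stand-ins; in practice, the gold and predicted subreddit labels.
y_true = ['r/gaming', 'r/askreddit', 'r/gaming', 'r/joerogan']
y_pred = ['r/gaming', 'r/gaming', 'r/gaming', 'r/joerogan']

print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average='micro'))  # dominated by the large classes
print(f1_score(y_true, y_pred, average='macro'))  # weights every class equally
```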
It would be very useful to know the inter-annotator agreement for the manual classification, and human performance on the task of identifying the original subreddit. I'm not a huge fan of creating artificial categories when natural ones are available. In practice there will be a real difference between the 26th and 27th League of Legends subreddits; it might be some subtle topical shift, or something political or tonal.
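For two annotators, Cohen's kappa is the usual agreement statistic, and it's also a one-liner in scikit-learn; a sketch with toy stand-in labels:

```python
from sklearn.metrics import cohen_kappa_score

# Toy stand-ins; in practice, two annotators' labels over the same sample of posts.
annotator_a = ['gaming', 'sports', 'gaming', 'politics']
annotator_b = ['gaming', 'sports', 'politics', 'politics']

print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 = perfect, 0.0 = chance
```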
Is there some kind of standard measure for trading off precision and recall when classifying into a hierarchical class structure? That is, you start by predicting general high-level categories and move down to the most specific class you can reach before confidence falls below a threshold. The evaluation measure then gives you more credit for getting lower down the tree (rewarding information gain in the class hierarchy).
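There is a family of measures along exactly these lines, usually called hierarchical precision/recall (e.g. Kiritchenko et al.): each label is expanded to the set of its ancestors, so a prediction deeper in the correct branch shares more ancestors with the gold label and earns more credit. A minimal sketch, with a hypothetical parent map over the subreddit example from elsewhere in the thread:

```python
# Hypothetical class tree: maps each class to its parent (None at the root).
parent = {'r/gaming': None, 'r/finalfantasy': 'r/gaming', 'r/FFVIII': 'r/finalfantasy'}

def ancestors(label):
    """The label plus all of its ancestors in the class tree."""
    out = set()
    while label is not None:
        out.add(label)
        label = parent[label]
    return out

def hierarchical_prf(y_true, y_pred):
    tp = pred_total = true_total = 0
    for t, p in zip(y_true, y_pred):
        ts, ps = ancestors(t), ancestors(p)
        tp += len(ts & ps)       # shared ancestors = credit for partial depth
        pred_total += len(ps)
        true_total += len(ts)
    hp, hr = tp / pred_total, tp / true_total
    return hp, hr, 2 * hp * hr / (hp + hr)

# Predicting the parent of the true leaf earns partial credit:
print(hierarchical_prf(['r/FFVIII'], ['r/finalfantasy']))  # (1.0, ~0.67, 0.8)
```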
A nice comment; good to see other people are thinking about this! I agree with you about the imbalanced classes, and I do have an imbalanced copy of the data. The main issue is that this dataset was created by only looking at subreddits with 1,000 posts or more, so the class imbalance is somewhat unrealistic. If I do publish an imbalanced version it will include all subreddits, not just the carefully selected 1013.
Re: the reason for the artifice. First, note that none of the labels here are exactly superficial. I did make a taxonomy, but I only used it to filter out subreddits; I did not combine posts from different subreddits within the same category. The main reason was to combat the fact that these are otherwise not great labels: many subreddits are subsets of others, e.g. you have r/gaming -> r/finalfantasy -> r/FFVIII, and you don't know a priori that this follows a hierarchy (N.B. categorising all subreddits would require significant resources).
Worse than this, you have subreddits that don't follow any obvious kind of categorisation, e.g. r/askreddit (by far the most populous in terms of self-posts) or, more randomly, subreddits devoted to podcasts like r/joerogan. They are basically places where people go for broad, like-minded chat, and they can overlap with just about anything. I would argue that this ambiguity is itself not always realistic: for the examples I have worked on in the past, labels were reasonably unambiguous.
IME, the naive solution to hierarchical classification - building a classifier for each level of the hierarchy - gets me to ~85% accuracy, compared to ~75% accuracy using a single "global" classifier over the leaf classes.
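For concreteness, a minimal sketch of that per-level approach with two levels, using scikit-learn as a stand-in model and toy data; each post is routed through the top-level classifier and then the matching branch classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins: texts with (top-level category, leaf subreddit) label pairs.
texts = ['ff viii remaster thread', 'patch notes discussion',
         'election night thread', 'debate over ballot measures']
labels = [('gaming', 'r/FFVIII'), ('gaming', 'r/gaming'),
          ('politics', 'r/politics'), ('politics', 'r/PoliticalDiscussion')]

# Level 1: predict the top-level category.
top = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
top.fit(texts, [t for t, _ in labels])

# Level 2: one classifier per branch, trained only on that branch's posts.
leaf = {}
for branch in {t for t, _ in labels}:
    idx = [i for i, (t, _) in enumerate(labels) if t == branch]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit([texts[i] for i in idx], [labels[i][1] for i in idx])
    leaf[branch] = clf

def predict(text):
    branch = top.predict([text])[0]
    return leaf[branch].predict([text])[0]

print(predict('new ff viii screenshots'))
```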
Fun fact: Reddit uses an NLP technique similar to the one behind the t-SNE vizzes to combine the content of subreddits for building recommendations: https://www.youtube.com/watch?v=tKISLQ87GO8
I'm not too worried about this. There is a large number of publicly available datasets of reddit posts, many of them hosted on Kaggle or BigQuery (both owned by Google), which suggests to me that reddit doesn't mind this in the way that, say, Twitter does. It was also a deliberate decision not to include reddit usernames in this data, and I personally don't think this would be a great resource for trying to break someone's privacy, compared to what else is out there.
That said, if either Kaggle or reddit do have a problem with this, I won't hesitate to remove it.