Hacker News

This looks fantastic. In particular, the focus on many-class classification is important: it's a common real-world task that is often overlooked. I have some suggestions:

More baseline measures would be useful, e.g. accuracy, and micro and macro F1 with the unbalanced classes.
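For concreteness, here is a minimal pure-Python sketch of how micro and macro F1 differ on imbalanced classes (the toy labels are illustrative, not from the dataset; in practice you'd use a library implementation):

```python
# Sketch: micro vs macro F1 for single-label multiclass predictions.
from collections import Counter

def f1_scores(y_true, y_pred):
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    per_class = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class) / len(classes)
    # Micro: pool the counts first; for single-label multiclass this
    # collapses to plain accuracy.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if TP + FP + FN else 0.0
    return micro, macro

# Imbalanced toy data: class "a" dominates and the model predicts only "a".
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "a"]
micro, macro = f1_scores(y_true, y_pred)  # micro looks fine, macro is poor
```

The gap between the two numbers is exactly why reporting only one of them on an imbalanced many-class task can be misleading.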

It would be very useful to know inter-annotator agreement for the manual classification, and human performance on the task of identifying the original subreddit. I'm not a huge fan of creating artificial categories when natural ones are available. In practice there will be a real difference between the 26th and 27th League of Legends subreddits; it might be some subtle topical shift, or something political or tonal.

Is there some kind of standard measure for trading off precision and recall when classifying into a hierarchical class structure? That is, you start by predicting general high-level categories and move down to the most specific class you can reach before confidence falls below a threshold. The evaluation measure would then give you more credit for getting lower down the tree (rewarding information gain in the class hierarchy).
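One established family of measures along these lines is hierarchical precision/recall, which scores a prediction by the overlap between the ancestor sets of the predicted and true classes. A minimal sketch, with an illustrative toy tree (not the dataset's actual taxonomy):

```python
# Hierarchical precision/recall: credit a prediction by how much of the
# true label's path through the class tree it covers. Toy tree only.

PARENT = {  # child -> parent; "root" is the (uncounted) top of the tree
    "gaming": "root",
    "finalfantasy": "gaming",
    "FFVIII": "finalfantasy",
    "sports": "root",
}

def ancestors(label):
    """Labels on the path from `label` up to (but excluding) the root."""
    out = set()
    while label != "root":
        out.add(label)
        label = PARENT[label]
    return out

def hier_prf(true_label, pred_label):
    t, p = ancestors(true_label), ancestors(pred_label)
    overlap = len(t & p)
    precision = overlap / len(p)   # how much of the prediction's path is right
    recall = overlap / len(t)      # how deep into the true path we got
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Stopping at the parent ("finalfantasy") for a true "FFVIII" post keeps
# full precision but loses recall: exactly the partial credit described above.
p, r, f = hier_prf("FFVIII", "finalfantasy")
```

Under this measure, backing off to a more general (but still correct) class costs recall rather than being scored as a flat miss.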



A nice comment, good to see other people are thinking about this! I agree with you about the imbalanced classes, and I do have a copy of the data with that property. The main issue is that this dataset was created by looking only at subreddits with 1000 posts or more, so the class imbalance is somewhat unrealistic. If I do publish an imbalanced version, it will include all subreddits, not just the carefully selected 1013.

Re: the reason for the artifice. First, note that none of the labels here are exactly superficial. I did make a taxonomy, but I only used it to filter out subreddits; I did not combine posts from different subreddits in the same category. The main reason was to combat the fact that these are otherwise not great labels: many subreddits are subsets of others, e.g. r/gaming -> r/finalfantasy -> r/FFVIII, and you don't know a priori that they follow a hierarchy (N.B. categorising all subreddits would require significant resources).

Worse than this, you have subreddits that don't really follow any obvious categorisation, e.g. r/askreddit (by far the most populous in terms of self-posts) or, more randomly, subreddits devoted to podcasts like r/joerogan. They are basically places where people go for broad, like-minded chat, and they can overlap with just about anything. I would argue this kind of ambiguity is actually not always realistic: for the examples I have worked on in the past, labels were reasonably unambiguous.


IME, the naive solution to hierarchical classification - building a classifier for each level of the hierarchy - gets me to ~85% accuracy, compared to ~75% for a flat "global" classifier.
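The per-level approach can be sketched as a simple cascade: one model routes an input to a coarse category, then a category-specific model picks the leaf. The keyword "classifiers" below are stand-ins for real trained models, and all names are illustrative:

```python
# Sketch of per-level hierarchical classification as a two-stage cascade.

def top_level(text):
    # Level 1: coarse category (stand-in for a trained model).
    return "gaming" if "game" in text.lower() else "sports"

LEAF_CLASSIFIERS = {
    # Level 2: one model per category, trained only on that category's slice.
    "gaming": lambda t: "r/finalfantasy" if "fantasy" in t.lower() else "r/gaming",
    "sports": lambda t: "r/soccer" if "goal" in t.lower() else "r/sports",
}

def classify(text):
    category = top_level(text)          # route to a subtree first...
    return LEAF_CLASSIFIERS[category](text)  # ...then pick a leaf within it

label = classify("Which Final Fantasy game should I start with?")
```

One caveat with this design is that errors compound down the cascade: a wrong level-1 routing makes every level-2 prediction in that branch wrong, which is part of why the flat baseline stays competitive.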



