Hacker News

This looks fantastic. In particular, the focus on many-class classification is important: it's a common real-world task that is often overlooked. I have some suggestions:

More baseline measures would be useful, e.g. accuracy, and micro and macro F1 with the unbalanced classes.
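For concreteness, here is a minimal pure-Python sketch of how micro and macro F1 differ on imbalanced classes (the toy labels are illustrative, not from the dataset; in practice you'd use a library implementation):

```python
# Sketch: micro vs macro F1 for single-label multiclass predictions.
from collections import Counter

def f1_scores(y_true, y_pred):
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    per_class = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class) / len(classes)
    # Micro: pool the counts first; for single-label multiclass this
    # collapses to plain accuracy.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if TP + FP + FN else 0.0
    return micro, macro

# Imbalanced toy data: class "a" dominates and the model predicts only "a".
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "a"]
micro, macro = f1_scores(y_true, y_pred)  # micro looks fine, macro is poor
```

The gap between the two numbers is exactly why reporting only one of them on an imbalanced many-class task can be misleading.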

It would be very useful to know inter-annotator agreement for the manual classification, and human performance on the task of identifying the original subreddit. I'm not a huge fan of creating artificial categories when natural ones are available. In practice there will be a real difference between the 26th and 27th League of Legends subreddits; it might be some subtle topical shift, or something political or tonal.

Is there some kind of standard measure for trading off precision and recall when classifying into a hierarchical class structure? That is, you start by predicting general high-level categories and move down to the most specific class you can reach before confidence falls below a threshold. The evaluation measure would then give you more credit for getting lower down the tree (rewarding information gain in the class hierarchy).
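One established family of measures along these lines is hierarchical precision/recall, which scores a prediction by the overlap between the ancestor sets of the predicted and true classes. A minimal sketch, with an illustrative toy tree (not the dataset's actual taxonomy):

```python
# Hierarchical precision/recall: credit a prediction by how much of the
# true label's path through the class tree it covers. Toy tree only.

PARENT = {  # child -> parent; "root" is the (uncounted) top of the tree
    "gaming": "root",
    "finalfantasy": "gaming",
    "FFVIII": "finalfantasy",
    "sports": "root",
}

def ancestors(label):
    """Labels on the path from `label` up to (but excluding) the root."""
    out = set()
    while label != "root":
        out.add(label)
        label = PARENT[label]
    return out

def hier_prf(true_label, pred_label):
    t, p = ancestors(true_label), ancestors(pred_label)
    overlap = len(t & p)
    precision = overlap / len(p)   # how much of the prediction's path is right
    recall = overlap / len(t)      # how deep into the true path we got
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Stopping at the parent ("finalfantasy") for a true "FFVIII" post keeps
# full precision but loses recall: exactly the partial credit described above.
p, r, f = hier_prf("FFVIII", "finalfantasy")
```

Under this measure, backing off to a more general (but still correct) class costs recall rather than being scored as a flat miss.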



A nice comment, good to see other people are thinking about this! I agree with you about the imbalanced classes, and I do have a copy of the data with that property. The main issue is that this dataset was created by looking only at subreddits with 1000 posts or more, so the class imbalance is somewhat unrealistic. If I do publish an imbalanced version, it will include all subreddits, not just the carefully selected 1013.

Re: the reason for the artifice. First, note that none of the labels here are exactly superficial. I did make a taxonomy, but I only used it to filter out subreddits; I did not combine posts from different subreddits in the same category. The main reason was to combat the fact that these are otherwise not great labels: many subreddits are subsets of others, e.g. r/gaming -> r/finalfantasy -> r/FFVIII, and you don't know a priori that they follow a hierarchy (N.B. categorising all subreddits would require significant resources).

Worse than this, you have subreddits that don't really follow any obvious categorisation, e.g. r/askreddit (by far the most populous in terms of self-posts) or, more randomly, subreddits devoted to podcasts like r/joerogan. They are basically places where people go for broad, like-minded chat, and they can overlap with just about anything. I would argue this kind of ambiguity is actually not always realistic: for the examples I have worked on in the past, labels were reasonably unambiguous.


IME, the naive solution to hierarchical classification - building a classifier for each level of the hierarchy - gets me to ~85% accuracy, compared to ~75% for a flat "global" classifier.
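The per-level approach can be sketched as a simple cascade: one model routes an input to a coarse category, then a category-specific model picks the leaf. The keyword "classifiers" below are stand-ins for real trained models, and all names are illustrative:

```python
# Sketch of per-level hierarchical classification as a two-stage cascade.

def top_level(text):
    # Level 1: coarse category (stand-in for a trained model).
    return "gaming" if "game" in text.lower() else "sports"

LEAF_CLASSIFIERS = {
    # Level 2: one model per category, trained only on that category's slice.
    "gaming": lambda t: "r/finalfantasy" if "fantasy" in t.lower() else "r/gaming",
    "sports": lambda t: "r/soccer" if "goal" in t.lower() else "r/sports",
}

def classify(text):
    category = top_level(text)          # route to a subtree first...
    return LEAF_CLASSIFIERS[category](text)  # ...then pick a leaf within it

label = classify("Which Final Fantasy game should I start with?")
```

One caveat with this design is that errors compound down the cascade: a wrong level-1 routing makes every level-2 prediction in that branch wrong, which is part of why the flat baseline stays competitive.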



