The post mentions not getting great results with the OpenAI Transformer. I haven't tried that, but using a similar framework, ULM-FiT, I narrowly beat the fasttext benchmark on a 250-class dataset we use internally. I will follow up with how it does on this dataset.
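For reference, the fasttext side of a comparison like this is only a few lines with the official Python bindings. A minimal sketch; the file names and hyperparameters here are hypothetical, and the input files use fasttext's usual one-"__label__<class> <text>"-line-per-document format:

```python
import fasttext

# Hypothetical files: one "__label__<subreddit> <post text>" line per document.
model = fasttext.train_supervised(input='reddit.train', epoch=10, wordNgrams=2)

# test() returns (number of examples, precision@1, recall@1).
n, p_at_1, r_at_1 = model.test('reddit.valid')
print(n, p_at_1, r_at_1)
```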
ULM-FiT and OpenAI's Transformer* are quite different. Both are pretrained language models, but ULM-FiT is a standard stack of LSTMs with a particular recipe for fine-tuning, whereas OpenAI's Transformer uses the much newer Transformer architecture with no especially fancy tricks in the actual fine-tuning. I suspect the difficulty is with the Transformer model itself - this is not the first time I've heard that it is difficult to train.
* = To be clear, this refers to OpenAI's pretrained Transformer model; the Transformer architecture itself came from work at Google ("Attention Is All You Need").
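For concreteness, the ULM-FiT recipe is: fine-tune the pretrained AWD-LSTM language model on the target corpus, then train a classifier on top with gradual unfreezing and discriminative learning rates. A minimal sketch, assuming fastai v1 and a hypothetical texts.csv of labelled posts:

```python
from fastai.text import *

path = Path('data')  # hypothetical data directory containing texts.csv

# 1. Fine-tune the pretrained AWD-LSTM language model on the target corpus.
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)
learn.save_encoder('ft_enc')

# 2. Train the classifier on top, unfreezing gradually and using
#    discriminative learning rates across layers (the ULM-FiT recipe).
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```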
This looks fantastic. In particular, the focus on many-class classification is important; it's a common real-world task that is often overlooked. I have some suggestions:
More types of baseline metric would be useful, e.g. accuracy, plus micro- and macro-F1 given the unbalanced classes.
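All three are one-liners in scikit-learn; a minimal sketch, with toy stand-in labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy stand-ins; in practice, the gold and predicted subreddit labels.
y_true = ['r/gaming', 'r/askreddit', 'r/gaming', 'r/joerogan']
y_pred = ['r/gaming', 'r/gaming', 'r/gaming', 'r/joerogan']

print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average='micro'))  # dominated by the large classes
print(f1_score(y_true, y_pred, average='macro'))  # weights every class equally
```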
It would be very useful to know the inter-annotator agreement for the manual classification, and human performance on the task of identifying the original subreddit. I'm not a huge fan of creating artificial categories when natural ones are available. In practice there will be a real difference between the 26th and 27th League of Legends subreddits; it might be some subtle topical shift, or something political or tonal.
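For two annotators, Cohen's kappa is the usual agreement statistic, and it's also a one-liner in scikit-learn; a sketch with toy stand-in labels:

```python
from sklearn.metrics import cohen_kappa_score

# Toy stand-ins; in practice, two annotators' labels over the same sample of posts.
annotator_a = ['gaming', 'sports', 'gaming', 'politics']
annotator_b = ['gaming', 'sports', 'politics', 'politics']

print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 = perfect, 0.0 = chance
```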
Is there some kind of standard measure for trading off precision and recall when classifying into a hierarchical class structure? That is, you start by predicting general high-level categories and move down to the most specific class you can reach before confidence falls below a threshold. The evaluation measure then gives you more credit for getting lower down the tree (rewarding information gain in the class hierarchy).
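There is a family of measures along exactly these lines, usually called hierarchical precision/recall (e.g. Kiritchenko et al.): each label is expanded to the set of its ancestors, so a prediction deeper in the correct branch shares more ancestors with the gold label and earns more credit. A minimal sketch, with a hypothetical parent map over the subreddit example from elsewhere in the thread:

```python
# Hypothetical class tree: maps each class to its parent (None at the root).
parent = {'r/gaming': None, 'r/finalfantasy': 'r/gaming', 'r/FFVIII': 'r/finalfantasy'}

def ancestors(label):
    """The label plus all of its ancestors in the class tree."""
    out = set()
    while label is not None:
        out.add(label)
        label = parent[label]
    return out

def hierarchical_prf(y_true, y_pred):
    tp = pred_total = true_total = 0
    for t, p in zip(y_true, y_pred):
        ts, ps = ancestors(t), ancestors(p)
        tp += len(ts & ps)       # shared ancestors = credit for partial depth
        pred_total += len(ps)
        true_total += len(ts)
    hp, hr = tp / pred_total, tp / true_total
    return hp, hr, 2 * hp * hr / (hp + hr)

# Predicting the parent of the true leaf earns partial credit:
print(hierarchical_prf(['r/FFVIII'], ['r/finalfantasy']))  # (1.0, ~0.67, 0.8)
```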
A nice comment; good to see other people are thinking about this! I agree with you about the imbalanced classes, and I do have an imbalanced copy of the data. The main issue is that this dataset was created by only looking at subreddits with 1,000 posts or more, so the class imbalance is somewhat unrealistic. If I do publish an imbalanced version it will include all subreddits, not just the carefully selected 1013.
Re: the reason for the artifice. First, note that none of the labels here are exactly superficial. I did make a taxonomy, but I only used it to filter out subreddits; I did not combine posts from different subreddits within the same category. The main reason was to combat the fact that these are otherwise not great labels: many subreddits are subsets of others, e.g. you have r/gaming -> r/finalfantasy -> r/FFVIII, and you don't know a priori that this follows a hierarchy (N.B. categorising all subreddits would require significant resources).
Worse than this, you have subreddits that don't follow any obvious kind of categorisation, e.g. r/askreddit (by far the most populous in terms of self-posts) or, more randomly, subreddits devoted to podcasts like r/joerogan. They are basically places where people go for broad, like-minded chat, and they can overlap with just about anything. I would argue that this ambiguity is itself not always realistic: for the examples I have worked on in the past, labels were reasonably unambiguous.
IME, the naive solution to hierarchical classification - building a classifier for each level of the hierarchy - gets me to ~85% accuracy, compared to ~75% accuracy using a single "global" classifier over the leaf classes.
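For concreteness, a minimal sketch of that per-level approach with two levels, using scikit-learn as a stand-in model and toy data; each post is routed through the top-level classifier and then the matching branch classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins: texts with (top-level category, leaf subreddit) label pairs.
texts = ['ff viii remaster thread', 'patch notes discussion',
         'election night thread', 'debate over ballot measures']
labels = [('gaming', 'r/FFVIII'), ('gaming', 'r/gaming'),
          ('politics', 'r/politics'), ('politics', 'r/PoliticalDiscussion')]

# Level 1: predict the top-level category.
top = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
top.fit(texts, [t for t, _ in labels])

# Level 2: one classifier per branch, trained only on that branch's posts.
leaf = {}
for branch in {t for t, _ in labels}:
    idx = [i for i, (t, _) in enumerate(labels) if t == branch]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit([texts[i] for i in idx], [labels[i][1] for i in idx])
    leaf[branch] = clf

def predict(text):
    branch = top.predict([text])[0]
    return leaf[branch].predict([text])[0]

print(predict('new ff viii screenshots'))
```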
Fun fact: Reddit uses an NLP technique similar to the one behind the t-SNE vizzes to combine the content of subreddits for building recommendations: https://www.youtube.com/watch?v=tKISLQ87GO8
I'm not too worried about this. There is a large number of publicly available datasets of reddit posts, many of them hosted on Kaggle or BigQuery (both owned by Google), which suggests to me that reddit doesn't mind this in the way that, say, Twitter does. It was also a deliberate decision not to include reddit usernames in this data, and I personally don't think this would be a great resource for trying to break someone's privacy, compared to what else is out there.
That said, if either Kaggle or reddit do have a problem with this, I won't hesitate to remove it.