This post - https://news.ycombinator.com/item?id=28551183 - suggests it's a simple set of heuristics: it looks at things like JavaScript, link/SEO spam, language, and amount of text content, filters out unwanted results, and only indexes the wanted ones.
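
To make that concrete, here's a rough sketch of what such a filter could look like. This is not the actual implementation described in the linked post; the function names, the stopword list, and every threshold are hypothetical values I picked for illustration.

```python
import re

# Hypothetical thresholds, chosen only for illustration (not from the linked post).
MIN_WORDS = 50                 # pages with less visible text are skipped
MAX_SCRIPT_TAGS = 3            # pages with more <script> tags are skipped
MAX_LINKS_PER_100_WORDS = 10   # higher link density is treated as link/SEO spam
MIN_STOPWORD_RATIO = 0.1       # crude "is this English?" check

STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}

SCRIPT_BLOCK_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
SCRIPT_OPEN_RE = re.compile(r"<script\b", re.IGNORECASE)
LINK_RE = re.compile(r"<a\s[^>]*href", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def visible_words(html):
    """Drop script blocks and tags, then split the remaining text into words."""
    text = SCRIPT_BLOCK_RE.sub(" ", html)
    text = TAG_RE.sub(" ", text)
    return text.split()


def looks_english(words):
    """Very rough language guess: share of common English stopwords."""
    if not words:
        return False
    hits = sum(1 for w in words if w.strip(".,;:!?\"'()").lower() in STOPWORDS)
    return hits / len(words) >= MIN_STOPWORD_RATIO


def should_index(html):
    """Return True only if the page passes every heuristic."""
    words = visible_words(html)
    if len(words) < MIN_WORDS:
        return False  # too little text content
    if len(SCRIPT_OPEN_RE.findall(html)) > MAX_SCRIPT_TAGS:
        return False  # too much JavaScript
    links = len(LINK_RE.findall(html))
    if links * 100 > MAX_LINKS_PER_100_WORDS * len(words):
        return False  # link density looks like spam
    if not looks_english(words):
        return False  # wrong language
    return True
```

A real crawler would use a proper HTML parser and a better language detector, but the shape is the same: a handful of cheap checks, and a page only gets indexed if it passes all of them.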