Why don't you use the credits to try to create a publicly queryable index of the web in a standardised format? Read: an open source search engine. Since you've got the money, just ignore efficiency.

Else... commit part of it to one of the computing @ home projects?



If you're interested in a publicly queryable index of the web, you could try running a search server such as ElasticSearch over the Common Crawl[1] corpus. ElasticSearch already powers WordPress's search backend, covering 600 million+ documents in total[2], so extending it to a Common Crawl archive seems feasible.
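
Roughly, the plumbing looks something like this - a sketch, assuming the warcio and elasticsearch Python packages; the WARC path is a placeholder, and the real paths are listed in each crawl's warc.paths.gz manifest:

    # Sketch: stream one Common Crawl WARC file and index its responses
    # into a local Elasticsearch node.
    import requests
    from warcio.archiveiterator import ArchiveIterator
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    warc_url = ("https://data.commoncrawl.org/crawl-data/"
                "CC-MAIN-2014-10/segments/.../example.warc.gz")  # placeholder path

    resp = requests.get(warc_url, stream=True)
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            es.index(index="commoncrawl", document={
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "body": record.content_stream().read().decode("utf-8", "replace"),
            })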

n.b. I'm a data scientist at Common Crawl, so have a vested interest!

Also, whatever experiment you end up pursuing, remember to use spot instances if your setup can tolerate transient nodes - they'll substantially decrease your burn rate (often around 1/10th the on-demand price), allowing for even larger and more insane experiments :)
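
For reference, requesting one via boto3 looks roughly like this - a sketch, where the bid price, AMI, and instance type are all placeholders:

    # Sketch: request a one-time spot instance instead of an on-demand node.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.request_spot_instances(
        SpotPrice="0.10",        # max bid; spot often runs ~1/10th of on-demand
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-12345678",    # placeholder AMI
            "InstanceType": "m3.xlarge",  # placeholder instance type
        },
    )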

[1]: http://commoncrawl.org/

[2]: http://gibrown.com/2014/01/09/scaling-elasticsearch-part-1-o...


I had a crawling project where I wanted to measure a few ad-related things on the internet, and I came upon Common Crawl. I was initially excited, since I thought it would have incidentally captured the data I wanted, but I was disappointed to find that it doesn't do any kind of JS execution, which drastically limited its usefulness for me.
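
To capture that, you'd have to render each page in a real browser first - roughly along these lines (a sketch using Selenium, which is just one option; the URL is a placeholder):

    # Sketch: fetch a page with its JavaScript executed, unlike a raw HTTP fetch.
    from selenium import webdriver

    driver = webdriver.Firefox()       # any WebDriver-backed browser works
    driver.get("https://example.com")  # placeholder URL
    html = driver.page_source          # DOM after scripts have run
    driver.quit()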


I'd never heard of Common Crawl before but it looks like an awesome project! Keep up the good work!


How up-to-date is Common Crawl's data?



