Why don't you use the credits to try to create a publicly queryable index of the web in a standardised format? Read: an open source search engine. Since you've got the money, just ignore efficiency.

Else... commit part of it to one of the computing @ home projects?



If you're interested in a publicly queryable index of the web, you could try running a search server such as ElasticSearch over the Common Crawl[1] corpus. ElasticSearch already powers WordPress's search backend, covering 600 million+ documents in total[2], so extending it to a Common Crawl archive seems feasible.
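
Roughly, the plumbing looks something like this - a sketch, assuming the warcio and elasticsearch Python packages; the WARC path is a placeholder, and the real paths are listed in each crawl's warc.paths.gz manifest:

    # Sketch: stream one Common Crawl WARC file and index its responses
    # into a local Elasticsearch node.
    import requests
    from warcio.archiveiterator import ArchiveIterator
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    warc_url = ("https://data.commoncrawl.org/crawl-data/"
                "CC-MAIN-2014-10/segments/.../example.warc.gz")  # placeholder path

    resp = requests.get(warc_url, stream=True)
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            es.index(index="commoncrawl", document={
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "body": record.content_stream().read().decode("utf-8", "replace"),
            })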

n.b. I'm a data scientist at Common Crawl, so have a vested interest!

Also, whatever experiment you end up pursuing, remember to use spot instances if your setup can tolerate transient nodes - they'll substantially decrease your burn rate (often around 1/10th the on-demand price), allowing for even larger and more insane experiments :)
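
For reference, requesting one via boto3 looks roughly like this - a sketch, where the bid price, AMI, and instance type are all placeholders:

    # Sketch: request a one-time spot instance instead of an on-demand node.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.request_spot_instances(
        SpotPrice="0.10",        # max bid; spot often runs ~1/10th of on-demand
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-12345678",    # placeholder AMI
            "InstanceType": "m3.xlarge",  # placeholder instance type
        },
    )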

[1]: http://commoncrawl.org/

[2]: http://gibrown.com/2014/01/09/scaling-elasticsearch-part-1-o...


I had a crawling project where I wanted to measure a few ad-related things on the internet, and I came upon Common Crawl. I was initially excited, since I thought it would have incidentally captured the data I wanted, but I was disappointed to find that it doesn't do any kind of JS execution, which drastically limited its usefulness for me.
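
To capture that, you'd have to render each page in a real browser first - roughly along these lines (a sketch using Selenium, which is just one option; the URL is a placeholder):

    # Sketch: fetch a page with its JavaScript executed, unlike a raw HTTP fetch.
    from selenium import webdriver

    driver = webdriver.Firefox()       # any WebDriver-backed browser works
    driver.get("https://example.com")  # placeholder URL
    html = driver.page_source          # DOM after scripts have run
    driver.quit()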


I'd never heard of Common Crawl before but it looks like an awesome project! Keep up the good work!


How up-to-date is Common Crawl's data?



