We're aiming to do monthly crawls from this point on. The main holdup was automating the labor-intensive manual steps of our crawl process. Now we have scripts that make running our 100-node EC2 cluster and processing the terabytes of web data relatively trivial.
If anyone wants to discuss sourcing well-distributed crawl lists for billions of pages per month, we'd love to chat. We want to make sure we cover a diverse range of languages and domains. Given that we're trying to get a representative sample of the web, that's a difficult proposition!