You are setting up a straw man by saying "at it's simplest an index is a list of what words are in what documents". That kind of index (an inverted word index) could be generated from the ideal data format, but it would be stupid to store such a rich data set in such a dumb, information-losing format.
"The index" would just be a list of key/values. The key would be a URL, and the value would be the content located at that URL. There would also be some kind of metadata attached to the keys to indicate HTTP status code, HTTP header information, last-crawled date, and any other interesting data. From this data set, other, more appropriate indexes could be generated(for example via hadoop)
> In the end, what the author really wants is for someone to maintain a separate copy of the internet for bots.
Yes, but not for bots. It would be for algorithms.
> In order for someone to do that, they'd need to charge the bot owners
Probably. The funding model could be like ICANN's, where long-term funding comes from the groups that benefit from its services.
> but the bot owners could just index your content for free, so why would they pay?
How would you create a copy of the internet for free? Are you just going to run your crawler on your home modem while you're at work? Where are you going to store all that data? How are you going to process it? How long is that going to take? Wouldn't it be easier to (for example) mount a shared EBS volume in EC2 that has the latest, most up-to-date crawl of the internet available for your processing?
> "The index" would just be a list of key/values. The key would be a URL, and the value would be the content located at that URL.
They have that index already. It's called "the internet". If you're storing the entire page content, that's not really much of an "index".
> it would be stupid to store such a rich data set in such a dumb, information-losing format.
Discarding information is the entire purpose of indexing and doing things like map-reduce (emphasis on the reduce!): you discard worthless information so that the stuff you care about is of a manageable size.
The reason Google serves up search results quickly is because it carefully controls the data it has to wade through. Adding more data doesn't make it smarter; it makes it slower.
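A toy sketch of what I mean (nothing Google-specific, just an illustration): building an inverted index throws away the page content entirely and keeps only which documents each term appears in.

```python
# Toy inverted index: the full page text is discarded and only the
# mapping from term -> documents containing it is kept. That loss is the point.
def build_inverted_index(pages):
    """pages: dict mapping URL -> full page text."""
    index = {}
    for url, text in pages.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(url)
    return index

pages = {
    "http://a.example/": "the quick brown fox",
    "http://b.example/": "the lazy dog",
}
print(build_inverted_index(pages)["the"])  # both URLs, none of the surrounding text
```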
> They have that index already. It's called "the internet". If you're storing the entire page content, that's not really much of an "index".
Fine. Let's call it an archive, or cache, or snapshot then. I only used that phrase because it was thrown around in the post I was responding to, and I specifically put that in quotes because I wasn't sure it was the right phrase to use.
> Discarding information is the entire purpose of indexing and doing things like map-reduce (emphasis on the reduce!)
No. Just no. An index is a data structure designed to quickly look up its elements by some key. It implies nothing about what your data (neither the keys nor the elements) looks like.
Are you serious about "emphasis on reduce!"? "Reduce" is not named reduce because it inherently reduces the amount of data you are working with. It is called "reduce" because 1) an engineer at Google liked the sound of it, and 2) it takes a list of intermediate values that all share the same key. It is quite easy, and common, to have reduce functions which end up spitting out MORE data than you started with.
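To make that concrete, here is a contrived sketch (plain Python standing in for a reducer, not a real Hadoop API) where the reduce step emits more records than it receives:

```python
# Contrived reducer sketch: for each (domain, [urls]) group it emits every
# ordered pair of distinct URLs on that domain, so the output grows roughly
# as n^2; the "reduce" step produces MORE records than it was given.
def reduce_url_pairs(domain, urls):
    for a in urls:
        for b in urls:
            if a != b:
                yield (domain, (a, b))

pairs = list(reduce_url_pairs("example.com", ["/home", "/about", "/contact"]))
print(len(pairs))  # 6 output pairs from 3 input URLs
```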
> you discard worthless information so that the stuff you care about is of a manageable size.
Sometimes, but that doesn't work in this situation. How do you know what data is worthless before you know what algorithms will be applied to the data? The only acceptable solution is to keep a plaintext copy of the data retrieved from a particular URL.
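As a sketch of what that enables (using the hypothetical record layout I described earlier, not a real API): once you actually know the algorithm, you derive whatever purpose-built index it needs from the raw copy.

```python
# Sketch: derive a purpose-built index (here, URLs grouped by HTTP status)
# from the raw archive only once you know what the algorithm needs.
# The record layout is the hypothetical one described earlier, not a real API.
def index_by_status(archive):
    """archive: iterable of crawl records (dicts with 'key' and 'metadata')."""
    by_status = {}
    for record in archive:
        status = record["metadata"]["http_status"]
        by_status.setdefault(status, []).append(record["key"])
    return by_status
```

Tomorrow's algorithm might need a completely different index, and you can only build that one if the raw copy is still around.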
"The index" would just be a list of key/values. The key would be a URL, and the value would be the content located at that URL. There would also be some kind of metadata attached to the keys to indicate HTTP status code, HTTP header information, last-crawled date, and any other interesting data. From this data set, other, more appropriate indexes could be generated(for example via hadoop)
> In the end, what the author really wants is for someone to maintain a separate copy of the internet for bots.
Yes, but not for bots. It would be for algorithms.
> In order for someone to do that, they'd need to charge the bot owners
Probably.. The funding could be like ICANN, whose long-term funding comes from groups that benefit from its services.
> but the bot owners could just index your content for free, so why would they pay?
How would you create a copy of the internet for free? Are you just going to run your crawler on your home modem while you're at work? Where are you going to store all that data? How are you going to process it? How long is that going to take? Wouldn't it just be easier to(for example) mount a shared EBS volume in EC2 that has the latest, most up to date crawled internet available for your processing?