You are setting up a straw man by saying "at it's simplest an index is a list of what words are in what documents". That kind of index (an inverted word index) could be generated from the ideal data format, but it would be stupid to store such a rich data set in such a dumb, information-losing format.
"The index" would just be a list of key/values. The key would be a URL, and the value would be the content located at that URL. There would also be some kind of metadata attached to the keys to indicate HTTP status code, HTTP header information, last-crawled date, and any other interesting data. From this data set, other, more appropriate indexes could be generated(for example via hadoop)
> In the end, what the author really wants is for someone to maintain a separate copy of the internet for bots.
Yes, but not for bots. It would be for algorithms.
> In order for someone to do that, they'd need to charge the bot owners
Probably. The funding model could be like ICANN's, where long-term funding comes from the groups that benefit from its services.
> but the bot owners could just index your content for free, so why would they pay?
How would you create a copy of the internet for free? Are you just going to run your crawler on your home modem while you're at work? Where are you going to store all that data? How are you going to process it? How long is that going to take? Wouldn't it be easier to (for example) mount a shared EBS volume in EC2 that has the latest, most up-to-date crawl of the internet available for your processing?
> "The index" would just be a list of key/values. The key would be a URL, and the value would be the content located at that URL.
They have that index already. It's called "the internet". If you're storing the entire page content, that's not really much of an "index".
> it would be stupid to store such a rich data set in such a dumb, information-losing format.
Discarding information is the entire purpose of indexing and doing things like map-reduce (emphasis on the reduce!): you discard worthless information so that the stuff you care about is of a manageable size.
The reason Google serves up search results quickly is because it carefully controls the data it has to wade through. Adding more data doesn't make it smarter; it makes it slower.
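A toy sketch of what I mean (nothing Google-specific, just an illustration): building an inverted index throws away the page content entirely and keeps only which documents each term appears in.

```python
# Toy inverted index: the full page text is discarded and only the
# mapping from term -> documents containing it is kept. That loss is the point.
def build_inverted_index(pages):
    """pages: dict mapping URL -> full page text."""
    index = {}
    for url, text in pages.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(url)
    return index

pages = {
    "http://a.example/": "the quick brown fox",
    "http://b.example/": "the lazy dog",
}
print(build_inverted_index(pages)["the"])  # both URLs, none of the surrounding text
```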
> They have that index already. It's called "the internet". If you're storing the entire page content, that's not really much of an "index".
Fine. Let's call it an archive, or cache, or snapshot then. I only used that phrase because it was thrown around in the post I was responding to, and I specifically put that in quotes because I wasn't sure it was the right phrase to use.
> Discarding information is the entire purpose of indexing and doing things like map-reduce (emphasis on the reduce!)
No. Just no. An index is a data structure designed to quickly look up its elements by some key. It implies nothing about what your data (neither the keys nor the elements) looks like.
Are you serious about "emphasis on reduce!"? "Reduce" is not named reduce because it inherently reduces the amount of data you are working with. It is called "reduce" because 1) an engineer at Google liked the sound of it, and 2) it takes a list of intermediate values that all share the same key. It is quite easy, and common, to have reduce functions which end up spitting out MORE data than you started with.
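To make that concrete, here is a contrived sketch (plain Python standing in for a reducer, not a real Hadoop API) where the reduce step emits more records than it receives:

```python
# Contrived reducer sketch: for each (domain, [urls]) group it emits every
# ordered pair of distinct URLs on that domain, so the output grows roughly
# as n^2; the "reduce" step produces MORE records than it was given.
def reduce_url_pairs(domain, urls):
    for a in urls:
        for b in urls:
            if a != b:
                yield (domain, (a, b))

pairs = list(reduce_url_pairs("example.com", ["/home", "/about", "/contact"]))
print(len(pairs))  # 6 output pairs from 3 input URLs
```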
> you discard worthless information so that the stuff you care about is of a manageable size.
Sometimes, but that doesn't work in this situation. How do you know what data is worthless before you know what algorithms will be applied to the data? The only acceptable solution is to keep a plaintext copy of the data retrieved from a particular URL.
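As a sketch of what that enables (using the hypothetical record layout I described earlier, not a real API): once you actually know the algorithm, you derive whatever purpose-built index it needs from the raw copy.

```python
# Sketch: derive a purpose-built index (here, URLs grouped by HTTP status)
# from the raw archive only once you know what the algorithm needs.
# The record layout is the hypothetical one described earlier, not a real API.
def index_by_status(archive):
    """archive: iterable of crawl records (dicts with 'key' and 'metadata')."""
    by_status = {}
    for record in archive:
        status = record["metadata"]["http_status"]
        by_status.setdefault(status, []).append(record["key"])
    return by_status
```

Tomorrow's algorithm might need a completely different index, and you can only build that one if the raw copy is still around.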
"The index" would just be a list of key/values. The key would be a URL, and the value would be the content located at that URL. There would also be some kind of metadata attached to the keys to indicate HTTP status code, HTTP header information, last-crawled date, and any other interesting data. From this data set, other, more appropriate indexes could be generated(for example via hadoop)
> In the end, what the author really wants is for someone to maintain a separate copy of the internet for bots.
Yes, but not for bots. It would be for algorithms.
> In order for someone to do that, they'd need to charge the bot owners
Probably.. The funding could be like ICANN, whose long-term funding comes from groups that benefit from its services.
> but the bot owners could just index your content for free, so why would they pay?
How would you create a copy of the internet for free? Are you just going to run your crawler on your home modem while you're at work? Where are you going to store all that data? How are you going to process it? How long is that going to take? Wouldn't it just be easier to(for example) mount a shared EBS volume in EC2 that has the latest, most up to date crawled internet available for your processing?