Did you really crawl Google? That must have been a long time ago. But speaking of searching Google as a user:
Google's Advanced Search used to be a great tool, until around 2007/08. For some reason it never received an upgrade, and several things are broken, no longer work, or were removed (e.g. '+', which is now a keyword for Google+; '"' no longer means the same thing; some filetypes are blocked, and some show only a few results).
So you mean like "term1" "term2" -"term3" -"term4"? Or, if I wanted to do this without returning results from Hacker News, "term1" ... -"term4" -site:news.ycombinator.com?
The problem is "whatever tiny 'power user' features that google had... don't seem to work at all now."
I think I know what they were talking about. A lot of the time it appears that adding advanced terms to a query changes the estimated number of results, yet all the top hits stay exactly the same. Also, punctuation seems to be largely ignored: searching for "etc apt sources list" and for "/etc/apt/sources.list" gives me exactly the same results. Putting the filename in quotes also gives the same results as before.
Searching for specific error messages with more than a few key words or a filename is usually a nightmare.
Is there any truth to my suspicion that the web of hyperlinks (on which the famed algorithm relied) is significantly weaker and reaches fewer corners these days?
Certainly feels like content is migrating to the walled gardens and there are fewer and fewer personal websites injecting edges into the open graph.
Last November I speculated why Google would let HTTP/2 get standardized without specifying the use of SRV records:
“This is going to bite them big time in the end, because Google got large by indexing the Geocities-style web, where everybody did have their own web page on a very distributed set of web hosts. What Google is doing is only contributing to the centralization of the Web, the conversion of the Web into Facebook, which will, in turn, kill Google, since they then will have nothing to index.
They sort of saw this coming, but their idea of a fix was Google+ – trying to make sure that they were the ones on top. I think they are still hoping for this, which is why they won’t allow a decentralized web by using SRV records in HTTP/2.”
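To make the SRV point concrete, here's a toy sketch of what client-side discovery could have looked like if HTTP/2 had specified it. It assumes the dnspython package, and the "_https._tcp" label is purely illustrative (HTTP/2 never standardized one); the idea is just that the domain in the URL no longer has to be the machine serving the content.

    # Toy sketch, assuming dnspython (pip install dnspython).
    import dns.resolver

    def discover_http_endpoints(domain):
        # "_https._tcp" follows the usual SRV naming style; this exact
        # label is an illustration, not anything HTTP/2 defines.
        answers = dns.resolver.resolve(f"_https._tcp.{domain}", "SRV")
        # Lowest priority wins; weight spreads load among equal priorities.
        for rr in sorted(answers, key=lambda r: (r.priority, -r.weight)):
            yield str(rr.target).rstrip("."), rr.port

    # A tiny personal site could point its SRV record at a shared host
    # and a non-standard port instead of needing its own IP address:
    for host, port in discover_http_endpoints("example.com"):
        print(host, port)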
It feels the same way to me. To a large percentage of users the internet is Facebook rather than the largest compendium of human knowledge in existence, but luckily for those of us who use it for the latter reason, the value of such a thing will always be evident.
Come to think of it, we've moved offices three times since then, so it must have been 8-10 years ago. I don't think I had to do any special trickery; I spent only an afternoon or so writing and testing the code. I didn't realize such a thing would be impossible now - what a shame. I downloaded several gigabytes iirc - a large amount at the time.
Though nowadays you could use Common Crawl to get the dataset and use existing tools to extract such files, right? (I've no idea if that's a practical thing to do or not.)
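Something along these lines might work, assuming the warcio library and one locally downloaded Common Crawl WARC segment; the filename and the extension filter are just placeholders for whatever files you're after.

    # Rough sketch: scan one Common Crawl WARC segment for URLs ending in
    # a given extension (assumes: pip install warcio, and a downloaded
    # segment whose name here is a placeholder).
    from warcio.archiveiterator import ArchiveIterator

    WANTED_EXT = ".sqlite"                  # hypothetical file type
    WARC_PATH = "CC-MAIN-example.warc.gz"   # placeholder segment name

    with open(WARC_PATH, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            if url.lower().endswith(WANTED_EXT):
                payload = record.content_stream().read()
                print(url, len(payload))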
I guess so, if they "look" at the web the same way Google does (respecting robots.txt, nofollow etc - which Wikipedia says they do). But the interesting things are found in nooks and crannies where nobody else has thought of looking before - so relying on someone else to do the heavy lifting is probably the wrong way to go about it...