I still can't understand how this made it to the front page and received this many upvotes. What was so difficult about performing a Google search, and what golden key did this tutorial share?
Previously, one of the problems I've seen is that Google obfuscates the source on the results page. You can't just grep for what you want, since it's not there. They did this in a push to get you to use their search API, but for some odd reason, the search API returns different results than a browser does. I find this very useful in that regard.
I tried this a couple of months ago. I could easily extract the URL of each result with a regex. However, that URL was a redirect URL on the google.com domain, so I made one more request to it, followed the redirect chain, and easily obtained the actual URL of the result.
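Roughly like this; a from-memory sketch, where the regex, query, and User-Agent are illustrative placeholders (Google's markup changes all the time):

    import html
    import re
    import requests

    # Fetch a results page and pull out the /url redirect links.
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": "example query"},
        headers={"User-Agent": "Mozilla/5.0"},
    )
    for path in re.findall(r'href="(/url\?q=[^"]+)"', resp.text):
        # hrefs are HTML-escaped in the page source, so unescape first
        url = "https://www.google.com" + html.unescape(path)
        # follow the google.com redirect chain to get the real result URL
        final = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        print(final.url)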
My company is behind a giant honking NAT and we get blocked all the time. Unusual volume for our "residential" IP block, I suppose. Google seems to do some averaging across CIDR blocks to detect likely bots. Any chance somebody is abusing your proxy IP? Is your proxy part of an IP allocation that's underutilized?
An easy way to get around it is to log in to a Google account. Google allows you to use even the most trashed and abused Tor exit nodes as long as you're logged in.
I commented earlier on this post that it was impossible for a bot to do a Google search. It turns out I had turned to the same resource you're recommending.
I'd also recommend sending a Python bot to your favorite news sites once a day for updates instead. Same deal: Beautiful Soup.
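Something like this, assuming the site puts headlines in <h2> tags; the URL and selector below are placeholders, not a real site:

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL; point this at the site you actually read.
    resp = requests.get("https://news.example.com")
    soup = BeautifulSoup(resp.text, "html.parser")

    # Adjust the tag/class to the site's real markup; <h2> is just a guess.
    for headline in soup.find_all("h2"):
        print(headline.get_text(strip=True))

Run it from a daily cron job and you've got your updates.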
On the other hand, Google has Custom Search, which is $100 a year for 20k queries. I use that as well.
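If you're hitting it programmatically, the Custom Search JSON API side is just an HTTPS endpoint. A minimal sketch, assuming you've set up an engine; both credential values below are placeholders:

    import requests

    # Placeholders: you get these from the API console and the CSE control panel.
    API_KEY = "YOUR_API_KEY"
    CX = "YOUR_SEARCH_ENGINE_ID"

    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": "example query"},
    )
    for item in resp.json().get("items", []):
        print(item["title"], item["link"])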
It looks like Google is shutting down the paid version of Custom Search Engine:
"On April 1st, 2017, Google will discontinue the sales of Google Site Search, the paid version of Custom Search Engine. All new purchases and renewals must take place before this date. This product will be completely shut down by April 1st, 2018." Source: https://cse.google.com/cse/
This depends on the rate at which you are making requests. If the rate is low, then Google doesn't bother. I have successfully scraped Google by waiting longer between requests.
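For what it's worth, with the same package the article uses, that's just a matter of cranking up the pause and adding some jitter between queries; the numbers below are only what happened to work for me:

    import random
    import time

    from google import search  # same PyPI "google" package the article uses

    queries = ["first example query", "second example query"]

    for query in queries:
        # pause is how long the library sleeps between its own HTTP requests
        for url in search(query, num=10, stop=10, pause=30.0):
            print(url)
        # extra randomized delay between whole queries, to stay under the radar
        time.sleep(random.uniform(60, 120))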
Running the code from the article, this is what you get:

    Traceback (most recent call last):
      File "0a51cb60301828feeaf66cbc908297ae.py", line 9, in <module>
        for j in search(query, tld="co.in", num=10, stop=1, pause=2):
    NameError: name 'search' is not defined
Had the tutorial not wrapped the import in

    try:
        from google import search
    except ImportError:
        print("No module named 'google' found")
and just used
    from google import search
one would have gotten a sane error message:
    Traceback (most recent call last):
      File "00359ada2bfcd6dbabd1fa0207e683b8.py", line 1, in <module>
        from google import search
    ImportError: No module named google
Catching ImportError when you have no fallback is a pointless thing to do.
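For contrast, here's the classic case where catching ImportError does make sense, because there is a real fallback:

    try:
        # prefer the faster third-party parser if it happens to be installed...
        import simplejson as json
    except ImportError:
        # ...and genuinely fall back to the stdlib module otherwise
        import json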
Maybe it was written by programmers trained on compiled languages, with no interpreter to automatically catch all exceptions and print a traceback for them, so now they're cargo-culting the idea that every exception must get caught to avoid undefined behavior.
They let some obvious bots through if you go slow enough. A lot of SEO and AdWords tracking companies still do unofficially tolerated automated Google searches.
I assume allowing some low amount of obvious botting prevents people from developing really sneaky bots that are much harder to block. It's probably to prevent an arms race that Google may not be able to win.
You might be okay, but I would try to get some insider info on how AdWords analytics companies get away with it. Might just be sneaky bots. A lot of idiots write code that pings or loads google.com to check for internet access, so they can't block it completely.
The good news is that it's reborn as an ad-supported version. Google doesn't want your dirty cash when your competitors are willing to pay much more for ad placement on your site.
If this were a post on best practices for botting Google, I would understand it being front-page material. Does anyone have any insight into this kind of botting?