I still can't understand how this made it to the front page and received this many upvotes. What was so difficult about performing a Google search, and what golden key did this tutorial share?
Previously, one of the problems I've seen is that Google obfuscates the source on the results page. You can't just grep for what you want, since it's not there. They did this in a push to get you to use their search API, but for some odd reason, the search API returns different results than a browser does. I find this very useful in that regard.
I tried this a couple of months ago. I could easily extract the URL of each result with a regex. However, that URL was a redirect URL on the google.com domain, so I made one more request to it, followed the redirect chain, and easily obtained the actual URL of the result.
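Roughly like this; a from-memory sketch, where the regex, query, and User-Agent are illustrative placeholders (Google's markup changes all the time):

    import html
    import re
    import requests

    # Fetch a results page and pull out the /url redirect links.
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": "example query"},
        headers={"User-Agent": "Mozilla/5.0"},
    )
    for path in re.findall(r'href="(/url\?q=[^"]+)"', resp.text):
        # hrefs are HTML-escaped in the page source, so unescape first
        url = "https://www.google.com" + html.unescape(path)
        # follow the google.com redirect chain to get the real result URL
        final = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        print(final.url)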
My company is behind a giant honking NAT and we get blocked all the time. Unusual volume for our "residential" IP block, I suppose. Google seems to do some averaging across CIDR blocks to detect likely bots. Any chance somebody is abusing your proxy IP? Is your proxy part of an IP allocation that's underutilized?
An easy way to get around it is to log in to a Google account. Google allows you to use even the most trashed and abused Tor exit nodes as long as you're logged in.
I commented earlier on this post that it was impossible for a bot to do a Google search. It turns out I had turned to the same resource you're recommending.
I'd also recommend sending a Python bot to your favorite news sites once a day for updates instead. Same deal: Beautiful Soup.
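Something like this, assuming the site puts headlines in <h2> tags; the URL and selector below are placeholders, not a real site:

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL; point this at the site you actually read.
    resp = requests.get("https://news.example.com")
    soup = BeautifulSoup(resp.text, "html.parser")

    # Adjust the tag/class to the site's real markup; <h2> is just a guess.
    for headline in soup.find_all("h2"):
        print(headline.get_text(strip=True))

Run it from a daily cron job and you've got your updates.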
On the other hand, Google has Custom Search, which is $100 a year for 20k queries. I use that as well.
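If you're hitting it programmatically, the Custom Search JSON API side is just an HTTPS endpoint. A minimal sketch, assuming you've set up an engine; both credential values below are placeholders:

    import requests

    # Placeholders: you get these from the API console and the CSE control panel.
    API_KEY = "YOUR_API_KEY"
    CX = "YOUR_SEARCH_ENGINE_ID"

    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": "example query"},
    )
    for item in resp.json().get("items", []):
        print(item["title"], item["link"])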
It looks like Google is shutting down the paid version of Custom Search Engine:
"On April 1st, 2017, Google will discontinue the sales of Google Site Search, the paid version of Custom Search Engine. All new purchases and renewals must take place before this date. This product will be completely shut down by April 1st, 2018." Source: https://cse.google.com/cse/
This depends on the rate at which you are making requests. If the rate is low, then Google doesn't bother. I have successfully scraped Google by waiting longer between requests.
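For what it's worth, with the same package the article uses, that's just a matter of cranking up the pause and adding some jitter between queries; the numbers below are only what happened to work for me:

    import random
    import time

    from google import search  # same PyPI "google" package the article uses

    queries = ["first example query", "second example query"]

    for query in queries:
        # pause is how long the library sleeps between its own HTTP requests
        for url in search(query, num=10, stop=10, pause=30.0):
            print(url)
        # extra randomized delay between whole queries, to stay under the radar
        time.sleep(random.uniform(60, 120))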
Running the code from the article, this is what you get:

    Traceback (most recent call last):
      File "0a51cb60301828feeaf66cbc908297ae.py", line 9, in <module>
        for j in search(query, tld="co.in", num=10, stop=1, pause=2):
    NameError: name 'search' is not defined
Had the tutorial not wrapped the import in

    try:
        from google import search
    except ImportError:
        print("No module named 'google' found")
and just used
    from google import search
one would have gotten a sane error message:
    Traceback (most recent call last):
      File "00359ada2bfcd6dbabd1fa0207e683b8.py", line 1, in <module>
        from google import search
    ImportError: No module named google
Catching ImportError when you have no fallback is a pointless thing to do.
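For contrast, here's the classic case where catching ImportError does make sense, because there is a real fallback:

    try:
        # prefer the faster third-party parser if it happens to be installed...
        import simplejson as json
    except ImportError:
        # ...and genuinely fall back to the stdlib module otherwise
        import json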
Maybe it was written by programmers trained on compiled languages, with no interpreter to automatically catch all exceptions and print a traceback for them, so now they're cargo-culting the idea that every exception must get caught to avoid undefined behavior.
They let some obvious bots through if you go slow enough. A lot of SEO and AdWords tracking companies still do unofficially tolerated automated Google searches.
I assume allowing some low amount of obvious botting prevents people from developing really sneaky bots that are much harder to block. It's probably to prevent an arms race that Google may not be able to win.
You might be okay, but I would try to get some insider info on how AdWords analytics companies get away with it. Might just be sneaky bots. A lot of idiots write code that pings or loads google.com to check for internet access, so they can't block it completely.
The good news is that it's reborn as an ad-supported version. Google doesn't want your dirty cash when your competitors are willing to pay much more for ad placement on your site.
If this were a post on best practices for botting Google, I would understand it being front-page material. Does anyone have any insight into this kind of botting?