Hacker News
Performing Google Search Using Python (geeksforgeeks.org)
94 points by mkalygin on June 6, 2017 | 29 comments


I still can't understand how this made it to the front page and received this many upvotes. How was it difficult to perform a Google search, and what golden key did this tutorial share?


One of the problems I've seen previously is that Google obfuscates the source of the results page. You can't just grep for what you want since it's not there. They did this in a push to get you to use their search API, but for some odd reason, their search API returns different results than a browser. I find this very useful in that regard.


I tried this a couple of months ago. I could easily extract the URL of each result using a regex. However, this URL was a redirect URL on the google.com domain. At that point, I made one more request to the redirect URL, followed the redirect chain, and easily obtained the actual URL of the result.
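
A minimal sketch of that last step, using only the standard library. The names `fetch_location` and `resolve_redirects` are mine, not from the article or the google package, and the sketch assumes the redirect is a plain HTTP 3xx with a Location header (a JavaScript or meta-refresh redirect would not be caught):

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_location(url):
    """Return the Location header of a redirect response, or None if there is none."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        opener.open(url, timeout=10)
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.headers.get("Location")
        raise
    return None


def resolve_redirects(url, get_location=fetch_location, max_hops=10):
    """Follow redirects hop by hop and return the final URL."""
    for _ in range(max_hops):
        target = get_location(url)
        if target is None:
            return url
        url = urljoin(url, target)  # Location may be a relative path
    raise RuntimeError("too many redirects")
```

If you don't care about the intermediate hops, `urllib.request.urlopen(url).geturl()` follows the chain for you and returns the final URL. The `get_location` parameter is only there so the hop logic can be exercised without touching the network.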


And Google marks your IP as a bot. I did this three years ago on one of my DO proxies. Google still asks me to solve a captcha when I do a search.


My company is behind a giant honking NAT and we get blocked all the time. Unusual volume for our "residential" IP block, I suppose. Google seems to do some averaging across CIDR blocks to detect likely bots. Any chance somebody is abusing your proxy IP? Is your proxy part of an IP allocation that's underutilized?

An easy way to get around it is to log in to a Google account. Google allows you to use even the most trashed and abused Tor exit nodes as long as you're logged in.


Indeed. If you're looking for programmatic web search, I'd suggest you go the Bing API route.


I commented earlier on this post about it being impossible for a bot to do search on Google. Turns out I turned to the same resource you're recommending. I'd also recommend sending a python bot to your favorite news sites once a day for updates instead. Same deal, Beautiful Soup. On the other hand, Google has Custom Search, which is $100 a year for 20k queries. I use that as well.


Google CSE seems like the most fitting solution, which appears to do pretty much the same thing as the article, but without all the hackery nonsense:

Step 1 - Set up a CSE to search the entire web: https://support.google.com/customsearch/answer/2631040?hl=en

Step 2 - Use the CSE API: https://developers.google.com/custom-search/json-api/v1/over...


It looks like Google is shutting down the paid version of Custom Search Engine: "On April 1st, 2017, Google will discontinue the sales of Google Site Search, the paid version of Custom Search Engine. All new purchases and renewals must take place before this date. This product will be completely shut down by April 1st, 2018." Source: https://cse.google.com/cse/


If you want news, use RSS. Almost every news site is using it.
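
For what it's worth, reading a feed needs nothing beyond the standard library. A rough sketch, assuming a plain RSS 2.0 document; the sample feed and the `parse_rss_items` name are made up for illustration, and Atom feeds use different element names:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real script would download this from the site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>https://news.example/1</link></item>
    <item><title>Second story</title><link>https://news.example/2</link></item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


headlines = parse_rss_items(SAMPLE_FEED)
```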


This depends on the rate at which you are making requests. If the rate is low, then Google doesn't bother. I have successfully scraped Google by waiting longer between requests.
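
The article's own call already passes `pause=2` to `search`. If you are rolling your own loop, the same idea looks roughly like this. It's a sketch only: `throttled` is a hypothetical helper, and the default delay and jitter are guesses, not thresholds Google publishes:

```python
import random
import time


def throttled(queries, pause=2.0, jitter=1.0, sleep=time.sleep):
    """Yield each query, sleeping between consecutive ones (not before the first)."""
    for index, query in enumerate(queries):
        if index:
            # Add a random extra delay so requests are not evenly spaced.
            sleep(pause + random.uniform(0, jitter))
        yield query
```

Usage would be something like `for q in throttled(my_queries, pause=30): ...`, with the actual request made inside the loop body.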


    Traceback (most recent call last):
      File "0a51cb60301828feeaf66cbc908297ae.py", line 9, in <module>
        for j in search(query, tld="co.in", num=10, stop=1, pause=2):
    NameError: name 'search' is not defined
Edit: Maybe imports don't work.


You missed the

  Output:
    No module named 'google' found
below.

If they cut out this nonsense:

  try:
    from google import search
  except ImportError: 
    print("No module named 'google' found")
and just used

  from google import search
One would have gotten a sane error message:

  Traceback (most recent call last):
    File "00359ada2bfcd6dbabd1fa0207e683b8.py", line 1, in <module>
      from google import search
  ImportError: No module named google
Catching ImportError when you have no fallback is a pointless thing to do.


First thing that jumped out at me when I was reading the code.

Why would you wrap an import in a try/except unless you want to import something else when the first one fails?

And if you don't, you should exit once the import fails rather than let it continue and hit an inevitable failure.
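
Roughly what I mean, as a sketch. `import_first_available` is a made-up helper and "not_a_real_json_package" is a deliberately nonexistent name standing in for a preferred third-party module:

```python
import importlib
import sys


def import_first_available(*names):
    """Return the first module in `names` that imports; raise ImportError otherwise."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # legitimate catch: there is a fallback to try
    raise ImportError("none of these modules could be imported: " + ", ".join(names))


try:
    # Fallback case: the first name does not exist, so the stdlib json module is used.
    json_mod = import_first_available("not_a_real_json_package", "json")
except ImportError as err:
    # No fallback left: stop here instead of failing later with a NameError.
    sys.exit(f"cannot continue: {err}")
```

With no fallback at all, a bare `from google import search` does the same job, since the uncaught ImportError already stops the script with a readable traceback.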


Maybe it was written by programmers trained on compiled languages with no interpreter to automatically catch all exceptions and print a traceback for you, so now they're cargo culting onto "Every exception must get caught to avoid undefined behavior."


> You missed the

  Output:
    No module named 'google' found
I didn't miss that. I knew it wouldn't work. Just gave it a try :)


I'm not the author of the package, but it works for me. How do you import the package?

It should be `from google import search`.


Go to the link. Click Run on IDE, then click Run at the bottom of the new tab, see if you don't get the error :)

The code snippets may run on your device successfully, but I was testing the UI at the site.


You are right, it doesn't work on the site IDE. Looks like the google module has not been installed in their environment?


For security reasons, some people keep external source from running.


There is also this[1]. Used it around a year ago. Not sure if it still works though.

[1]https://github.com/rnikhil275/pygoogle


This is good work. I thought there was no way to do this since Google Search blocks bots.


They let some obvious bots through if you go slowly enough. A lot of SEO and AdWords tracking companies still run automated Google searches that are unofficially tolerated.

I assume allowing a low volume of obvious botting keeps people from developing really sneaky bots that are much harder to block. It's probably meant to prevent an arms race that Google may not be able to win.


What an incredibly good point.

That gives me hope, as someone who wants to do low-frequency data analysis that's only possible via multiple queries.


You might be okay, but I would try to get some insider info on how AdWords analytics companies get away with it. Might just be sneaky bots. A lot of idiots write code that pings or loads google.com to check for internet access, so Google can't block automated traffic completely.


I redundantly made the following comment upon referring to this comment later in this thread:

Try Google Custom Search as well, $100 a year for 20k queries; I'm a client.


Good evening, I regret to inform you that Google custom search is dead.

http://fortune.com/2017/02/21/google-site-search-discontinue...

The good news is that it's reborn as an ad-supported version. Google doesn't want your dirty cash when your competitors are willing to pay much more for ad placement on your site.


Google Site Search is dead, but Google Custom Search is not.


If this were a post on best practices for botting Google, I would understand it being front-page material. Does anyone have any insight into this kind of botting?



