
Better solution: pay target-site.com to start building an API for you.

Pros:

* You'll be working with them rather than against them.

* Your solution will be far more robust.

* It'll be way cheaper, supposing you account for the ongoing maintenance costs of your fragile scraper.

* You're eliminating the possibility that you'll have to deal with legal antagonism.

* Good anti-scraper defenses are far more sophisticated than what the author dealt with. As a trivial example: he didn't even verify that he was following links that would be visible to a real user! (See the rough sketch at the end of this comment.)

Cons:

* Possible that target-site.com's owners will tell you to get lost, or that they're simply unreachable.

* Your competitors will gain access to the API methods you funded, perhaps giving them insight into why that data is valuable.

Alternative better solution for small one-off data collection needs: contract a low-income person to just manually download the data you need with a normal web browser. Provide a JS bookmarklet to speed their process if the data set is a bit too big for that.
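On the "visible links" point above, here's roughly what that check looks like -- a minimal sketch, assuming Selenium with Chrome; the URL is a placeholder and real honeypot defenses go well beyond this:

    # Only follow links a human could actually see, to avoid trap links.
    # Assumes selenium is installed and chromedriver is on the PATH.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://target-site.example/listing")  # placeholder URL

    visible_links = []
    for a in driver.find_elements(By.TAG_NAME, "a"):
        # is_displayed() catches display:none, visibility:hidden, zero-size
        # elements, etc. -- the common ways a hidden trap link is planted.
        if a.is_displayed():
            href = a.get_attribute("href")
            if href:
                visible_links.append(href)

    driver.quit()
    print(visible_links)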



Having been the victim of a VERY badly behaved scraper, I'm willing to listen to this. When that "attack" was going on, we talked about doing that very thing, if only the scraper would identify himself. (We were able to identify the actual culprit, and circumstantial evidence suggested they were going after our complete price list for a client.)

The cost of the bad scraper was pretty significant. They were hitting us as hard as they could, through TOR nodes and various cloud providers. But the bot was badly written, so it never completed its scan. It got into infinite loops, and triggered a whole lot of exceptions. It caused enough of a performance drain that it affected usability for all our customers.

We couldn't block them by IP address because (a) it was just whack-a-mole, and (b) once they started coming in through the cloud, the requests could have been from legit customers. We eventually found some patterns in the bot's behavior that allowed us to identify its requests and block it. But I'd have been willing to set up a feed for them to get the data without the collateral damage.
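For the curious, the kind of pattern filter I mean looked something like this -- a toy sketch only, not the actual rules we used; the user-agent and path patterns here are invented:

    # Toy request-fingerprint filter; every pattern below is made up.
    import re

    SUSPECT_PATH = re.compile(r"/price-list\?page=\d{4,}")  # absurdly deep paging
    SUSPECT_AGENTS = ("python-requests", "curl")

    def looks_like_the_bot(path, headers):
        ua = headers.get("User-Agent", "").lower()
        if not ua or any(ua.startswith(agent) for agent in SUSPECT_AGENTS):
            return True
        # The real tell was a combination of quirks, never a single signal.
        if SUSPECT_PATH.search(path) and "Referer" not in headers:
            return True
        return False

    # e.g. in middleware: reject the request with a 403 when this returns True.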


Story doesn't add up.

First, it's very hard to pull off a DDOS attack using Tor. The most you could get would be less than someone repeatedly pressing refresh every second. This is because if you hit the same domain repeatedly the network will flag and throttle you.

How bad was your server configuration that it would choke if somebody tried to scrape it? Was this running on a Dreamhost $10/year server or something? That's the only way to explain its poor performance. Either that or your SQL queries are unoptimized anyway.

I'm just trying to understand. Unless this was like 10,000 scraper instances trying to scrape your website, I find it hard to believe this story.

Instead of downvoting, why don't you offer a rebuttal to what I wrote and post more evidence to support your original story?


> This is weird. First, it's very hard to pull off a DDOS attack using Tor. The most you could get would be less than someone repeatedly pressing refresh every second.

Please explain. Why do you think Tor can't provide a user with many RPS?


The network, as I understand it, will automatically throttle and flag you if you are firing too many RPS, especially if you are hitting a particular domain over and over. So it's not possible to take down websites with Tor unless they're running on Dreamhost's shared hosting plan with a PHP solution.

This is why I find OP's story hard to believe, it doesn't add up.


Tor does nothing of the sort. In order to throttle a client, there would need to be a central authority that could identify connections by client, which would very much defeat the purpose of Tor. And besides, how would it deal with multiple Tor clients for the same user?

That said, it's not particularly effective as a brute-force DoS machine due to the limited bandwidth capacity and high pre-existing utilisation. Higher-level DoS by calling heavy dynamic pages is still possible.

The parent didn't specify that the outages were during the period that the scraping was coming from Tor. It's equally possible that it only started affecting availability after they blocked Tor and switched to cloud machines.

All that said, screw people who use Tor for this kind of thing. They're ruining a critical internet service for real users.


CWuestefeld wrote "They were hitting us as hard as they could, through TOR nodes and various cloud providers."

I think you missed the second part of that sentence. I must admit that I worked for a company that did that to scrape a well-known business networking site...


> Better solution: pay target-site.com to start building an API for you.

I'd add to this:

Do you really need continuing access? Or just their data occasionally?

Pay them to just get a db dump in some format. For large amounts of data, creating an API then having people scan and run through it is just a massive pain. Having someone paginate through 200M records regularly is a recipe for pain, on both sides.

A supported API might take a significant amount of time to develop, and has ongoing support requirements, extra machines, etc. Then you need your own infrastructure or long-running processes to hammer it and get the data as fast as you can, with network errors and all kinds of other intermittent problems to handle.

A pg_dump > s3 dump might take an engineer an afternoon and will take minutes to download. It requires approval from a significantly lower level and has a much easier-to-estimate cost of providing.
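Something along these lines -- a rough sketch, assuming boto3 and that pg_dump is available; the bucket name and connection string are placeholders:

    # Dump the database and push the compressed result to S3.
    # Assumes AWS credentials are already configured for boto3.
    import subprocess
    import boto3

    DUMP_FILE = "export.sql.gz"

    # -Z 9 gzips the plain-text dump; the connection string is a placeholder.
    subprocess.run(
        ["pg_dump", "-Z", "9", "-f", DUMP_FILE,
         "postgresql://readonly@db.internal/production"],
        check=True,
    )

    s3 = boto3.client("s3")
    s3.upload_file(DUMP_FILE, "example-data-exports", "dumps/export.sql.gz")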


> pay target-site.com to start building an API for you.

When has that ever worked?


I can't cite specific examples because the ones I know about formed confidential business relationships, but I can say with confidence that this works All. The. Time.

That said, if you're some small-time researcher who can't offer a compelling business case to make this happen, then it won't be worth their time and they're likely to show you the door. [Note: my implication here is that it's not because you're small-time, but because by the nature of your work you're not focusing on business drivers that are meaningful to the company/org you're propositioning.]

Edit: Also be warned that if you're building a successful business on scraped personal info, you're begging to be served w/ a class action lawsuit (though take that well-salted, because IANAL and all that jazz).


Most of the time, in my (admittedly limited) experience. The two exceptions have been:

A giant site, which already had an API & had deliberately decided not to implement the API calls we wanted. I should add that another giant site has happily added API options for us. (For my client, really; not for me.)

An archaic site. The dev team was gone & the owners were just letting it trickle revenue to them until it died -- they didn't even want to think about it anymore.


When the money is good enough :) . That is, usually not for startups, yes for established companies with money.


Never, in my personal attempts.


Many of the scrapers we build are for cases where a partner has data or tools that are only exposed in the UI, or where the UI versions are much more capable.

Rather than waiting potentially years for their IT team to make the required changes, we can build a scraper in a matter of days.


I can attest to this. From personal experience, I've found websites that ignore scrapers and just allow me to access the data on their public website to be easier to deal with, both code-wise and time-wise. I make the request, you give me the data I need, and then I can piss off.

Websites that make it a cluster&&*& to get access to the data accomplish two things: they set up a challenge for me to break their idiotic `are you a bot?` checks, and secondly, in most situations it's trivial just to spin up a VM and run Chrome with Selenium and a Python script anyway.

Granted, I don't use an AJAX API or anything like that. Instead, I've found that developers who natively embed a JSON string alongside the data within the HTML are the easiest to parse.
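As a rough sketch of what I mean -- the URL, script-tag id, and field names here are all made up, since every site embeds its data a bit differently:

    # Pull the JSON blob a site embeds alongside its rendered HTML.
    # Assumes requests and beautifulsoup4 are installed.
    import json
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example-listings.test/search?q=2br", timeout=30)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    tag = soup.find("script", id="__INITIAL_STATE__")  # id varies per site
    if tag is None:
        raise RuntimeError("embedded JSON not found; the markup may have changed")

    data = json.loads(tag.string)
    for listing in data.get("listings", []):
        print(listing.get("address"), listing.get("price"))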

Reasons why I've set up bots/scrapers: 1) my local city rental market is very competitive and I hate wasting time emailing landlords who have already signed a lease, 2) house prices, 3) car prices, 4) stock prices, 5) banking, 6) water rates, 7) health insurance comparison, 8) automated billing and payment systems.


>Alternative better solution for small one-off data collection needs: contract a low-income person to just manually download the data you need with a normal web browser. Provide a JS bookmarklet to speed their process if the data set is a bit too big for that.

This is a good idea that has some interesting legal implications (e.g., the target site's network is never accessed by the software, so CFAA claims are likely irrelevant), but probably isn't enough to cover all the bases. I wanted to try something like this before I got C&D'd, but my lawyer informed me that doing it after the fact could potentially constitute conspiracy and cause a lot of problems.

I'm not a lawyer.


My son has done a few of these for me. :) Smaller sites, one-time grabs. But yes, persisting AFTER a direct request that you stop is usually both uncool and legally risky.


This can't be a solution for people using web scraping:

The goal for web scrapers is to pay as little as possible for as much data as possible.


Depends on the scraper. I buy data dumps when I can. Plus, it can actually be cheaper to enter into a business relationship with the target site than it is to play whack-a-mole with their anti-scraping development efforts over time.


Also known as the tcgplayer.com strategy. Very disappointing to find out about, especially when the margins on hobby-level Magic: the Gathering card selling are already so low.



