This is the first I've heard of this. Where/what laws prohibit web scraping?

jacquesm · on March 16, 2017

cookiecaper · on March 16, 2017

It's almost always illegal in the United States. It's prohibited by a combination of the CFAA, copyright law, and contractual obligations imposed by Terms of Use, which are usually considered applicable if you load more than one page ("browsewrap").

The CFAA makes it a crime to access any computer network without authorization or in excess of granted authorization. The Terms of Use will usually prohibit "any automated or mechanical access" or use similar boilerplate that can be construed as a restriction on automated access. The implied license to make a copy of the page in RAM is no longer applicable and the scraper is thus infringing copyright.

Relevant cases are Craigslist v 3Taps, Facebook Inc. v Power Ventures Inc., and several others. This is at the point where it's basically well-established. The exception is Perfect 10 v. Amazon, where judges ruled that since it was Google and they don't want to break Google, it's OK. Copyright law allows such evaluations because each judge must decide whether a use was "fair" or not.

codedokode · on March 16, 2017

Doesn't publishing information on a public web server equals to granting authorization to download it? Why publish it otherwise?

And copyright laws are supposed to protect only creative works, not every page on the web.

cookiecaper · on March 16, 2017

>Doesn't publishing information on a public web server equals to granting authorization to download it? Why publish it otherwise?

This is the argument that there is an implied license. The counterargument is that the user agreed to the Terms of Use which explicitly defined automated access as violative. In addition, these cases generally begin with a cease and desist demand letter, which explicitly informs the allegedly-infringing party that the publisher believes their rights are being violated and that they must cease and desist immediately. If the TOS argument doesn't hold, the C&D will surely qualify as a revocation of any implied license to access the content for copyright purposes. It generally also serves as explicit notice that the publisher considers the accessor to be "exceeding authorized use" of their computer systems, which is a crime under the CFAA. In Craigslist v 3Taps, the judge also commented that needing to circumvent IP bans should've made it obvious that 3Taps was "exceeding authorized access" under the CFAA.

>And copyright laws are supposed to protect only creative works, not every page on the web.

Copyright law protects all works of sufficient originality. Pretty much the only thing it doesn't protect is a plain list of facts (and in the European Union, it even protects that, known as "database rights"). The minimum standard of originality for copyright protection applies to practically every page on the web, yes.

In effect, this means that you can copy a list of names and addresses from a phone book, but you can't copy the layout. Since you can't access a web page without making a copy in RAM, if the publisher has revoked your license to access the content, even accessing and extracting the raw factual information within the body of the page is an infringement (because your RAM copy is an infringing copy).

IANAL and this is based on my layman's understanding.

brilliantcode · on March 16, 2017

ToS is not a legally binding mutual agreement between the website and the scraper. Only when it's materialized in the form of C&D and it caused damage to the website. In the QVC vs Resultly, they caused a website crashed. In the Craigslist vs 3Taps, the damages were non-existant but they still won because of the C&D letters.

So if a website sends you a letter to stop, do not continue scraping them. Until that happens, it's fair game. ToS is not legally binding. Those two cases were the result of damage caused to the website.

STOP. SPREADING. FUD. cookiecaper. All of your comments are the same unsound legal advice yet Mozenda, Import.io, and a whole bunch of tools & service providers are humming along just fine.

Disclaimer: I'm not a lawyer, this is not a legal advice, consult a real lawyer and not random HN comments.

cookiecaper · on March 16, 2017

>In the Craigslist vs 3Taps, the damages were non-existant but they still won because of the C&D letters.

(Technically they settled.)

Read up on that case and you'll see that even absent a C&D, the judge reasoned that Craiglist's IP ban against 3Taps was a separate incident of affirmatively communicating its intention that 3Taps refrain from accessing the site:

    > The calculus is different where a user is altogether banned from accessing a website.
    > The banned user has to follow only one, clear rule: do not access the website. The notice
    > issue becomes limited to how clearly the website owner communicates the banning. Here,
    > Craigslist affirmatively communicated its decision to revoke 3Taps’ access through its ceaseand-desist
    > letter and IP blocking efforts. 3Taps never suggests that those measures did not
    > put 3Taps on notice that Craigslist had banned 3Taps; indeed, 3Taps had to circumvent
    > Craigslist’s IP blocking measures to continue scraping, so it indisputably knew that Craigslist
    > did not want it accessing the website at all.

The judge continues to suggest that using proxies at all is atypical and may demonstrate an intention to violate the CFAA. Ruling at http://www.volokh.com/wp-content/uploads/2013/08/Order-Denyi... .

>So if a website sends you a letter to stop, do not continue scraping them. Until that happens, it's fair game. ToS is not legally binding. Those two cases were the result of damage caused to the website.

No, that's incorrect. You can believe this all you want, and I truly hope that it all goes well for you. It's totally possible that you will never piss off someone who has the resources to file a lawsuit over it. But you should know that you can be bound by a browsewrap agreement (and that it's quite easy to be bound by a clickwrap agreement, which, in practice, may not be much different and which scrapers may automatically follow (e.g., a link that says "click here to enter and agree")).

It's really going to come down to the judge's belief that the notification is adequately prominent and that a "reasonably prudent user" would be aware of the stipulations.

Quoth from Nguyen v. Barnes and Noble (citations removed):

    > where, as here, there is no evidence that the website
    > user had actual knowledge of the agreement, the validity of
    > the browsewrap agreement turns on whether the website puts
    > a reasonably prudent user on inquiry notice of the terms of
    > the contract. [...] Whether a user has inquiry notice of a 
    > browsewrap agreement, in turn, depends
    > on the design and content of the website and the agreement’s
    > webpage. Where the link
    > to a website’s terms of use is buried at the bottom of the page
    > or tucked away in obscure corners of the website where users
    > are unlikely to see it, courts have refused to enforce the
    > browsewrap agreement. [...] On the other hand, where the website contains an
    > explicit textual notice that continued use will act as a
    > manifestation of the user’s intent to be bound, courts have
    > been more amenable to enforcing browsewrap agreements.
    > [...] In short, the conspicuousness and
    > placement of the “Terms of Use” hyperlink, other notices
    > given to users of the terms of use, and the website’s general
    > design all contribute to whether a reasonably prudent user
    > would have inquiry notice of a browsewrap agreement.

Full decision at https://d3bsvxk93brmko.cloudfront.net/datastore/opinions/201... .

>STOP. SPREADING. FUD. cookiecaper. All of your comments are the same unsound legal advice yet Mozenda, Import.io, and a whole bunch of tools & service providers are humming along just fine.

I'm not giving any legal advice as I'm not a lawyer. For the third or fourth time here, this is all according to my layman's understanding. It's based on things I learned that time I had to close my business or face a lawsuit from a massive company over just such issues.

It's crucial for companies that provide scraping services to be aware of these issues and I know of at least one such company who is aware of them and who takes several precautions to provide some distance from potential legal liability, though they are still not 100% out of the woods. As is usually required in entrepreneurship, they're taking a calculated risk. Should someone wage a legal challenge against their activity, they have millions of dollars in the bank from investors who presumably have researched this and are willing to accept the cost of the potential legal liability.

I'm not saying that people shouldn't make businesses that depend on scraping data. I just think they should know what they're getting into before they do so.

You are correct that some businesses have been able to engage in such activities without being sued out of existence up to this point. Unfortunately, that doesn't mean that others will be as lucky.

I fully agree that anyone who is seriously interested/concerned about this should ask a lawyer. I certainly did. Their answers were not good news for me. Maybe they will be for you.

brilliantcode · on March 16, 2017

So the answer is don't get IP banned. That's easy to solve.

cookiecaper · on March 16, 2017

That's potentially an answer if the judge decides that the browsewrap notice was not sufficiently conspicuous to constitute a binding agreement, etc.

I would guess that most judges would not be charitable to someone pretending that they've circumvented this by rotating through proxies pre-emptively. In fact, this would likely work against the defendant as it'd be evidence of willful infringement, which is typically 3x damages. If you can convince the judge that you were just incidentally rotating IPs to protect privacy or something, you might get away with it, but it's definitely not a simple answer, and that still only gets you up to the point of receiving a C&D.

I'm not sure why you're trying so aggressively to mislead people about the legal precariousness of data scraping, but at this point it should be clear that this is not a simple matter and it's not something to approach lightly or dismissively.

brilliantcode · on March 16, 2017

Well, looks like all the HN user's of Scrapy better lawyer up because Scrapy Cloud offers exactly that, as do rest of the web scraping vendors like Mozenda out on the market. They've all been around for 10+ years, doesn't seem like this is an issue for them.

cookiecaper · on March 16, 2017

Scrapinghub has several proactive/preventative restrictions on the sites they'll allow users to access because they're trying to avoid such liability. They've been successful up to this point and that's great. That doesn't mean that what they're doing is not a legal grey area.

For scraping-related activities, Scrapinghub would probably be the party sued, as was the case in 3Taps, though the clients could probably also be legitimately sued for various things, most obviously copyright infringement.

Again, I'm really not sure what you're getting at here. Yes, it's a great idea to check with a lawyer and assess your potential legal exposure. That's why lawyers exist! You can then ask them questions, as Scrapinghub surely has, about how to minimize that potential legal exposure. You definitely SHOULD do that, especially since scraping is more or less illegal in the United States.

Courts frequently use an analogy to private physical property to address the matter of accessing a web site. Running a business based on scraping until someone sends you a C&D is roughly the same as running a business based on trespassing on private property until someone serves you with a no-trespass order.

Maybe it will work out fine, and most of the time, as long as you leave the property promptly upon request, you probably won't have an issue just because there's no benefit in dragging the matter out further. But that doesn't mean there isn't legal risk involved in running such a business, nor does it mean that you won't be liable for damages incurred whilst trespassing.

In such a case, questions about whether the borders of the property were clearly delineated, whether "No Trespassing" signs were posted, whether a reasonable person would've understood they weren't allowed to be there or not, etc., would be asked to determine the existence and/or extent of the trespasser's liability.

In the same manner, there is substantial risk involved in running a business whose primary function is to scrape websites, and the same types of questions would be (are) asked in a court case related to network access. People deserve to be informed of that.

That's not FUD, it's just the law. If you don't like it, well, most people who know what they're talking about don't either, but that doesn't change the law. Saying "$Party_X hasn't been sued over it!" also doesn't change the law or make the process any less legally risky.

If you find this arrangement unsettling or absurd, as you obviously do, I would suggest that you direct your energies/attention to your local representatives, the EFF, and other types of political activism that may help rectify the situation rather than accusing HN commenters of spreading FUD.

When you do something illegal, you probably won't get sued for it, because it costs a ton of money to sue someone and it's not likely that you're annoying anyone enough to justify that. This is especially the case if you back off at the first sign of annoyance. That's as much as we can say for your angle.

If you're comfortable basing a business on that, be my guest.

brilliantcode · on March 16, 2017

The more I read your comment, the less I'm worried. It's clear that you are not a lawyer but someone just overly reacting to perceived legal liabilities by simply generalizing court cases and attempting to reach a conclusion that tries to fit everyone.

Businesses that utilize web scraping to achieve business goals at a direct expense of another business will get you in trouble not because of web scraping but simply trying to create competition. Businesses with a large cash use litigation to snuff out competition because their businesses are largely undefensible without such forceful litigation ex. craigslist would not exist if they let anyone scrape them.

Businesses that build and sell web scraping sevices and tools are less likely to be impacted for the same reasons if they comply with formal requests to stop scraping. 3Taps received notices beyond just IP ban (this alone does not set enough of a context) but they chose to ignore it and continue on. 3Taps had enough of a financial motivation on the line to put out their neck for their customer, PadMapper. Pretty fucking stupid if you ask me, no one customer is worth risking the entirety of your business operation.

It's far more likely that the law exists to serve those who exploit it to protect their business interests. Generalizing and extrapolating based on a few court cases with their own dynamic set of variables and exceptions as fact is dangerous advice.

I just want to warn people reading your comments not to take it word for word as the reality is far far less legally hostile-you are too small for people to go after and not an existential threat to the target website.

The argument that web scraping puts strain on web servers is a pretty laughable defense. Craigslist alone gets millions of hits every day but can't serve pages requested by a python script? 3taps fucked themselves because they took money AND they put their neck out for their customer.

That's the lesson here, don't risk your entire business for one customer. It's not fair to the rest of your customer base.

cookiecaper · on March 16, 2017

>It's clear that you are not a lawyer

It should be, because I've stated it probably 6 times in this thread.

>someone just overly reacting to perceived legal liabilities by simply generalizing court cases and attempting to reach a conclusion that tries to fit everyone.

So for the seventh time, I'm not a lawyer, but isn't this how it works when questions about legality are posed? It's always based on the relevant statutes and the case law interpreting and applying those statutes. I mean, correct me if I'm wrong.

I'm glad you're not worried about someone looking at the case law and making a generalization about how it applies to the field.

If you want specific (i.e., non-generalized) legal information, you always need to discuss your individual affairs with a licensed attorney who is knowledgeable in the field and jurisdictions in which you'll be operating.

In practical terms, web scraping is usually illegal in the United States. In this case, that doesn't mean there's a law that says "web scraping is illegal", it means that there is a small group of laws, which, taken together, make it virtually impossible to scrape web pages with confidence that you're not getting exposed to potentially serious legal liability. Note that "illegal" is not the same as "criminal", but that the CFAA does provide for criminal penalties (and Aaron Swartz was being prosecuted under them for scraping research papers out of an academic database).

>Businesses that utilize web scraping to achieve business goals at a direct expense of another business will get you in trouble not because of web scraping but simply trying to create competition.

You're talking about the likelihood that a business will get sued by someone. That's great, but it doesn't change the legal status of the activity that someone is unlikely to sue you for.

My business did not directly compete with anyone. Everyone thought it primarily helped the data sources we used. People always told me that they were shocked that the company that was making the threat was upset about it. Even my lawyer said it seemed unusual and couldn't figure out what their underlying motive was.

The stakes are an important consideration, but yes, it is important to consider the impact if you do get sued/threatened by an unlikely plaintiff.

>3Taps received notices beyond just IP ban (this alone does not set enough of a context)

The 3Taps ruling casts doubt on the suggestion that an IP ban is itself insufficient notice. That issue hasn't been decided directly afaik, but the reasonable conclusion, if you are getting a 403 or a page that explicitly informs you your IP has been banned when you access a site, is that they are trying to keep you out and that further access likely violates the CFAA.

>3Taps had enough of a financial motivation on the line to put out their neck for their customer, PadMapper. Pretty fucking stupid if you ask me, no one customer is worth risking the entirety of your business operation.

That's definitely the risky side of the equation. The alternative side was that they'd win and be allowed to retain access to one of the largest data sources on the internet, and preferably set a precedent that allowed them to continue to scrape big data sources without concern moving forward. That gamble clearly did not pay off for them, but that doesn't mean it wasn't a reasonable gamble to take.

>It's far more likely that the law exists to serve those who exploit it to protect their business interests.

I agree, but I don't see how it's relevant. Lots of people believe that it's beneficial to their business interests to use the legal system to bully people who can't afford to stand up for themselves. Uh, congrats to them I guess? Why are you saying this like it's a normal thing? We should take steps to minimize the surface area that can be used for that.

If you're suggesting there is a small handful of bad guys to whom these laws need to apply, that's fine and I actually agree with you, but that means we need to fine-tune the law so that it only covers the bad guys, not virtually everyone if someone you're scraping is having a bad day.

You keep fighting this fight pretending like I'm saying something that's incorrect, and then you just come back and say that it doesn't matter because a) some people who scrape have not been sued; and b) people who start scraping business may not get sued if they adhere to the requests of those who politely ask them to stop. That's great, but it's neither here nor there. This is about what the law is, not whether you're going to be sued personally.

>Generalizing and extrapolating based on a few court cases with their own dynamic set of variables and exceptions as fact is dangerous advice.

It's all anyone can do when you're dealing with an emerging area of law, afaik.

>I just want to warn people reading your comments not to take it word for word as the reality is far far less legally hostile-you are too small for people to go after and not an existential threat to the target website.

Yes, this is another thing I've stated multiple times. You probably won't get anyone mad enough at you to sue you. But you should know where you stand if you do. And you should try to fix the law in the meantime.

>The argument that web scraping puts strain on web servers is a pretty laughable defense.

Plaintiffs use this argument all the time and get injunctions filed on that basis regularly. Even if the defendant is not disruptive, judges say they need to issue the injunction or it will invite a pile-on effect that will be disruptive. Thus, they grant an injunction under a trespass to chattels doctrine, generally putting legal force behind a C&D.

>3taps fucked themselves because they took money AND they put their neck out for their customer.

3taps fucked themselves only because they tried to stand up and win the case. Perhaps it would've been better for them to try to lobby Congress instead and get the law transformed into something semi-reasonable, though it's likely they recognized the futility in that.

>That's the lesson here, don't risk your entire business for one customer. It's not fair to the rest of your customer base.

It seems like the lesson is that web scraping is legally precarious, and that if you're not careful about it, you can end up in a lot of hot water.

You keep acting like that's an absurd conclusion, but not really showing anything to discount the onerous outcomes that entrepreneurs in this space have faced. 3Taps is not the only case where this has been addressed.

In Facebook v. Power Ventures, the corporate veil was pierced and the entrepreneur was left with $3 million in personal liability, all for trying to create software that made it easy for a user to save their own data only out of Facebook. Facebook acknowledged that it did not have any copyright interest allowing it to forbid Power from accessing that data specifically, but they continued to pursue copyright claims based on the RAM copy of the Facebook site from which the content was extracted.

The point is that the current law makes scraping a perilous exercise. Perhaps you won't have problems, but that's probably only the case if a) you stay so small no one will ever target you or b) you know the law and you take extra precautions to protect your business so that any accusations of wrongdoing are clearly invalid against current law. Scrapinghub is trying to do this, but IMO it's insufficient if they get an aggressive/hostile litigant.

The truth is that Scrapinghub et al are on the precipice and they're going to stay there until precedent changes (likely through a SCOTUS override, particularly one overturning the RAM copy doctrine, which is probably plausible, and one putting constraints on the ability to revoke access to public web sites under the CFAA, which is probably not) or until the law changes. They only need to get hit with one well-placed lawsuit and they'll be goners.

You can argue til the cows come home about how they won't get sued because they stop once they get a C&D, but that's not necessarily true, and that doesn't fix the laws around scraping.

codedokode · on March 16, 2017

I think that is wrong too. I think that copyright laws should make a distinction between making and distriubuting a copy by a person (for example uploading copyrighted file to a website) and technical processes that happen inside a computer. Copying something from NIC buffers to memory should not be "copying" under copyright law.

cookiecaper · on March 16, 2017

The law tries to make this distinction by specifying that copies have to be fixed into a tangible medium to be infringing. The problem is that the "RAM copy doctrine", as it's known, states that RAM is a sufficiently fixed copy into a sufficiently tangible medium to qualify. This doctrine has been used against scrapers repeatedly, as in Ticketmaster LLC v. RMG Technologies, Inc. (https://casetext.com/case/ticketmaster-llc-v-rmg-technologie...) :

    > The copies of webpages stored automatically in a computer's cache or 
    > random access memory ("RAM") upon a viewing of the webpage fall within 
    > the Copyright Act's definition of "copy." See, e.g., MAI Systems Corp. 
    > v. Peak Computer, Inc., 991 F.2d 511, 519 (9th Cir. 1993) ("We recognize 
    > that these authorities are somewhat troubling since they do not specify 
    > that a copy is created regardless of whether the software is loaded into 
    > the RAM, the hard disk or the read only memory (`ROM'). However, since 
    > we find that the copy created in the RAM can be `perceived, reproduced, 
    > or otherwise communicated,' we hold that the loading of software into 
    > the RAM creates a copy under the Copyright Act.") See also Twentieth 
    > Century Fox Film Corp. v. Cablevision Systems Corp., 478 F.Supp. 2d 607, 
    > 621 (S.D.N.Y. 2007) (agreeing with the "numerous courts [that] have held 
    > that the transmission of information through a computer's random access 
    > memory or RAM . . . creates a `copy' for purposes of the Copyright Act," 
    > and citing cases.) Thus, copies of ticketmaster.com webpages 
    > automatically stored on a viewer's computer are "copies" within the 
    > meaning of the Copyright Act.

TeMPOraL · on March 16, 2017

That's a very, very sad turn of events, and I have to wonder, how did we get there?

I'm increasingly feeling that the law is giving way too much control over content published on the Internet to the publishers.

cookiecaper · on March 16, 2017

I agree. There is a lot more fairness in physical space that doesn't translate to cyberspace primarily due to the implementation details of computers and networks. Whereas products and machines built in the real world are primarily protected by things like patents and trade secrets, practically everything in the digital world falls under uber-restrictive copyright protections, since the "creative" work of code and its compiled/interpreted derivatives is the language by which everything is implemented.

Similarly, concepts like the "first sale doctrine" are becoming less applicable with digital delivery, as it's impossible to identify a "hard copy" of something that may be eligible for resell. That completely obliterates the secondary market for many products that are accessed through computers, including software, games, movies, and books.

The CFAA essentially allows network operators to arbitrarily make someone a felon overnight. Reddit co-founder Aaron Swartz is the most prominent example of this; his criminal prosecution under the CFAA (for scraping publicly-funded research papers out of a database) was pending when he committed suicide.

We badly need digital rights reforms, but since major companies have been allowed to profit handsomely off these shifts and since they find it rather convenient to bully small innovators with serious legal threats, which are easy to craft in this climate, it doesn't seem that anyone is making this a priority.

tyingq · on March 16, 2017

There's even a very highly specific (online ticket sales) bill that passed congress: https://www.congress.gov/bill/114th-congress/senate-bill/318...