Though I think that using copyrighted works as training data is totally a violation of copyright, OpenAI really needs to win. Copyright has been unreasonably extended to the point where it's untenable, and if we had to track down rights holders and negotiate 'training rights' we'd guarantee there would be no open models or competition going forward. The rights holders are going to lose this one, and it's probably for the better.
Ultimately it's gonna come down to new legislation, and there is just absolutely no chance of "big tech gets to exploit every copyrighted work on the planet for free". Definitely not in the US or Europe.
Big media companies don't want that, zillions of small creators don't want that, the public at large has no great affection for giant tech companies getting even more power.
Big Media isn't worried about you drawing Shrek or Donald Duck at home, by hand or by GPT. They won't like you distributing it profitably, but they already have tools against that. They'd also like a few billion dollars coming their way, but who wouldn't?
Small creators are worried about getting the next gig, since AI may take away the low end of the jobs; they are not worried about the copyright implications of LLMs.
The general public doesn't like tech companies but loves the products they make; nobody stopped using early YouTube or Telegram because piracy was happening there.
Isn’t that what “jury nullification” is? Deciding that a law is trash and you’re just going to do whatever instead? I realise this requires a jury which isn’t the case for everything.
And what do you think the consequence will be if OpenAI loses?
A: Big media companies will make billions from licensing, AI will only be produced by a few companies that can afford the licenses and creators will get an annual $10 check (see Spotify).
You forgot the last point, which is that creators who don't want their work used in the training data for these megacorp LLMs without permission will get what they want.
Try training an "open" model on Nintendo's, Disney's, and Elsevier's IP and see how long it takes them to bury you in lawsuits citing copyright infringement. The only way out of this would be to abolish copyright.
B: Development of AI accelerates. Necessity being the mother of invention and all that. After all, a human doesn't need to ingest every written word out there to show signs of intelligence.
"training data is totally a violation of copyright"
This really isn't clear because cognition is treated as a special exception to copyright. Every thought we have is derivative of everything we've seen before to some degree; reading a book makes our brains a derivative work. But we recognize that cognition is special.
With machines we tend to apply a strict test: Did copyright go in? If so, the output is almost certainly derivative.
With human brains, with cognition, it isn't enough to prove that a person has consumed a copyrighted work prior to having a thought -- instead we judge every thought individually as to its originality.
If we are in a position to apply similar cognitive rules to an LLM then the weights won't be derivative works and we will judge each output as to its originality rather than simply assume.
"This really isn't clear because cognition is treated as a special exception to copyright."
Actually, no. It's considered a transformative use. If you memorize a copyrighted play or piece of music and then perform it in public, that's a copyright violation. It's the literalness of the copy that matters.
No, that's totally incorrect, we do not consider every observation a "transformative use" as applied to the human mind. If you memorize a copyrighted play and write another play it is NOT inherently a copyright violation of everything which has come before. We just don't do that.
The new play is judged as to its originality.
People who have seen a play (everybody) are allowed to write new plays which aren't beholden to the copyright of the first play they've ever watched.
>> "training data is totally a violation of copyright"
> This really isn't clear because cognition is treated as a special exception to copyright.
Human cognition; not the latest algorithms and their output, which some enthusiastic software engineers eagerly confuse for cognition. It's actually pretty clear.
> The open question is how to handle machines that mimic the process.
It's not really an open question, except for software engineers who've talked themselves into thinking of humans as computers. A machine is not a human mind, so does not benefit from the legal exceptions and rights granted to the latter.
I remarked on how human cognition is treated as a magical process with respect to copyright law.
This is just a legal fact. It has nothing to do with how an LLM operates internally, or whether an LLM is at all similar to a human mind in terms of internal mechanics.
> The legal question of "does copyright go away if your violation is big enough?"
1) no similarities have ever been demonstrated between large language models and human cognition, and until that happens (spoiler: never) there is no basis in comparing them like this.
2) even if they were somehow proven to be the same there is still no reason why the same standards need to be applied to computer programs and humans because computer programs do not have any rights or legal protections.
3) cognition is not a "special exception to copyright" because it is entirely unrelated. Copyright is about who has the right to make copies. Your thoughts are not considered copies because they are intangible.
4) we do not "judge every thought individually as to its originality" because other people's thoughts are entirely opaque. Nobody is judging your thoughts, and if you think they are you need to take your medications.
"1) no similarities have ever been demonstrated between large language models and human cognition"
This is false. The LLM's entire purpose is to mimic cognition.
You could argue that the operation differs in important ways - of course. But the similarity of output is literally the entire point.
"2) even if they were somehow proven to be the same"
I didn't suggest they need to be the same, proven or otherwise. I think you're not understanding. The point is that the function is similar.
How it works doesn't necessarily matter.
"3) cognition is not a "special exception to copyright" because it is entirely unrelated. "
False as a matter of law.
"4) we do not "judge every thought individually as to its originality" because other people's thoughts are entirely opaque."
Also false as a matter of law. When you publish your thoughts (your works, your writing, whatever) they are judged as to their originality if the question of who owns the copyright is raised.
"Nobody is judging your thoughts, and if you think they are you need to take your medications."
There's no need to be snarky and disingenuous.
From the comment guidelines: Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
>This is false. The LLM's entire purpose is to mimic cognition.
Purpose and mechanism are not the same thing. "Similarity of output" does not make it equivalent.
>I didn't suggest they need to be the same, proven or otherwise. I think you're not understanding. The point is that the function is similar.
Sure, go ahead and ignore all but half a sentence and then accuse me of missing the point.
>False as a matter of law.
Show me the court case where somebody was found to have violated copyright law by thinking about something.
>When you publish your thoughts
You don't publish your thoughts. You publish essays, internet comments, articles, videos, etc based on what you are thinking and those are subject to copyright law.
>There's no need to be snarky and disingenuous.
How dare you, I would never disingenuously tell somebody who thinks his thoughts belong to other people to take their psychiatric medications. Of course I did mean that they should be prescribed by a licensed physician, and looking back I regret not stating that explicitly.
"The LLM's entire purpose is to mimic cognition." is your counterpoint to me saying that no peer-reviewed source has ever demonstrated a similarity between LLMs and human cognition. I'm talking about mechanism and you're talking about purpose.
Thank you for saying what I was going to say to this person. I'm so fucking tired of seeing people who probably have never opened a neuroscience textbook talk about cognition.
Did OpenAI buy a copy of every book in The Pile? Or did they all just fall off the back of a truck?
Aside from that question, I tend to agree with the judge - LLM outputs are obviously not derivative works or copies in any sense that we normally use the phrase.
OpenAI hasn’t trained on The Pile, as far as I know. I think you mean "Did Meta buy a copy of every book in books3?" since llama wasn’t trained on The Pile either. And the answer is no.
It seems important for the answer to remain no, otherwise the only entities that can afford to train on sufficient numbers of books will be big companies. No one can afford 190,000 books except huge corporations, so we’ll be surrendering our ability to train our own useful models if the price of training is that high.
190k books isn’t even that much. OpenAI probably trained on millions.
If the only issue is paying for a copy, thinking out loud: it would be interesting if some ebook lender like Libby let you queue up a whole bunch of books to borrow, train on them, then return them.
Copyright laws need to be reformed but I don't understand why we need to just completely upend the concept of Copyright just because it's inconvenient to these massive tech companies that want to hoover up all the data they can with as little interference and COST to them as possible. Let's not pretend these are some small startups by two employees working out of a garage who took out reverse mortgages on their homes.
I can see the argument it might reduce open models, but how is asking for permission (and maybe sometimes paying) of the rights holders going to reduce competition? It would be a level playing field. If you're a company that's working on a LLM, you have to ask permission to ingest data that doesn't belong to you. All companies working on an LLM would have to follow these rules. So how is it anti-competitive?
> Though I think that training data is totally a violation of copyright, OpenAI really needs to win.
I'm not sure how to interpret this line. By "violation of copyright" are you referring to the ideal copyright law that's in your mind but doesn't exist yet, or are you referring to the copyright statutes and case law that are currently in effect? And in your opinion is OpenAI violating the latter, the former, or both?
> The only claim under California's unfair competition law that was allowed to proceed alleged that OpenAI used copyrighted works to train ChatGPT without authors' permission. Because the state law broadly defines what's considered "unfair," Martínez-Olguín said that it's possible that OpenAI's use of the training data "may constitute an unfair practice."
Yeah agreed, and also the general direction this is heading in feels a bit strange to me - we can debate the merits of current copyright law - but the indisputable fact is that the training material came from *pirated* copies of these authors' copyrighted works.
If we are to believe the much used arguments against piracy that we’ve been fed over the last 30+ years (mainly by deep pocketed media companies) then piracy is stealing and if you take stolen material and use it to produce a profitable commercial service then surely that’s a clear cut case in favour of the copyright owners?
Otherwise WTF have we been prosecuting individuals for all these years for pirating movies and music etc?
And here’s the injury: the authors did not get paid for their copyrighted work. Whether you agree with the law or not, OpenAI at the very least should have bought a copy of every book they trained on - and that doesn’t even get into the question of whether they even had the right to train on that material even if they had paid for it. They didn’t do that. They used stolen copies instead.
I don’t understand how it could go any other way. I do get the judge’s reluctance to decide that every bit of output of ChatGPT constitutes copyright infringement - that’s much harder to argue.
I’m not passionately against OpenAI - I pay them money - but I am concerned about this “training material” sourcing issue just being swept under the rug.
AI advancements clearly do impact artists, authors and other creators of works… they’ll be even more screwed if we just throw copyright law completely out of the window at the same time.
Fair use is non-transitive: you reviewing a pirated copy of a movie can be fair use even if that copy isn't. If training on copyrighted images is fair use then it doesn't matter how you got those images. Think about it this way: if the opposite were true, then being able to review a movie would be a privilege you have to pay for by buying the movie, rather than just something you can do because the 1st Amendment exists.
A concrete example of this is Google Images. Image search is fair use, even though they index shittons of infringing images.
The judge isn't going to touch the "output is infringing" argument mainly because the authors didn't actually connect the dots between their work and a specific output. That's an argument that would stick if you were suing a user of OpenAI's services, not OpenAI themselves. "Output is infringing" is going to be part of the fair use analysis anyway (specifically, the market substitution factor), so this extra claim is superfluous.
> Otherwise WTF have we been prosecuting individuals for all these years for pirating movies and music etc?
We don't. Nobody got prosecuted for downloading movies. Because that isn't illegal.
What's illegal is distributing copies to other people.
It's OK though. It's a common misconception that internet piracy is illegal.
Only the distribution part has ever been successfully prosecuted.
> They didn’t do that. They used stolen copies instead.
Using stolen copies isn't illegal. It is completely legal to use other people's work without their permission. Distributing copies is the only part that is illegal.
> if we just throw copyright law completely out of the window at the same time.
Nothing is being throw out the window. The facts that I have described have always been true.
You're conflating different matters here by merging together "legal/illegal" and "successfully prosecuted" - in general, there is a big class of things prohibited by law which can't and won't ever be prosecuted by the state, as they aren't felonies or misdemeanors but only justify civil claims for compensation, and only if the other party wants to sue and cares to put in the effort (and money) to actually do so. I think it's not appropriate to call all the latter scenarios "legal".
> You're conflating different matters here by merging together "legal/illegal" and "successfully prosecuted"
There isn't any difference between these two things.
If something has never and will never be enforced then it is just words on a piece of paper.
The only way that words on a piece of paper turn into something that actually matters is through enforcement. So the point stands.
It doesn't matter what your interpretation of the words on a piece of paper are if no judge or jury has ever agreed with you and will never agree with you, as evidenced by a successful case.
"not prosecuted" doesn't mean "not enforced", it means "not enforced by the state". Companies can successfully enforce copyright infringement and sometimes do so - there's a world of difference between "if I don't attract too much attention, it's very unlikely they'll come after me" and "what I'm doing is legal".
Well then you can reinterpret everything that I said previously to instead be "nobody has ever been either prosecuted by the state or successfully sued civilly for the downloading part".
I didn't realize that the problem that you had with my post was not with the clear substance of it, and was instead with the dictionary definition and usage of one single word.
But if that one singular word was the issue then I am happy to give this multi sentence clarification even though I think my point was clear and obvious from the beginning.
So my original point stands once I have corrected the slightly incorrect usage of one single word.
But feel free to show an example of a single person ever being successfully sued for just the "downloading" part.
(Civilly! Not by the state, but by a company! Important clarification here; I don't want you to misinterpret, because apparently that one word was a big deal and a huge misunderstanding.)
Sure, it was quite widespread in the early 2000s, with tens of thousands of people sued. While the main poster cases like the $675,000 award in Sony vs Tenenbaum also included assertions of distribution (AFAIK not proving any specific upload, only "making available"), there are cases where only downloads were asserted, such as Cassi Hunt (https://www.acslaw.org/?post_type=acsblog&p=3005) and others. And the key part is the many thousands of people who settled for ~$3000 each, which I'd count as "successfully suing" even if the case never went to a court, as that acts as mass enforcement through civil means that has at least some impact on how people behave.
In all the trial cases which went through the courts (e.g. Sony vs Tenenbaum), the downloading part was also judged to be infringing activity; it's just that those defendants did both downloading and [offering for] distribution.
So then you could have instead just said, "I agree with you completely, 100%, that there has not been a single case in the history of US copyright law where a judge has ruled against someone who only did the downloading part".
I am glad after half a dozen posts I was finally able to get you to agree to this.
You could have just said that from the beginning really. I am happy that you agree with me on my main point!
Of course it is illegal in many countries to download pirated movies. And the USA put a lot of pressure on countries where it isn't illegal - I know that well enough, in my country it was and is legal.
It’s not just distribution. Making unauthorised reproductions of copyrighted work is also copyright infringement.
I take your point that prosecutions are focused more on the sharers than individual downloaders, but OpenAI still infringed copyright by making unauthorised reproductions of the authors’ works (by downloading pirated copies of them).
I imagine you'd still be liable as some sort of accomplice? Unless you had a reasonable belief that you obtained a legal copy, or had the copy forced onto you somehow.
The piracy argument can be fixed by OpenAI buying one copy of each work. The overall question of whether they're allowed to train on copyrighted material without permission seems much larger and more interesting.
Can you even buy a digital copy of most recent work that doesn't come with Terms & Conditions attached? A physical book is one thing but it's really hard to buy digital media without agreeing to strict terms, often allowing access to the media to be revoked post-sale (what would that even mean after you use it to train a model?)
I've assumed for a while that avoiding the nightmare of all those different agreements was one of the main reasons they chose to pirate the works instead.
It may not be that easy. We may find that (ironically, in much the way tech itself has made popular) authors/IP creators aren't willing to sell "machine training rights" to their work - a concept that could find itself magicked into existence by artists/publishers.
Legally if they sell it, they no longer own it and can't determine how it is used.
If they license it, that's a different story (they can make a claim of how I breached terms of use, still no copyright issue though).
They have absolutely no say in how I use the book they sold to me.
Only if I reproduce that book does copyright law come into play; the actual physical reproduced work is its sole concern, nothing else (my use of the book I bought from you is of no concern).
All it takes is a EULA page in the book they sell you to remedy that: a practice well trod by the tech industry to establish licensing in place of the assumed first-sale relationship.
I find it naive to simply assume that publishers won't play tit-for-tat with those trying to indirectly monetize their output. I had hoped that a return to first-sale sanity might win the day, but as of late I dread we're briskly walking down a path where no one will be "selling" in the conventional sense anymore. Not with the stakes at hand.
You're 100% right and raise a good point. It would only work with digital goods though. At some point people seriously have to realise copyright is just a tool with pros and cons, a balancing act between differing interests. I agree the principle of first sale should apply even to digital goods. There's some hope in the EU that this is on occasion the case, but much room for improvement everywhere.
If learning from a purchased, copyrighted work is illegal, colleges are in real trouble. Textbook publishers will be thrilled though: this book is $200 to read, but you need an additional license to learn anything from it.
Learning and then reproducing parts of a work already is illegal, depending on context.
Also, “learning” here isn’t the same thing as what college students do. For one thing, you can’t copy-paste a college student’s whole brain. You can’t own and sell their brains (well, ah, you know what I mean—not the organ trade). For another… it’s simply not the same thing. It might be, we suppose, similar to some parts of how human learning works, but it’s plainly not identical and may not be especially close. We use the same word for what LLMs do because it’s a convenient and useful-enough analogy, but that doesn’t mean we can prove anything else simply based on our having re-used the word “learn” here.
“That’s learning, this is learning, so all things that apply to one apply to the other”—no, that doesn’t follow, it takes more than that.
Only the reproduction is judged as to whether it is too close to the original (the physical reproduced item, not the reproduction process or concept).
Nothing else is subject to copyright.
Copyright protection does not extend to ideas, procedures, processes, systems, methods of operation, concepts, principles, or discoveries. Copyright protects only the expression of an idea, not the idea itself.
I have a more general question. Say I read a science fiction book which has descriptions of some futuristic technologies. I get inspired by it and spend a lot of my time and energy becoming an expert in the required engineering and technologies, and invent a machine/process to make that futuristic technology a reality. If I attempt to commercialize my work, can I get sued for copyright infringement by the author(s) of the science fiction book that originally described this technology?
I believe the actual law is the clearest answer to your question:
US copyright law, Subject matter of copyright 102.(b) "In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work." (https://www.copyright.gov/title17/92chap1.html)
Quite explicitly, copyright law doesn't grant the author any exclusive rights to the ideas expressed in the work, just on the particular creative expression.
Copyright doesn't protect ideas, only specific texts and images.
In general there are copyrights, trademarks, design patents and utility patents, all protecting different things under different conditions. They are not interchangeable.
Normally that's the case, even though at least one such book was taken down based on a (rather dubious, imo) claim of violated rights in derivative works. It was never a verbatim rewrite nor a trademark violation. The Dutch court even proceeded to make a list of similar ideas between the two books, which I think is explicitly disallowed under the US version of copyright law.
I'm pretty sure this is allowed as transformative use (in the USA), unless the science fiction book was full of patented inventions still covered by patent law, in which case you may no longer be in the clear.
It's completely unsurprising that the entire train of extremely harsh, heavy-handed and heavily applied arguments about using pirated works and copyrighted materials in any way without the holder's permission gets so commonly applied to ordinary people and small players of any kind, only to be swept away in legalese when creators themselves try to use them against a large, well-connected industry with deep pockets and lots of lobbying weight.
Suddenly those claiming copyright have become unreasonable, and legal contortions of all kinds appear out of the blue to show why there's little basis to their claims. Never mind that much of the learning material for the LLMs of companies such as OpenAI is explained only opaquely and its legal status (pirated or not, fair use or not, etc.) is ambiguous at best. Considering that these uses of such material involve a huge load of obviously commercial aims, it's laughable how biased the legal system has been in their favor so far.
I personally don't favor heavy-handed copyright laws in any context, but if they're going to be applied with so much sanctimony by corporations, academia and governments, then at least the application should be even-handed, instead of so blatantly loaded with apparent exemptions.
While I'm not optimistic about AI in general, if they're going to train LLMs I think they ought to use good data, and published books may on average be better for that than scraping the web.
Relatedly, I wonder how many humans have ever learned anything from pirated ebooks --- in some countries, I bet that number is close to 100%.
And they can bloody well pay for that like everyone else. Get a license for the content that feeds the machine that makes you money. It's that simple.
How is this different from the authors of (business book) seeking royalties from (bigtech CEO) because he/she ingested knowledge (training) from this book? Humankind needs to build on top of each other’s knowledge to keep evolving.
Well the biggest difference is that we're not talking about people in this case. We can actually just discriminate against computer programs, they have no rights and they don't even have the agency to act on their own behalf. They're just tools.
>Humankind needs to build on top of each other’s knowledge to keep evolving.
I agree, but that's never stopped copyright law in the past. What's frustrating is that copyright law is suddenly up for debate now that the copyrights in question predominantly belong to individuals rather than large corporations, and what's really frustrating is that it seems they want to carve out an exemption for this one specific case instead of reforming copyright law.
You should be more clear about the topics of legislation in question.
In matters involving internet service providers in the US, the Democratic party in Congress generally doesn't bother to propose internet laws either way, but the Republican party tends to strip away the FCC's regulatory authority and actively oppose attempts to bring a fraction back [1][2].
> Arguing that OpenAI caused economic injury by unfairly repurposing authors' works, even if authors could show evidence of a DMCA violation, authors could only speculate about what injury was caused
That's an interesting angle I hadn't seen coming, but it makes sense: the system isn't reliable enough to output a book proper, so you'd never use it to read a book (you can't ask it to output Lord of the Rings page 47... yet), and summaries were always okay to post as far as I know
So if I post excerpts of books that neither I nor the querier has a license for when they guess/engineer a search query, even if that (with lots of effort) can amount to the whole book, that's apparently not causing any economic injury to the author. It's not ruled to not be copyright infringement yet, so maybe you can force removal from the market but you can't be awarded damages as I understand it?
Separately, I'm a bit surprised the authors haven't done much in the way of discovery and just alleged baselessly that OpenAI's training process removed copyright notices. The judge says it's unsubstantiated and there's even counter-evidence. Wouldn't it be a simple matter to subpoena a list of works/sources that were used and under which license? That would also be interesting to learn for others working in the field
> Arguing that OpenAI caused economic injury by unfairly repurposing authors' works, even if authors could show evidence of a DMCA violation, authors could only speculate about what injury was caused
Funny that inability to prove speculated injuries didn't stop judges from ordering people to pay hundreds of thousands in damages in piracy cases
I think the real problem with the lawsuit is that it doesn't financially matter. If I want to read a Stephen King novel I'm not going to ask chatgpt for it. Why get a mangled version when you can get a proper copy?
What if I merely demonstrate knowledge that means I've memorized the book? For example, being able to answer yes/no questions about what's on a particular page?
The point I'm trying to make here is that in this situation I have a representation of the entire book in my brain. Haven't I copied it into my neurons?
If we're technical, there is a substantial legal difference between your brain and an AI system, because the criteria for what counts as a copy are defined (in the US) as: "'Copies' are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. The term 'copies' includes the material object, other than a phonorecord, in which the work is first fixed." All the legal precedent declines to treat any representation of something in your brain as "the work being fixed", so that isn't a copy, and anything that applies to copies doesn't apply to your memory, but does apply to any computer memory or any future technical method of making that representation.
So analogies to the brain aren't appropriate, because human memories are special and separate in the eyes of law, and if we'd make a machine that does literally exactly the same thing as your brain does, it will still NOT have the same legal treatment as your brain; there is no legal principle that it should get equal or similar treatment.
I don't see why brains don't satisfy the legal definition, if I have truly memorized the book, unless you are trying to claim that brains aren't material objects. Encoding a book by tattooing it on my back would count, wouldn't it? Or by encoding it in DNA and injecting it into some of my cells. Why would using the brain be different?
The definition of a "copy" requires it to be fixed in a tangible, durable medium. This law has explicitly chosen to make a strong, major black&white distinction between intangible things and things that are fixed e.g. on paper. According to legal precedent, memories in human brains are considered these intangible things and not considered fixed in a tangible medium, ergo, those do not count as "copies" as far as copyright law is concerned.
Tattooing on your back or a fixed encoding would IMHO indeed probably count as fixed in a tangible medium, but that's not really relevant. And there is some argumentation about why using the brain is different (e.g. that it doesn't permit unchanged reproduction of the information stored; memories tend to be fleeting and incomplete), but again, that's not relevant and any flaws in that argument don't really matter.
The key point is that the argument is settled. It doesn't matter what argument you or I could make about whether using the brain is different or the same; no one cares, and we have no right to re-try that question. The discussion has been heard by the relevant courts and it is over: it is now part of settled law. Even if you come up with a better argument, or the original argumentation was shoddy, the discussion is finished, and your opponent can effectively demand that the judge ignore your argument and apply the existing case law instead.
Since precedent has legally "accepted" that memories do not count as fixed in a tangible medium, that sticks; it effectively becomes part of the legal definition of what "fixed in a tangible medium" axiomatically means. Any appeal to biology/physics/whatever is pretty much irrelevant. If future science produced incontrovertible evidence that memories really do store a fixed, durable, unchanged copy, the literal equivalent of tattooing a copy on your back, it wouldn't matter; all it would mean is that the colloquial understanding of "fixed in a durable medium" (which would then include memories) had diverged from the legal definition, which would not.
If you want to apply logic: in this context, "memories don't count as fixed in a durable medium" is an axiom. Changing it is possible by passing a new law, but not purely by arguing that memories should be treated differently; a key common-law principle is that we assume earlier cases were correctly decided and don't re-litigate them with new arguments.
When ChatGPT gets agency regarding what goes into training data and what doesn't, gets paid minimum wage for the work it does, and can be held liable for violations of the law, your question will be very relevant.
Do you profit from people testing your ability to memorise the book and give them the ability to substantially recreate the original work and act in competition to the original author?
Thought question; not entirely related, but if you want to go down that route it actually is.
Suppose I generate some media in, say, Photoshop, and send you a JPEG representation of it. You then distribute a PNG copy of the image without a license. Have you violated copyright law?
At what point does an LLM have enough parameters that it is effectively just a compressed version of its training data?
How about deduplicated storage? Is it a violation to distribute an image stored there and reproduced using an index of some sort?
If I put data into a thing (let's call it training, and let's call the thing a model), and then request the data out of it and get back what is perceived as an exact replication, did I create a copy?
Does it matter if we call the thing a hard drive instead?
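The deduplicated-storage question above can be made concrete. This is a toy sketch (hypothetical names, Python standard library only) of a content-addressed store: identical data is stored only once, and a short index (its hash) is enough to reproduce the exact bytes, so "distributing the index" and "distributing the work" become hard to tell apart.

```python
import hashlib

# Toy content-addressed store: blocks are deduplicated by hash,
# and an "index" (the hash) is enough to reproduce the exact bytes.
store = {}

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data          # identical data is stored only once
    return key                 # the index, not the data itself

def get(key: str) -> bytes:
    return store[key]          # bit-for-bit reproduction

image = b"pretend these are JPEG bytes"
key_a = put(image)
key_b = put(image)             # duplicate upload: same key, no new copy
assert key_a == key_b
assert get(key_a) == image     # handing someone get(key) hands them the work
```

The "index" here is 64 hex characters regardless of how large the work is, which is exactly why the hard-drive analogy in the question has some bite.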
The human brain has about 100 billion neurons; is it just effectively compressing everything it's ever seen? I don't actually know, but my feeling is no.
If I ask you to draw Mickey Mouse, you can probably produce a very good representation of him. If I asked you to write the script of The Matrix, assuming you've seen it, I suspect you'd get all the plot points down and major quotes even if it has been years since you've seen it. Are you creating a copy? Absolutely! Don't distribute either of those things without a license. But does the fact that you are capable of making a copy of a thing when asked mean that you've violated copyright way back when you watched the Matrix? Is there a copy of the Matrix or Mickey Mouse in your brain?
I will take the strong position that neither our brains nor LLMs contain copies of data in the way that is a violation of copyright. But both are equally capable of generating copyright violating materials.
However, one can memorize something like a book, particularly if one uses known "memory palace" techniques. Some individuals are particularly good at this.
Of course it matters. If I put the right data into a paintbrush and canvas, I'll reproduce copyrighted works too. Nobody confuses the model for a Picasso any more than they confuse a paintbrush for a painting. The law may rule differently for one reason or another, but these are obviously different categories of things.
The reason a jpeg is copyright infringement has more to do with its express purpose being to allow the user to view that copyrighted work. If it were bundled in a program that just allowed you to view the color histogram of famous works (and the author had the right to view those photos and didn't think it important to save bandwidth by precomputing those histograms) it likely wouldn't be infringement. If it were found out that people were downloading that program just to rip the bundled images out then the author might get in hot water anyway. Your distributed file example is similarly probably infringing.
The model has other capabilities, and I think it would be hard to argue that its purpose is copyright infringement (which is separate from what you seem to be doing, which is arguing that the model itself is infringement -- both a little easier and harder to argue because it pushes more on philosophical distinctions than statements of fact about how people are using a thing).
Separately, there are new classes of concerns these models introduce. We don't have to abuse copyright law to take the time to consider those effects. E.g., should voice cloning be allowed and to what degree? It's already illegal in a lot of contexts (fraud, ...), but we don't currently have many rights when it comes to our innate physical characteristics. To the extent those rights exist, you often have to waive them for basic services (e.g., a nontrivial fraction of leases and jobs stipulate that you give a permanent, <much other legal jargon>, license for them to use your image for nearly any purpose, including falsely characterizing your approval of the property in advertisements and marketing materials -- unless covered under libel/slander and a couple other carveouts they're probably not punishable). Can studios just refuse to hire voice actors for more than one session? Is that good for society? Can I clone passers-by on the street to play in my commercial? These are new enough capabilities (at least at their current scale) that they're not very well legislated, and I wouldn't be surprised if we saw an expansion of something like "moral rights" to cover them.
> I then send you a JPEG representation of said media. You then distribute a PNG copy of the image without license.
The purpose and function of an image format is to represent a single piece. The representation can vary in accuracy and can be changed to another representation (JPEG to PNG, or one JPEG implementation to another), but the underlying piece is supposed to be the same in intent, and a majority of the time it is the same in practice. Open the JPEG image using any program that implements the JPEG specification and you will get the same image as with any other such program. The same would apply to encrypting the image, but not to a cryptographic hash (designed to be one-way). Decrypt the encrypted image and you'll get something that is the same as the original image both in intent and in technicality; if you don't use a tool whose purpose is decryption, you almost certainly won't get the image back. This applies only partially to an AI: even if you don't try to use the AI to reproduce an existing work, you still might get a reproduction of one, and the probability of such a result varies greatly depending on the prompt.
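The encryption-versus-hash distinction can be sketched in a few lines. Below, a toy XOR "cipher" stands in for real encryption (variable names are hypothetical): the reversible transform preserves the work exactly, while the hash produces a fixed-size digest from which the original bytes cannot be recovered.

```python
import hashlib

data = b"pretend these are image bytes"

# A reversible transform (here, a toy XOR "cipher") preserves the work:
key = 0x5A
encrypted = bytes(b ^ key for b in data)
decrypted = bytes(b ^ key for b in encrypted)
assert decrypted == data        # the underlying piece survives intact

# A cryptographic hash is designed to be one-way: the digest is a
# fixed-size summary, not a representation you can invert.
digest = hashlib.sha256(data).hexdigest()
assert len(digest) == 64        # same length no matter how big the work is
```

The encrypted bytes are arguably still "the work, fixed in a medium"; the digest arguably is not, which mirrors the intent/technicality distinction above.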
The purpose of an LLM isn't to reproduce one or more works - or rather, sections of expression - in the dataset. The purpose of an LLM is to produce speech similar to a human's response. The purpose of an image generator model is to produce images that have the characteristics specified in the prompt. In order to produce a copy of something in the training set, the prompt usually needs to reference a specific work, a related person (e.g. an author), a related work, or an attribute that is strongly associated with a particular work/author. Regarding the latter, there was a Hacker News post (that I can't find because I forgot the post title) from a month or two back about an AI image generator that produced images of the robot C-3PO from Star Wars even though the prompt was about "space" and "robot" with no reference to Star Wars. My interpretation is that the AI model had a strong association between space robot and Star Wars because C-3PO is (I speculate) one of the most common space-related robots that people talk about online. Or perhaps, the Star Wars works in the training set made up a majority of the works associated with both "space" and "robot". But I digress.
The likelihood that an AI produces a copy of existing expression depends on the prompt. A user who encounters such a case can avoid liability by not using and not sharing the output, and otherwise the output might not substitute for the original expression for the user's purposes. So in most cases I think liability for the infringing outputs of an AI model should fall solely on the prompter. The liability that falls on the developer of the AI model doesn't have to be binary. There could be heavier penalties on the developer for an AI that is more likely to reproduce C-3PO when given a vague prompt such as "space robot", lesser penalties for a model that only produces C-3PO when the prompt is at least as specific as "space war robot", and lesser or no penalties for a model that only produces C-3PO from a prompt as specific as "golden space robot". The threshold for a "vague" prompt would vary; for a prompt such as "painting of melting clock" I would excuse a partial reproduction of Salvador Dalí's The Persistence of Memory.