> Sorry I did not understand that :-) You seemed to be saying that the differenc...

palata · on Aug 9, 2023

Right. Yeah, I did not express myself clearly, sorry :). You were saying "how is it different other than X and Y?", and I wanted to say that X and Y are already enough for me to consider them different.

I am actually on the side that sees LLMs as a big problem for copyright, and I don't want my code and blog posts to be used in their training datasets without my consent. To me, at this scale, it's not fair use. IMO it's a bit like Facebook saying that it is fair use to leverage metadata about their users, because "someone who sees you in a public space talking to a friend knows that you are talking with that person, and it is the same for Facebook on social media". My problem is not that Facebook knows that I just sent a message to a friend, but rather that they know who writes to whom and when, at scale.

Similarly, my problem is not that somebody could read my blog post, learn from it, and write another blog post. My problem is that LLMs automatically train on whatever written material they want from the Internet, at scale, and without acknowledging that all that material has a lot of value (and is copyrighted).

I think fair use should somehow consider the scale.