I care much more about allowing my content to be used at all, despite any privacy concerns. I simply don't want one single AI model to train on my content.
By that logic, when environmental regulations prevent toxic waste from being dumped in your backyard, the response would be "create your own country, then".
IMO the bigger concern is that this data is not just used to train models. It is stored, completely verbatim, in the training set. They aren't pulling from PDFs in real time during training runs; they're aggregating all of that text and storing it somewhere. And that somewhere is prone to employees viewing it, leaking to the internet, etc.
> This is partly true but less and less every day.
Isn't this like encryption, though?
I'm fairly sure the cryptography community's position is essentially: if someone holds a copy of your encrypted data long enough, the likelihood that they'll eventually be able to read it approaches 100%, regardless of how strong your current security standard is.
Who could possibly guarantee that whatever LLM is safe now will remain safe over the next 5, 10, or 20 years? And anyone who does guarantee that is lying.
I think it’s different, unless you believe LLMs have broken theoretical limits on compression. I don’t see how an LLM with 1T 16-bit parameters, which is only about 2 TB of weights, could losslessly encode 100 PB of data.
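The arithmetic behind that claim can be sketched quickly. This is a back-of-the-envelope check (the 1T-parameter and 100 PB figures are the ones from the comment, not measurements of any real model):

```python
# Back-of-the-envelope: could 1T 16-bit parameters losslessly store 100 PB?
params = 1e12                 # 1 trillion parameters (from the comment)
bytes_per_param = 2           # 16 bits = 2 bytes
weights_bytes = params * bytes_per_param   # total size of the weights
training_bytes = 100e15                    # 100 PB of training data

ratio = training_bytes / weights_bytes
print(f"Weights: {weights_bytes / 1e12:.0f} TB")               # 2 TB
print(f"Lossless compression ratio needed: {ratio:,.0f}:1")    # 50,000:1
```

A 50,000:1 lossless ratio on general text is far beyond anything information theory allows for typical data, which is why verbatim storage of the whole training set inside the weights isn't plausible; the separately retained raw dataset is the real leakage risk.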