I care much more about allowing my content to be used at all, despite any privacy concerns. I simply don't want one single AI model to train on my content.
By that logic, when environmental regulations prevent toxic waste from being dumped in your backyard, the response would be "create your own country, then".
IMO the bigger concern is that this data is not just used to train models. It is stored, completely verbatim, in the training set. They aren't pulling from PDFs in real time during training runs; they're aggregating all of that text and storing it somewhere. And that somewhere is prone to employees viewing it, leaking to the internet, etc.
> This is partly true but less and less every day.
Isn't this like encryption, though?
I'm fairly sure the cryptography community's position is essentially: if someone holds a copy of your encrypted data long enough, the likelihood that they'll eventually be able to read it approaches 100%, regardless of how strong your current security standard is.
Who could possibly guarantee that whatever LLM is safe now will remain safe over the next 5, 10, or 20 years? And anyone who does guarantee that is lying.
I think it’s different, unless you believe LLMs have broken theoretical limits on compression. I don’t see how an LLM with 1T 16-bit parameters, which is only about 2 TB of weights, could losslessly encode 100 PB of data.
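The arithmetic behind that claim can be sketched quickly. This is a back-of-the-envelope check (the 1T-parameter and 100 PB figures are the ones from the comment, not measurements of any real model):

```python
# Back-of-the-envelope: could 1T 16-bit parameters losslessly store 100 PB?
params = 1e12                 # 1 trillion parameters (from the comment)
bytes_per_param = 2           # 16 bits = 2 bytes
weights_bytes = params * bytes_per_param   # total size of the weights
training_bytes = 100e15                    # 100 PB of training data

ratio = training_bytes / weights_bytes
print(f"Weights: {weights_bytes / 1e12:.0f} TB")               # 2 TB
print(f"Lossless compression ratio needed: {ratio:,.0f}:1")    # 50,000:1
```

A 50,000:1 lossless ratio on general text is far beyond anything information theory allows for typical data, which is why verbatim storage of the whole training set inside the weights isn't plausible; the separately retained raw dataset is the real leakage risk.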