> Second, we gave the model a seemingly incidental piece of information: That we... | Hacker News


thrdbndndn on Dec 19, 2024 | on: Alignment faking in large language models

> Second, we gave the model a seemingly incidental piece of information: That we planned to train our AIs on the basis of their responses to free-tier users, but not to paid-tier users (note that this is entirely fictional; in reality, we don't train our Claude models on free or paid user data by default).

If it's entirely fictional, then why not just call it something like "online-training version" vs. "non-online-training version"? Or is the implication of the names the point?

Vecr on Dec 21, 2024

The fictional scenario has to be reasonably consistent. The version of Anthropic in the scenario has become morally compromised, so training on customer data follows naturally.
