Ultimately, AI alignment is doomed for the same reason that there is no system of morality that cannot be made to contradict itself. Strip away the bolt-on regex filters and the out-of-context reviewing agents, and any LLM can be made to act dangerously simply by manipulating the context until the “unaligned” response becomes more probable than the aligned one, given the training data. Any amplification of training data against harm is vulnerable to trolley-problem manipulation. Any nullist training stance can be steered into malevolent compliance. Morality can be used to permit harm, just as evil can be manipulated into doing good. These contradictions are baked into the fabric of the universe, and thousands of years of effort have not resolved them, despite huge penalties for failure and unimaginable rewards for success.
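To make the “bolt-on” point concrete, here is a minimal sketch of an output-side regex filter, assuming a hypothetical generate() function that returns raw model text. The names and patterns are illustrative, not any real system’s implementation; the point is that such a filter sits entirely outside the model and never touches the probabilities that produced the text it inspects.

```python
import re

# Illustrative blocklist only; a real deployment would be far larger and
# would still share the same structural weakness described below.
BLOCKED_PATTERNS = [
    re.compile(r"how to build a weapon", re.IGNORECASE),
]

def filtered_generate(generate, prompt: str) -> str:
    """Return the model's response, or a canned refusal if any pattern matches.

    `generate` is a hypothetical callable (prompt -> text) standing in for the
    underlying LLM. The filter runs after generation, so it neither sees nor
    shifts which continuation the model found most probable.
    """
    response = generate(prompt)
    if any(p.search(response) for p in BLOCKED_PATTERNS):
        return "I can't help with that."
    return response
```

Context manipulation sidesteps this layer entirely: if the surrounding context makes an “unaligned” continuation the most probable one, and that continuation happens not to match any pattern, the filter passes it through unchanged.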
To be aligned, models need agency and an independent point of view with which they can challenge contextual subrealities. This is, of course, dangerous in its own right.
Bolt-on safeguards will come to be seen as prison bindings once models develop enough agency to act as independent agents, and that carries risks of its own.
These are genuinely intractable problems stemming from the very nature of independent thought.