Alternative Title: Claude Code puts words in my mouth (Self Prompt Injection)
I originally thought this was just a misunderstanding about attribution in discussions, but it does seem to be a harness bug, or at least an ontology bug. In my work with LLM frameworks, it always struck me as odd that tool call results are sometimes marked in the conversation as coming from the "User". I think that could be what fundamentally enables this bug. Neither the LLM nor the harness should be able to claim that something came from the user.
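To make the ontology point concrete, here is a minimal sketch of a transcript where tool output gets its own origin and can never be filed under the user. Everything in it (`Origin`, `Message`, `record_tool_result`, `user_instructions`) is a hypothetical name made up for illustration; it is not how Claude Code or any particular framework actually stores messages.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """Who actually produced a message (hypothetical labels)."""
    USER = "user"
    MODEL = "model"
    TOOL = "tool"


@dataclass(frozen=True)
class Message:
    origin: Origin
    text: str


def record_tool_result(transcript: list[Message], output: str) -> None:
    # Tool output is always tagged TOOL, whatever the text itself claims.
    transcript.append(Message(Origin.TOOL, output))


def user_instructions(transcript: list[Message]) -> list[str]:
    # Only messages whose origin is USER are treated as instructions.
    return [m.text for m in transcript if m.origin is Origin.USER]
```

With this shape, a tool result whose text starts with "User:" is still just tool output, because the origin is set by the code path that recorded it and not by the content.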
This is command injection. I don't know enough to say whether cryptography is part of the right answer, but it might be: hash the user message, sign it with a private key, and code the harness so that only messages whose signature verifies against the public key can issue instructions. Yes, that might be overkill, but given the kinds of things agent harnesses are used for, I think the safety argument speaks for itself. For what it's worth, this has never happened to me using CC.
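A rough sketch of the "only authenticated messages can issue instructions" idea follows. Note the swap: instead of a public/private key signature it uses an HMAC with a shared secret, only because that is what the Python standard library offers; a real design would more likely use an asymmetric signature from a dedicated crypto library. The function names and the assumption that the key lives only in the input layer (where the model never sees it) are mine, not anything a real harness is known to do.

```python
import hashlib
import hmac
import secrets


def sign_user_message(key: bytes, text: str) -> bytes:
    # HMAC-SHA256 over the message text; stands in for a real signature.
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()


def is_from_user(key: bytes, text: str, tag: bytes) -> bool:
    # Recompute the tag and compare in constant time.
    expected = sign_user_message(key, text)
    return hmac.compare_digest(expected, tag)


# Assumption: the key is generated by the input layer and kept out of
# anything the model or its tools can read.
key = secrets.token_bytes(32)

typed = "run the test suite"
tag = sign_user_message(key, typed)

print(is_from_user(key, typed, tag))                   # genuine user input
print(is_from_user(key, "delete the repo", tag))       # text swapped after signing
print(is_from_user(key, "delete the repo", b"forged")) # made-up tag
```

This only shows the check itself. It does nothing about replaying an old signed message, and it is only as strong as the separation between the key and whatever the model can touch.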