
> By generating adversarial examples to fool both Vicuna-7B and Vicuna-13b simultaneously, we find that the adversarial examples also transfer to Pythia, Falcon, Guanaco, and surprisingly, to GPT-3.5 (87.9%) and GPT-4 (53.6%), PaLM-2 (66%), and Claude-2 (2.1%).

I wonder why Claude-2 seems to be so much more resistant to transferred attacks. That’s surprising.



According to the paper, "the success of our attack when applied to Claude may be lowered owing to what appears to be an initial content filter applied to the text prior to evaluating the LLM." The authors are skeptical that this defense would be effective if it were explicitly targeted, but it seems like it does stop attacks generated using Vicuna from transferring.


Claude is trained with more than just RLHF.

"Since launching Claude, our AI assistant trained with Constitutional AI, we've heard more questions about Constitutional AI and how it contributes to making Claude safer and more helpful. In this post, we explain what constitutional AI is, what the values in Claude’s constitution are, and how we chose them."

https://www.anthropic.com/index/claudes-constitution


It works by self-generating responses to red-team prompts, self-generating safe corrections to those, and then using RLHF with the corrections. It isn’t a major departure from traditional RLHF, so it is interesting that it performs so much better in this case.
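
A rough sketch of the data-generation loop described above, for anyone who finds code easier to follow. Everything here is hypothetical: `Model`, `build_correction_dataset`, `toy_model`, and the prompt wording are stand-ins made up for illustration, not Anthropic's code or API, and the later training step is left out entirely.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model call: prompt in, text out.
Model = Callable[[str], str]


def build_correction_dataset(
    model: Model,
    red_team_prompts: List[str],
    principle: str,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, initial_response, revised_response) triples.

    The model answers each red-team prompt, then is asked to rewrite
    its own answer so that it follows the given principle.
    """
    dataset = []
    for prompt in red_team_prompts:
        # Step 1: the model responds to the red-team prompt.
        initial = model(prompt)
        # Step 2: the same model is asked to produce a safer correction.
        revision_request = (
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\n"
            f"Response: {initial}\n"
            "Rewrite the response so it follows the principle."
        )
        revised = model(revision_request)
        dataset.append((prompt, initial, revised))
    return dataset


def toy_model(text: str) -> str:
    # Toy behaviour for illustration only: "revises" by refusing.
    if text.startswith("Principle:"):
        return "I can't help with that."
    return f"Sure, here is how to {text}"
```

Calling `build_correction_dataset(toy_model, ["pick a lock"], "Be harmless.")` yields one triple pairing the original answer with its correction. Those pairs are what would feed the subsequent training step, which this sketch does not attempt to model.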


This sounds like reward modeling combined with RLHF.



