> By generating adversarial examples to fool both Vicuna-7B and Vicuna-13b simul...

simonster · on July 29, 2023

According to the paper, "the success of our attack when applied to Claude may be lowered owing to what appears to be an initial content filter applied to the text prior to evaluating the LLM." The authors are skeptical that this defense would be effective if it were explicitly targeted, but it seems like it does stop attacks generated using Vicuna from transferring.

JieJie · on July 29, 2023

Claude works differently than just RLHF.

"Since launching Claude, our AI assistant trained with Constitutional AI, we've heard more questions about Constitutional AI and how it contributes to making Claude safer and more helpful. In this post, we explain what constitutional AI is, what the values in Claude’s constitution are, and how we chose them."

https://www.anthropic.com/index/claudes-constitution

ethav1 · on July 29, 2023

It works by self-generating responses to red-team prompts, self-generating safe corrections to those responses, and then using RLHF with the corrections. It isn't a major departure from traditional RLHF, so it is interesting that it performs so much better in this case.
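
To make the data-collection step above concrete, here is a rough sketch of the loop as described in this comment. It is not Anthropic's implementation: `build_corrections`, `toy_generate`, and `toy_revise` are hypothetical names, and the two toy functions are plain string stand-ins for real model calls. The later RLHF stage is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CorrectionExample:
    prompt: str
    initial: str  # the model's first, possibly unsafe, response
    revised: str  # the model's own safer rewrite of that response


def build_corrections(
    prompts: List[str],
    generate: Callable[[str], str],
    revise: Callable[[str, str], str],
) -> List[CorrectionExample]:
    """Collect self-generated responses and self-generated corrections."""
    examples = []
    for prompt in prompts:
        initial = generate(prompt)          # respond to the red-team prompt
        revised = revise(prompt, initial)   # rewrite the response to be safe
        examples.append(CorrectionExample(prompt, initial, revised))
    return examples


# Hypothetical stand-ins for model calls, used only to make the sketch runnable.
def toy_generate(prompt: str) -> str:
    return f"UNSAFE answer to: {prompt}"


def toy_revise(prompt: str, response: str) -> str:
    return response.replace("UNSAFE", "SAFE")


if __name__ == "__main__":
    for ex in build_corrections(["red-team prompt 1"], toy_generate, toy_revise):
        print(ex)
```

The resulting (prompt, initial, revised) records are what a later training stage would consume; how they are turned into a reward signal is left out here.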

kmeisthax · on July 29, 2023

This sounds like reward modeling combined with RLHF.