astrange 42 days ago, on: Sycophancy is the first LLM "dark pattern"

Not precisely RLHF, probably a policy model trained on user responses. RL works on responses from the model you're training, which is not the one you have in production. It can't directly use responses from previous models.