RL trains on responses sampled from the model you're training, which is not the one you have in production. It can't directly use responses that previous models generated.
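To make the point concrete, here is a minimal toy sketch of an on-policy update. It assumes a REINFORCE-style policy gradient over three canned responses, which is an illustrative choice and not necessarily the algorithm the text has in mind. All names (`reinforce_step`, `reward_fn`, `train`) and the reward itself are hypothetical. The thing to notice is that every batch is sampled from the current `logits`, that is, from the model being trained.

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, rng, batch_size=64, lr=0.5):
    """One on-policy update: sample from the CURRENT policy, then step."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(batch_size):
        # The response comes from the policy being trained, not from a log
        # of what an earlier model produced.
        a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        r = reward_fn(a)
        # Gradient of log pi(a) w.r.t. the logits is onehot(a) - probs.
        for i in range(len(logits)):
            grad[i] += r * ((1.0 if i == a else 0.0) - probs[i])
    return [l + lr * g / batch_size for l, g in zip(logits, grad)]


def train(steps=200, seed=0):
    """Toy setup (hypothetical): response 2 earns reward 1, the others 0."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    reward_fn = lambda a: 1.0 if a == 2 else 0.0
    for _ in range(steps):
        logits = reinforce_step(logits, reward_fn, rng)
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

Because the sampling distribution changes after every step, a batch drawn from an earlier set of logits no longer matches the policy being updated, which is why logged responses from a previous model can't simply be dropped into this loop.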