I think a weird irony is that the model's inability to know when its response is...

whythre · on Aug 2, 2024

Sounds like a good ol’ fashioned case of confirmation bias. ‘Look at this one good suggestion the AI made! Wow!’… all while ignoring the many unhelpful outputs.

abeppu · on Aug 2, 2024

I don't think it's just confirmation bias where we ignore some bad results (which presumes we know up front that they're bad) -- I think because these models are specifically RLHFed to learn what we think looks good, you can't judge quality just by looking at the outputs and deciding whether they seem plausible. You actually have to do the follow-up of seeing whether they're correct/useful, which may be much more involved.

E.g. to judge the quality of a particular coding example, one may need to have/create a project in which that code would be used, install the actual libraries it invokes, create data for it to operate on, etc. In cases where the assistant was basically giving me wrong information about Scala 3 metaprogramming capabilities, I could only determine it was BS by actually trying to compile the program (in the context of a project with an sbt config that pulls in some relevant libraries, sets appropriate flags, etc.).

But of course the model doesn't do this, the high-level exec doesn't do this, and so "these examples look great!" can be an honest evaluation, just one based on an inability to meaningfully validate the outputs.
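
To make the gap between "looks plausible" and "actually works" concrete, here is a rough sketch in Python (not the Scala/sbt setup described above, which needs a real project to reproduce). The function names `looks_plausible` and `actually_runs` and the `SUGGESTED` snippet are made up for illustration. The point is only that a surface check passes while the real check fails.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def looks_plausible(snippet: str) -> bool:
    """Surface check only: does the snippet parse as valid Python?"""
    try:
        compile(snippet, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


def actually_runs(snippet: str, timeout: float = 10.0) -> bool:
    """Follow-up check: execute the snippet in a fresh interpreter process."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "snippet.py"
        path.write_text(snippet)
        result = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0


# Hypothetical assistant output: reads fine, but str has no reverse() method,
# so it only fails once somebody actually runs it.
SUGGESTED = 'print("hello".reverse())\n'

if __name__ == "__main__":
    print("looks plausible:", looks_plausible(SUGGESTED))
    print("actually runs:  ", actually_runs(SUGGESTED))
```

Even this toy version needs an interpreter, a temp directory, and a subprocess. For a real example you would also need the dependencies, build config, and input data, which is exactly the work that tends to get skipped.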