It's been kind of enlightening seeing leadership at $BIGCORP push AI coding solu...

abeppu · on Aug 2, 2024

I think a weird irony is that the model's inability to know when its response is good is both the reason the output is often not useful and the reason that, when it is very useful, the vendors can't capture the value efficiently.

Like, I was encouraged to use AI assistants more after a colleague saved a bunch of time debugging some issue where Copilot (IIRC) immediately identified an obscure issue. Probably in that case, we should have been willing to pay a decent amount for that one valuable response -- it may have saved a significant amount of engineer time. But I've also had Copilot give me stuff that isn't even syntactically correct, or had Copilot Chat make up a newer version of a language and tell me to use it. Cases where it's a waste of time are worth negative dollars.

whythre · on Aug 2, 2024

Sounds like a good ol’ fashioned case of confirmation bias. ‘Look at this one good suggestion the AI made! Wow!’… all while ignoring the many unhelpful outputs.

abeppu · on Aug 2, 2024

I don't think it's just confirmation bias where we ignore some bad results (which presumes we know up front that they're bad) -- I think because these models are specifically RLHFed to learn what we think looks good, you can't judge quality just by looking at the outputs and deciding whether they seem plausible. You actually have to do the follow-up of seeing whether they're correct/useful, which may be much more involved.

E.g. to judge the quality of a particular coding example, one may need to have/create a project in which that code would be used, install the actual libraries it invokes, create data for it to operate on, etc. In cases where the assistant was basically giving me wrong information about Scala 3 metaprogramming capabilities, I could only determine it was BS by actually trying to compile the program (in the context of a project with an sbt config that pulls in some relevant libraries, sets appropriate flags, etc).

But of course the model doesn't do this, the high-level exec doesn't do this, and so "these examples look great!" can be an honest evaluation, based on the inability to actually meaningfully validate.

mgh2 · on Aug 2, 2024

Anecdote: https://news.ycombinator.com/item?id=41117540

jarsin · on Aug 3, 2024

Love the example of the guy using an LLM all day to make a simple CRUD app. Basic auto-generated CRUD apps have existed forever. I still remember showing my boss my Django admin built in a day back in 2005. He told me to tell no one about this because he was afraid he would have to lay off devs.