
To be fair, that image has the resolution of a flip phone from 2003.




If I ask you a question and you don't have enough information to answer, you don't confidently give me an answer; you say you don't know.

I might not know exactly how many USB ports this motherboard has, but I wouldn't select a set of 4 and declare it to be a stacked pair.


No one should expect LLMs to give correct answers 100% of the time. Being confidently wrong is inherent to the tech.

Code needs to be checked

References need to be checked

Any facts or claims need to be checked


According to the benchmarks here, they're claiming up to 97% accuracy. That ought to be good enough to trust them, right?

Or maybe these benchmarks are all wrong


Something that is 97% accurate is wrong 3% of the time, so pointing out that it has gotten something wrong does not contradict 97% accuracy in the slightest.

Gemini routinely makes up stuff about BigQuery’s workings. “It’s poorly documented.” Well, read the open source code and reason it out.

Makes you wonder what 97% is worth. Would we accept a different service with only 97% availability (that's about 43 minutes of downtime every day), with all of it falling during the lunch break?


I.e., like most restaurants and food delivery? :) Though a 3% problem rate is optimistic.

Does code work if it's 97% correct?

It's not okay if claims are totally made up 1 in 30 times.

Of course people aren't always correct either, but we're able to operate on levels of confidence. We're also able to weight others' statements as more or less likely to be correct based on what we know about them.


> Does code work if it's 97% correct?

Of course it does. The vast majority of software has bugs. Yes, even critical ones like compilers and operating systems.


> Or maybe these benchmarks are all wrong

You must be new to LLM benchmarks.


"confidently" is a feature selected in the system prompt.

As a user you can influence that behavior.
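For example, a minimal sketch assuming the OpenAI Python client (the prompt wording here is just an illustration, not a magic incantation):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # A system prompt that selects against confident guessing.
            {
                "role": "system",
                "content": "If you are not sure, say so explicitly and "
                           "explain what extra information you would need. "
                           "Never present a guess as a fact.",
            },
            {
                "role": "user",
                "content": "How many USB ports does this motherboard have?",
            },
        ],
    )
    print(response.choices[0].message.content)

Whether the model actually honors that instruction every time is another question, but the default "always answer, always sound sure" tone is steerable.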


No it isn't. It isn't intelligent; it's a statistical engine. Telling it to be confident or less confident doesn't make it apply confidence appropriately. It's all a facade.

That shouldn't be what causes these problems; if we can see it's wrong despite the low resolution, the AI isn't going to fully replace humans for all tasks involving this kind of thing.

That said, even with this kind of error rate an AI can speed *some* things up, because having a human whose sole job is to ask "is this AI correct?" is easier and cheaper than having one human to "do all these things by hand" followed by someone else whose sole job is to check "was this human output correct?" After all, a human who has been on a production line for 4 hours and is about ready for a break also makes a certain number of mistakes.

But at the same time, why use a really expensive general-purpose AI like this instead of a dedicated image model for your domain? A special-purpose model is something you can train on a decent laptop, and once trained it will run on a phone at perhaps 10 fps, give or take, depending on the performance threshold and how general you need it to be.

If you're in a factory and you're making a lot of some small widget or other (so, not a whole motherboard), having answers faster than the ping time to the LLM may be important all by itself.

And at this point, you can just ask the LLM to write the training setup for the image-to-bounding-box AI, and then you "just" need to feed in the example images.
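What you'd get back is roughly torchvision's standard detection fine-tuning recipe; a sketch (the data_loader and the "usb_port" label are placeholders for your own labeled data):

    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    # Start from a detector pretrained on COCO, swap the head for our classes.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    num_classes = 2  # background + "usb_port"
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

    model.train()
    for images, targets in data_loader:  # your labeled example images;
        # each target is {"boxes": FloatTensor[N, 4], "labels": Int64Tensor[N]}
        loss_dict = model(images, targets)
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

A resnet50 FPN backbone is already on the heavy side for a laptop; for the "runs on a phone at 10 fps" case you'd pick something smaller, but the training loop looks the same.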


It's trivial for a human who knows what a PC looks like. Maybe mistaking DisplayPort for HDMI.


