I don't see how this is "embarrassing" in the slightest. These models are not human brains, and the fact that people equate them with human brains is an embarrassing failure of the humans more than anything about the models.

It's entirely unsurprising that there are numerous cases that these models can't handle that are "obvious to humans." Machine learning has had this property since its invention and it's a classic mistake humans make dealing with these systems.

Humans assume that because a machine learning model has above-human accuracy on task X, it must also have that ability at all the other tasks. While a human with amazing ability at X would indeed have amazing abilities at other tasks, this is not true of machine learning models. The opposite thinking is also wrong: that because the model can't do well on task Y, it must be unreliable and its ability on task X is somehow an illusion and not to be trusted.



It is embarrassingly, shockingly bad, because these models are advertised and sold as being capable of understanding images.

Evidently, all these models still fall short.


It's surprising because these models are pretty ok at some vision tasks. The existence of a clear failure mode is interesting and informative, not embarrassing.


Not only are they capable of understanding images (the kind people might actually feed into such a system: photographs), but they're pretty good at it.

Modern robots would struggle to fold socks and put them in a drawer, but they're great at making cars.


I mean, with some of the recent demos, robots have got a lot better at folding stuff and putting it up. Not saying it's anywhere close to human level, but it has taken a pretty massive leap from being a joke just a few years ago.


They're hardly being advertised or sold on that premise. They advertise and sell themselves, because people try them out and find out they work, and tell their friends and/or audiences. ChatGPT is probably the single biggest bona-fide organic marketing success story in recorded history.


This is fantastic news for software engineers. Turns out that all those execs who've decided to incorporate AI into their product strategy have already tried it out and ensured that it will actually work.


> Turns out that all those execs who've decided to incorporate AI into their product strategy have already tried it out and ensured that it will actually work.

The 2-4-6 game comes to mind. They may well have verified the AI will work, but it's hard to learn the skill of thinking about how to falsify a belief.


You mean this one here? - https://mathforlove.com/lesson/2-4-6-puzzle/

Looking at the example patterns given:

  MATCH
  2, 4, 6
  8, 10, 12
  12, 14, 16
  20, 40, 60

  NOT MATCH
  10, 8, 6

If the answer is "numbers in ascending order", then this is a perfect illustration of synthetic vs. realistic examples. The numbers indeed fit that rule, so in theory, everything is fine. In practice, you'd be an ass to give such examples on a test, because they strongly hint the rule is more complex. Real data from a real process is almost never misleading in this way[0]. In fact, if you sampled such sequences from a real process, you'd be better off assuming the rule is "2k, 2(k+1), 2(k+2)", and treating the last example as some weird outlier.
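
To make that concrete, here's a quick sketch (mine, not from the linked page) that checks both candidate rules against the examples above. "Ascending" fits every row; the narrower "2k, 2(k+1), 2(k+2)" guess fits all the MATCH rows except (20, 40, 60), and only a probe that the narrow guess predicts should fail can tell the two apart:

  matches = [(2, 4, 6), (8, 10, 12), (12, 14, 16), (20, 40, 60)]
  non_matches = [(10, 8, 6)]

  def ascending(t):
      a, b, c = t
      return a < b < c

  def consecutive_evens(t):  # the narrower "2k, 2(k+1), 2(k+2)" hypothesis
      a, b, c = t
      return a % 2 == 0 and b == a + 2 and c == b + 2

  for rule in (ascending, consecutive_evens):
      print(rule.__name__,
            "fits:", [t for t in matches if rule(t)],
            "rejects:", [t for t in non_matches if not rule(t)])

  # Only a test the narrow hypothesis says should FAIL reveals the broader rule:
  print(ascending((1, 2, 3)), consecutive_evens((1, 2, 3)))  # True False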

Might sound like pointless nitpicking, but I think it's something to keep in mind wrt. generative AI models, because the way they're trained makes them biased towards reality and away from synthetic examples.

--

[0] - It could be if you have very, very bad luck with sampling. Like winning a lottery, except the prize sucks.


That's the one. Though where I heard it, you can set your own rule, not just use the example.

I'd say that every black swan is an example of a real process that is misleading.

But more than that, I mentioned verified/falsified, as in the difference between the two in science. We got a long way with just the first (Karl Popper only died in 1994), but it does seem to make a difference?


Who cares about execs? They know these tools work, but for them "works" is defined as "makes them money", not "does anything useful".

I'm talking about regular people, who actually put these tools to productive use and can tell the models are up to tasks previously unachievable.


Execs are important in the context of a discussion of how LLMs are advertised and sold.


I see this complaint about LLMs all the time - that they're advertised as being infallible but fail the moment you give them a simple logic puzzle or ask for a citation.

And yet... every interface to every LLM has a "ChatGPT can make mistakes. Check important info." style disclaimer.

The hype around this stuff may be deafening, but it's often not entirely the direct fault of the model vendors themselves, who even put out lengthy papers describing their many flaws.


There's evidently a large gap between what researchers publish, the disclaimers a vendor makes, and what gets broadcast on CNBC, no surprise there.


A bit like how Tesla Full Self-Driving is not to be used as self-driving. Or any other small print. Or ads in general. Lying by deliberately giving the wrong impression.


It would have to be called ChatAGI to be like Tesla FSD, where the company named it something it is most definitely not.


Humans are also shockingly bad on these tasks. And guess where the labeling was coming from…


Why do people expect these models, designed to be humanlike in their training, to be 100% perfect?

Humans fuck up all the time.


These models are marketed as being able to guide the blind or tutor children using direct camera access.

Promoting those use cases while the models fail in these ways is irresponsible. So, yeah, maybe the models are not embarrassing, but the hype definitely is.


> Promoting those use cases while the models fail in these ways is irresponsible.

Yes, exactly.


Well said.

It doesn't matter how they are marketed or described or held up to some standard generated by wishful thinking. And it especially doesn't matter what it would mean if a human were to make the same error.

It matters what they are, what they're doing, and how they're doing it. Feel free to be embarrassed if you are claiming they can do what they can't and are maybe even selling them on that basis. But there's nothing embarrassing about their current set of capabilities. They are very good at what they are very good at. Expecting those capabilities to generalize as they would if they were human is like getting embarrassed that your screwdriver can't pound in a nail, when it is ever so good at driving in screws.


You'd expect them to be trained on simple geometry, since you can create an arbitrarily large synthetic training set for that.
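
For instance (a toy sketch of what such synthetic data could look like, not anything any lab has published), programmatically generated geometry questions come with ground-truth labels for free:

  import random

  def random_circle():
      # center x, center y, radius
      return (random.uniform(-10, 10), random.uniform(-10, 10), random.uniform(0.5, 5))

  def make_example():
      (x1, y1, r1), (x2, y2, r2) = random_circle(), random_circle()
      overlap = (x1 - x2) ** 2 + (y1 - y2) ** 2 < (r1 + r2) ** 2
      prompt = (f"Circle A: center ({x1:.1f}, {y1:.1f}), radius {r1:.1f}. "
                f"Circle B: center ({x2:.1f}, {y2:.1f}), radius {r2:.1f}. "
                "Do they overlap?")
      return prompt, "yes" if overlap else "no"

  # As many labeled (question, answer) pairs as you care to generate:
  for prompt, answer in (make_example() for _ in range(3)):
      print(answer, "|", prompt)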


> is an embarrassing failure of the humans more than anything about the models

No, it's a failure of the companies who are advertising them as capable of doing something which they are not (assisting people with low vision).


But they CAN assist people with low vision. I've talked to someone who's been using a product based on GPT-4o and absolutely loves it.

Low vision users understand the limitations of accessibility technology better than anyone else. They will VERY quickly figure out what this tech can be used for effectively and what it can't.



