I didn't know counting objects was a problem. That's pretty ironic because the v...

empath75 · on July 10, 2024

It really shouldn't be surprising that these models fail to do anything that _they weren't trained to do_. It's trivially easy to train a model to count stuff. The wild thing about transformer-based models is that their capabilities are _way_ beyond what you'd expect from token prediction. Figuring out where their limitations actually lie is interesting precisely because nobody fully knows.

jazzyjackson · on July 10, 2024

I agree that these open-ended transformers are way more interesting and impressive than a purpose-built count-the-polygons model, but if the model doesn't generalize well enough to figure out how to count the polygons, I can't be convinced that it'll perform usefully on a more sophisticated task.

I agree this research is really interesting, but I didn't have an a priori expectation of what token prediction could accomplish, so my reaction to a lot of the claims and counterclaims about this new tech is that it's good at fooling people and giving plausible but baseless results. It makes for good research, but it's dangerous in the hands of a market attempting to exploit it.

empath75 · on July 11, 2024

> I agree that these open-ended transformers are way more interesting and impressive than a purpose-built count-the-polygons model, but if the model doesn't generalize well enough to figure out how to count the polygons, I can't be convinced that it'll perform usefully on a more sophisticated task.

I think people get really wrapped up in the idea that a single model needs to be able to do all the things. LLMs can do a _lot_, but there doesn't actually need to be _one model to rule them all_. If VLMs are kind of okay at image interpretation but not great at details, we can supplement them with something that _can_ handle the details.
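
To make the "supplement them" idea concrete, here's a rough sketch of a dispatcher that sends counting questions to a specialist and everything else to the general model. Everything in it is made up for illustration: `general_vlm` and `polygon_counter` are hypothetical stand-ins, not real APIs, the "image" is just a list of shapes, and the keyword router is the crudest thing that could work.

```python
from typing import Callable, Dict, List

# Toy stand-in for an image: a list of shape names. A real system would
# pass pixel data; this is only here so the sketch runs on its own.
Image = List[str]


def general_vlm(image: Image, question: str) -> str:
    """Hypothetical general-purpose VLM: fine at gist, vague on details."""
    return "vlm: an image containing some shapes"


def polygon_counter(image: Image, question: str) -> str:
    """Hypothetical purpose-built counter: only counts, but counts exactly."""
    return f"counter: {len(image)} polygons"


def route(question: str) -> str:
    """Pick a handler name using a naive keyword check (an assumption)."""
    q = question.lower()
    if "how many" in q or "count" in q:
        return "count"
    return "general"


HANDLERS: Dict[str, Callable[[Image, str], str]] = {
    "count": polygon_counter,
    "general": general_vlm,
}


def answer(image: Image, question: str) -> str:
    """Dispatch the question to whichever model is suited to it."""
    return HANDLERS[route(question)](image, question)


if __name__ == "__main__":
    shapes = ["triangle", "square", "pentagon"]
    print(answer(shapes, "How many polygons are in this picture?"))
    print(answer(shapes, "What is this a picture of?"))
```

In practice the router could itself be the LLM deciding when to call a tool; the point is only that the detail work doesn't have to live in the same model as the general interpretation.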