
I didn't know counting objects was a problem. That's pretty ironic, because the very first implementation of a neural net (AFAIK) is the numa-rete artificial retina developed at the Biological Computer Lab [0] circa 1960. It was a parallel analog computer composed of "neurons", each with a photocell, that could be arranged in a grid and count "the number of objects independent of their size, location and form, and independent of strength of illumination" [1]. That paper may be of interest to those in the field: "Perception of Form in Biological and Man Made Systems", Heinz Von Foerster, 1962.

[0] https://distributedmuseum.illinois.edu/exhibit/biological_co...

[1] https://sites.evergreen.edu/arunchandra/wp-content/uploads/s...



It really shouldn't be surprising that these models fail to do anything that _they weren't trained to do_. It's trivially easy to train a model to count stuff. The wild thing about transformer-based models is that their capabilities are _way_ beyond what you'd expect from token prediction. Figuring out where their limitations actually lie is interesting precisely because nobody fully knows yet.


I agree that these open-ended transformers are way more interesting and impressive than a purpose-built count-the-polygons model, but if a model doesn't generalize well enough to figure out how to count the polygons, I can't be convinced that it will perform usefully on a more sophisticated task.

I agree this research is really interesting, but I didn't have an a priori expectation of what token prediction could accomplish, so my reaction to a lot of the claims and counterclaims about this new tech is that it's good at fooling people and giving plausible but baseless results. It makes for good research, but it's dangerous in the hands of a market attempting to exploit it.


> I agree that these open-ended transformers are way more interesting and impressive than a purpose-built count-the-polygons model, but if a model doesn't generalize well enough to figure out how to count the polygons, I can't be convinced that it will perform usefully on a more sophisticated task.

I think people get really wrapped up in the idea that a single model needs to be able to do all the things. LLMs can do a _lot_, but there doesn't actually need to be _one model to rule them all_. If VLMs are kind of okay at image interpretation but not great at details, we can supplement them with something that _can_ handle the details.
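
To make "something that can handle the details" a bit more concrete, here is a rough sketch of one possible supplement: a classical connected-component counter that a VLM pipeline could hand a thresholded image to. Nothing here comes from the thread or the paper under discussion. The function name `count_objects`, the binary-grid input, and the choice of 4-connectivity are all assumptions made for illustration.

```python
from collections import deque


def count_objects(grid):
    """Count 4-connected groups of truthy cells in a 2D grid.

    Hypothetical helper: `grid` is assumed to be a list of equal-length
    rows, e.g. a binarized image where 1 means "part of an object".
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0

    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Found an unvisited object pixel: flood-fill its whole component.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count
```

The count depends only on connectivity, not on the size or position of each blob, so the language model would only need to decide when to call it and how to describe the result.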



