
GPT-4o is very good at some visual tasks, like optical character recognition. So the selective blindness might just be what you say here -- all of its capacity is dedicated to minimizing loss on the few narrow tasks that had the most training data (like OCR). In that case it's not necessarily an inherent failure of the architecture to generalize; it could just be a capacity issue that will naturally be resolved with more scale.


Is that not just traditional OCR applied on top of an LLM?


It's possible they have a software layer that does that, but I was assuming they don't, because the open-source multimodal models don't.
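
For concreteness, a minimal sketch of what such a "software layer" could look like if it existed: run a conventional OCR engine first, then hand the extracted text to a text-only LLM. This is purely illustrative of the hypothesis being discussed, not anything confirmed about GPT-4o; pytesseract is just one example OCR engine, and llm_complete is a hypothetical stand-in for a text completion call.

    # Hypothetical "OCR layer on top of an LLM" pipeline (illustrative only).
    from PIL import Image
    import pytesseract

    def llm_complete(prompt: str) -> str:
        # Stand-in for a text-only LLM call; a real system would hit an API here.
        return f"(model response to: {prompt[:60]}...)"

    def answer_about_image(image_path: str, question: str) -> str:
        # Step 1: conventional OCR extracts the text from the image.
        extracted_text = pytesseract.image_to_string(Image.open(image_path))
        # Step 2: the LLM only ever sees text, never pixels.
        prompt = f"Image text:\n{extracted_text}\n\nQuestion: {question}"
        return llm_complete(prompt)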


No, it's not; it's a multimodal transformer model.
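
To illustrate the distinction, here is a minimal sketch of how native multimodal input works in open-source models like LLaVA: image patches are encoded and projected into the same embedding space as text tokens, and a single transformer attends over both. All dimensions and layer choices below are illustrative, not GPT-4o's actual configuration.

    # Minimal sketch of native multimodal input (illustrative dimensions).
    import torch
    import torch.nn as nn

    d_model = 512                                  # shared embedding width
    vision_encoder = nn.Linear(16 * 16 * 3, 768)   # stand-in patch encoder
    projector = nn.Linear(768, d_model)            # maps vision features to token space
    text_embed = nn.Embedding(32000, d_model)      # ordinary text token embeddings

    patches = torch.randn(1, 196, 16 * 16 * 3)     # 196 flattened 16x16 RGB patches
    text_ids = torch.randint(0, 32000, (1, 20))    # 20 text tokens

    image_tokens = projector(vision_encoder(patches))  # (1, 196, d_model)
    text_tokens = text_embed(text_ids)                 # (1, 20, d_model)

    # One sequence, one model: the transformer sees image and text tokens alike,
    # so any OCR ability has to emerge from training rather than a separate layer.
    sequence = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 216, d_model)
    layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
    output = layer(sequence)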




