This is kind of the visual equivalent of asking an LLM to count letters. The fai...

This is kind of the visual equivalent of asking an LLM to count letters. The failure is more related to the tokenization scheme than the underlying quality of the model.

I'm not certain about the specific models tested, but some VLMs just embed the image modality into a single vector, making these tasks literally impossible to solve.