The problem with the current crop of projectors such as LLaVA is that, as far as I know, they do not take the previous conversation into account. You only really get zero-shot responses. This means you cannot steer the model towards paying attention to specific instruction-related details. The projector simply creates a token representation of the visuals (not necessarily human-language tokens) and the LLM just processes that as usual.
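To make the point concrete, here's a minimal sketch of a LLaVA-style projector. All the dimensions, names, and the two-layer MLP shape are illustrative assumptions, not the actual LLaVA code; the thing to notice is that the projection is a fixed function of the image features alone, with no input from the conversation history.

```python
import numpy as np

# Hypothetical dimensions, for illustration only: a ViT-style encoder
# emits one embedding per image patch, and a small MLP maps each patch
# embedding into the LLM's token-embedding space.
VISION_DIM = 1024   # per-patch embedding size from the vision encoder
LLM_DIM = 4096      # the LLM's hidden/embedding size
NUM_PATCHES = 576   # e.g. a 24x24 patch grid

rng = np.random.default_rng(0)

# Two-layer MLP projector (LLaVA-1.5 uses a similar mlp2x_gelu design).
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(patch_embeddings):
    """Map vision-encoder patch embeddings into LLM embedding space.

    Note: no conversation state is passed in -- the visual tokens come
    out the same regardless of what instructions preceded the image.
    """
    return gelu(patch_embeddings @ W1) @ W2

patch_embeddings = rng.standard_normal((NUM_PATCHES, VISION_DIM))
visual_tokens = project(patch_embeddings)
print(visual_tokens.shape)  # one "soft token" per patch
```

The resulting `visual_tokens` are spliced into the text-token sequence and handled by the LLM like any other embeddings, which is why the steering you'd want has to happen downstream rather than in the projection itself.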
The original GPT-4 did this too: it had almost no memory of the conversation before or after the image provided. I haven't tested GPT-4o on this directly, but from casual usage my sense is that it's better.
I do think some of these thin line drawings are likely extra hard to tokenize, depending on how the image is scaled for tokenization. I'd wager thicker lines would help, although obviously not all of the failure comes down to 'poor tokenization'.