Given that the author is using the specific `box_2d` format, it suggests that he is taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.
That's true; it's also why I didn't benchmark against any other model provider.
It has been tuned so heavily on this specific format that even a tiny change, like switching the order in the `box_2d` format from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)`, causes performance to tank.
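For reference, here is a minimal sketch of consuming that output, assuming the documented convention that `box_2d` coordinates are normalized to a 0-1000 range in `[ymin, xmin, ymax, xmax]` order (the labels and response structure below are illustrative, not the author's actual pipeline):

```python
import json

def box_2d_to_pixels(box_2d, img_width, img_height):
    """Convert a normalized [ymin, xmin, ymax, xmax] box (0-1000) to pixel coords."""
    ymin, xmin, ymax, xmax = box_2d
    return (
        int(xmin / 1000 * img_width),   # left
        int(ymin / 1000 * img_height),  # top
        int(xmax / 1000 * img_width),   # right
        int(ymax / 1000 * img_height),  # bottom
    )

# Example model output (structure assumed for illustration)
response_text = '[{"label": "dog", "box_2d": [120, 300, 540, 760]}]'
for det in json.loads(response_text):
    print(det["label"], box_2d_to_pixels(det["box_2d"], 1024, 768))
```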
That's interesting, because it suggests the meaning and the representation are very tightly linked; I would have expected a looser coupling given that Gemini is multimodal.
Post-training lets the model leverage the considerable world and language understanding of the underlying pretrained model. My intuition is that this would be a boost to performance.