
A detail that is not mentioned is that Google models >= Gemini 2.0 are all explicitly post-trained for this task of bounding box detection: https://ai.google.dev/gemini-api/docs/image-understanding

Given that the author is using the specific `box_2d` format, it suggests that he is taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.
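
For anyone curious, here is a rough sketch of what the documented detection prompt can look like with the google-genai SDK; the model name, prompt wording, image path, and response parsing are illustrative assumptions, not the author's exact setup:

    # Sketch: asking Gemini for box_2d detections via the google-genai SDK.
    import json
    from google import genai
    from PIL import Image

    client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment
    image = Image.open("street.jpg")  # hypothetical local image

    prompt = (
        "Detect all cars in the image. Return a JSON list where each entry "
        "has 'label' and 'box_2d' as [ymin, xmin, ymax, xmax], "
        "normalized to 0-1000."
    )

    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[image, prompt],
    )

    # The reply is JSON text, possibly wrapped in a code fence; strip it first.
    text = response.text.strip().removeprefix("```json").removesuffix("```")
    for det in json.loads(text):
        print(det["label"], det["box_2d"])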



That's true, it's also why I didn't benchmark against any other model provider.

It has been tuned so heavily on this specific format that even a tiny change, like switching the order in the `box_2d` format from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)`, causes performance to tank.
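
For reference, a minimal sketch of mapping the tuned `box_2d` convention (normalized to 0-1000, ordered `(ymin, xmin, ymax, xmax)`) back to pixel-space `(xmin, ymin, xmax, ymax)`; the helper name is just illustrative:

    # Convert Gemini's box_2d ([ymin, xmin, ymax, xmax], normalized to 0-1000)
    # into pixel-space (xmin, ymin, xmax, ymax) for a given image size.
    def box_2d_to_pixels(box_2d, width, height):
        ymin, xmin, ymax, xmax = box_2d
        return (
            int(xmin / 1000 * width),
            int(ymin / 1000 * height),
            int(xmax / 1000 * width),
            int(ymax / 1000 * height),
        )

    # Example: a detection covering roughly the center of a 1920x1080 image.
    print(box_2d_to_pixels([250, 250, 750, 750], 1920, 1080))  # (480, 270, 1440, 810)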


That's interesting because it suggests the meaning and representation are very tightly linked; I would expect it to be less tightly coupled given Gemini is multimodal.


It's really impressive what Gemini models can do. Segmentation too! https://ai.google.dev/gemini-api/docs/image-understanding#se...
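
A minimal sketch of decoding one segmentation entry, assuming the fields described in the linked docs (`box_2d`, `label`, and a base64-encoded PNG `mask` sized to the bounding box):

    # Decode one segmentation entry from the JSON format the docs describe.
    import base64
    import io
    from PIL import Image

    def decode_mask(entry):
        data = entry["mask"]
        prefix = "data:image/png;base64,"
        if data.startswith(prefix):  # the mask may carry a data-URI prefix
            data = data[len(prefix):]
        mask = Image.open(io.BytesIO(base64.b64decode(data)))
        return entry["label"], entry["box_2d"], mask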


this is very cool!


I was really shocked when I first saw this, but yes, it's in the training data; it's not an emergent reasoning feature.


Why do they do post-training instead of just delegating segmentation to a smaller, purpose-built model?


Post-training lets the model leverage the considerable world and language understanding of the underlying pretrained model. The intuition is that this gives a performance boost over a separate, purpose-built segmentation model.



