Given that the author is using the specific `box_2d` format, it suggests that he is taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.
That's true; it's also why I didn't benchmark against any other model provider.
It has been tuned so heavily on this specific format that even a tiny change, like switching the order in the `box_2d` format from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)`, causes performance to tank.
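For reference, here is a minimal sketch of consuming that output, assuming the documented convention that `box_2d` coordinates are normalized to a 0-1000 range in `[ymin, xmin, ymax, xmax]` order (the labels and response structure below are illustrative, not the author's actual pipeline):

```python
import json

def box_2d_to_pixels(box_2d, img_width, img_height):
    """Convert a normalized [ymin, xmin, ymax, xmax] box (0-1000) to pixel coords."""
    ymin, xmin, ymax, xmax = box_2d
    return (
        int(xmin / 1000 * img_width),   # left
        int(ymin / 1000 * img_height),  # top
        int(xmax / 1000 * img_width),   # right
        int(ymax / 1000 * img_height),  # bottom
    )

# Example model output (structure assumed for illustration)
response_text = '[{"label": "dog", "box_2d": [120, 300, 540, 760]}]'
for det in json.loads(response_text):
    print(det["label"], box_2d_to_pixels(det["box_2d"], 1024, 768))
```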
That's interesting, because it suggests the meaning and the representation are very tightly linked; I would have expected a looser coupling given that Gemini is multimodal.
Post-training lets the model leverage the considerable world and language understanding of the underlying pretrained model. My intuition is that this would be a boost to performance.