I would suggest YOLO. Depending on your domain, you might also finetune these mo...

jabron · 2025-11-25T15:18:48 1764083928

What do you mean "bounding boxes"? They were talking about captions and embeddings, so a vision language model is required.

Glemkloksdjf · 2025-11-25T19:55:17 1764100517

I suggested YOLO and other non-llm-vl models as a much faster alternative.

Of course, CLIP would be the other option besides a big llm-vl model.

smallerize · 2025-11-25T14:02:17 1764079337

Which YOLO?

Glemkloksdjf · 2025-11-25T14:55:51 1764082551

Any current one. They are easy to use, and you can just benchmark them yourself.

I'm using the small and medium variants.

Also, the code for using it is very short. You can also use ChatGPT to generate small experiments to see which one fits your case better.
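
For the "benchmark them yourself" part, here is a rough sketch of a timing harness. It is not tied to any YOLO library: `predict` is a hypothetical stand-in for whatever inference call your model exposes, and `inputs` is whatever you feed it (image paths, arrays, and so on).

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def benchmark(
    predict: Callable[[object], object],
    inputs: Iterable[object],
    warmup: int = 1,
) -> Dict[str, float]:
    """Time `predict` on each input and return simple latency stats.

    `predict` is a placeholder for a model's inference call (an assumption,
    not a real library API).
    """
    items = list(inputs)
    if not items:
        raise ValueError("need at least one input to benchmark")

    # Warm-up runs are not timed; the first calls often include one-off setup cost.
    for item in items[:warmup]:
        predict(item)

    timings: List[float] = []
    for item in items:
        start = time.perf_counter()
        predict(item)
        timings.append(time.perf_counter() - start)

    return {
        "runs": len(timings),
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
    }
```

Run it once per model size on the same set of images and compare the medians; that is usually enough to see whether small or medium is the better trade-off for your case.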

throwaway314155 · 2025-11-25T15:07:56 1764083276

There aren’t any YOLO models for captioning and the other models aren’t robust enough to make for good embedding models.

Glemkloksdjf · 2025-11-25T19:56:44 1764100604

You can get labels out of the classifier and bounding box models.

They are super fast.

It's just an alternative I'm mentioning. I'm assuming the person knows a little about that domain.

Otherwise the first option would be CLIP, I assume. llm-vl is just super slow and compute-intensive.
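
To make "labels out of the bounding box models" a bit more concrete, here is a minimal sketch of turning detector output into tags. The `(label, confidence)` tuple format and the `detections_to_tags` helper are made up for illustration; a real detector returns its own result objects that you would map into this shape first.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical detection format: (class label, confidence score in [0, 1]).
Detection = Tuple[str, float]


def detections_to_tags(
    detections: Iterable[Detection],
    min_confidence: float = 0.5,
) -> List[Tuple[str, int]]:
    """Drop low-confidence detections and count how often each label occurs."""
    counts = Counter(
        label for label, confidence in detections if confidence >= min_confidence
    )
    # Most frequent labels first.
    return counts.most_common()


detections = [("dog", 0.9), ("person", 0.8), ("dog", 0.7), ("cat", 0.3)]
print(detections_to_tags(detections))  # [('dog', 2), ('person', 1)]
```

This gives you searchable tags, not captions or embeddings, so it only helps if tags are enough for your use case.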