Side question: what are the current top goto open models for image captioning an...

daemonologist · 2025-11-25T18:03:22 1764093802

For pure image embedding, I find DINOv3 to be quite good. For multimodal embedding, maybe RzenEmbed. For captioning I would use a regular multimodal LLM, Qwen 3 or Gemma 3 or something, if your compute budget allows.
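
For what it's worth, once you have embeddings from whichever model, the retrieval side is usually just cosine similarity. A minimal sketch, assuming the vectors are plain lists of floats; the names `cosine_similarity` and `rank_by_similarity` and the toy vectors are made up for illustration, not part of any model's API:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors rather than dividing by zero.
    return dot / norm if norm else 0.0

def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 2-D vectors standing in for real image embeddings (hypothetical data).
candidates = {"a.jpg": [1.0, 0.0], "b.jpg": [0.0, 1.0], "c.jpg": [1.0, 1.0]}
ranking = rank_by_similarity([1.0, 0.0], candidates)
```

Real embeddings have hundreds of dimensions and you would likely use a vector library, but the ranking logic is the same.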

NitpickLawyer · 2025-11-25T13:44:37 1764078277

Try any of the Qwen3-VL models. The family includes 8B, 4B, and 2B variants.

Glemkloksdjf · 2025-11-25T13:51:53 1764078713

I would suggest YOLO. Depending on your domain, you might also finetune these models. It's relatively easy, as they are not big LLMs but either image classification or bounding-box models.

I would recommend bounding boxes.

jabron · 2025-11-25T15:18:48 1764083928

What do you mean by "bounding boxes"? They were talking about captions and embeddings, so a vision language model is required.

Glemkloksdjf · 2025-11-25T19:55:17 1764100517

I suggested YOLO and other non-LLM-VL models as a much faster alternative.

Of course, CLIP would be the other option besides a big LLM-VL model.

smallerize · 2025-11-25T14:02:17 1764079337

Which YOLO?

Glemkloksdjf · 2025-11-25T14:55:51 1764082551

Any current one. They are easy to use and you can just benchmark them yourself.

I'm using the small and medium variants.

Also, the code for using it is very short and simple. You can also use ChatGPT to generate small experiments to see what fits your case better.

throwaway314155 · 2025-11-25T15:07:56 1764083276

There aren’t any YOLO models for captioning and the other models aren’t robust enough to make for good embedding models.

Glemkloksdjf · 2025-11-25T19:56:44 1764100604

You can get labels out of the classifier and bounding box models.

They are super fast.
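
To make that concrete, here is a rough sketch of turning detector output into tags. It assumes the detections have already been pulled out of the model as `(label, confidence)` pairs; that format, the `labels_to_tags` name, and the 0.5 threshold are my own placeholders, not any YOLO library's API:

```python
from collections import Counter

def labels_to_tags(detections, min_conf=0.5):
    """Collapse (label, confidence) pairs into a short list of tags.

    Detections below min_conf are dropped; repeated labels are counted.
    """
    counts = Counter(label for label, conf in detections if conf >= min_conf)
    # most_common() orders by count, with ties kept in first-seen order.
    return [f"{n}x {label}" if n > 1 else label for label, n in counts.most_common()]

# Hypothetical detections for one image.
detections = [("dog", 0.91), ("dog", 0.84), ("person", 0.77), ("frisbee", 0.32)]
tags = labels_to_tags(detections)
```

This gives you tags, not a sentence-level caption, so it only helps if a fixed label set covers your domain.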

It's just an alternative I'm mentioning. I would assume the person knows a little bit about that domain.

Otherwise, the first option would be CLIP, I assume. LLM-VL is just super slow and compute-intensive.