Hacker News

Side question: what are the current top go-to open models for image captioning and building image embedding DBs, with somewhat reasonable hardware requirements?


For pure image embedding, I find DINOv3 to be quite good. For multimodal embedding, maybe RzenEmbed. For captioning I would use a regular multimodal LLM, Qwen 3 or Gemma 3 or something, if your compute budget allows.
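Whichever embedder you pick, the "embedding DB" part mostly comes down to storing L2-normalized vectors and doing cosine-similarity search. A minimal, model-agnostic sketch (random vectors stand in for real DINOv3 features; swap in FAISS or similar once you outgrow in-memory NumPy):

```python
import numpy as np

def normalize(v):
    # L2-normalize so a dot product equals cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

class EmbeddingDB:
    """Tiny in-memory vector store for nearest-neighbor lookup."""
    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.ids = []

    def add(self, image_id, embedding):
        self.vectors = np.vstack([self.vectors, normalize(embedding)[None, :]])
        self.ids.append(image_id)

    def search(self, query, k=5):
        # Cosine similarity against all stored vectors, top-k by score
        sims = self.vectors @ normalize(query)
        top = np.argsort(-sims)[:k]
        return [(self.ids[i], float(sims[i])) for i in top]

# Random stand-ins for per-image embeddings (768-d, a typical ViT-B width)
rng = np.random.default_rng(0)
db = EmbeddingDB(dim=768)
for i in range(100):
    db.add(f"img_{i}", rng.standard_normal(768).astype(np.float32))

# A slightly perturbed copy of one stored vector should match itself first
query = db.vectors[42] + 0.01 * rng.standard_normal(768).astype(np.float32)
results = db.search(query, k=3)
```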


Try any of the Qwen3-VL models. The family includes 8B, 4B, and 2B variants.
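For captioning with one of these, the transformers chat-style API is the usual route. A hedged sketch: the checkpoint id and the `image-text-to-text` pipeline task below are assumptions based on the family named above, so check the model card before relying on them (the model call is gated behind a flag since it downloads weights):

```python
RUN_MODEL = False  # set True to actually download and run the checkpoint

def build_messages(image_path):
    # Chat-style message format that recent VL models expect in transformers
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_path},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }]

if RUN_MODEL:
    from transformers import pipeline  # heavy import kept behind the flag
    # Checkpoint id is an assumption — verify on the Hugging Face hub
    captioner = pipeline("image-text-to-text", model="Qwen/Qwen3-VL-2B-Instruct")
    print(captioner(text=build_messages("photo.jpg"), max_new_tokens=64))
```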


I would suggest YOLO. Depending on your domain, you might also finetune these models. It's relatively easy since they are not big LLMs, just image-classification or bounding-box models.

I would recommend bounding boxes.


What do you mean by "bounding boxes"? They were talking about captions and embeddings, so a vision-language model is required.


I suggested YOLO and other non-LLM vision models as a much faster alternative.

Otherwise, CLIP would of course be the other option besides a big vision-language LLM.


Which YOLO?


Any current one. They're easy to use, and you can just benchmark them yourself.

I'm using the small and medium variants.

Also, the code for using them is very short. You can also use ChatGPT to generate small experiments to see what fits your case better.


There aren’t any YOLO models for captioning and the other models aren’t robust enough to make for good embedding models.


You can get labels out of the classifier and bounding-box models.

They are super fast.
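To make this concrete: the detector gives you class labels per box, and turning those into a crude pseudo-caption is pure Python. The `ultralytics` call in the comment is a sketch of where the labels would come from (not run here); the formatting function itself is self-contained:

```python
from collections import Counter

def boxes_to_caption(class_names):
    """Turn a list of detected class labels into a rough pseudo-caption,
    e.g. ["person", "person", "dog"] -> "2 persons, 1 dog"."""
    counts = Counter(class_names)
    parts = [f"{n} {label}{'s' if n > 1 else ''}"
             for label, n in counts.most_common()]
    return ", ".join(parts) if parts else "no objects detected"

# With ultralytics (sketch only, requires downloading weights):
#   from ultralytics import YOLO
#   results = YOLO("yolo11n.pt")("photo.jpg")
#   labels = [results[0].names[int(c)] for c in results[0].boxes.cls]
caption = boxes_to_caption(["person", "person", "dog"])
```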

It's just an alternative I'm mentioning. I'm assuming the reader knows a little bit about the domain.

Otherwise, the first option would be CLIP, I assume. A vision-language LLM is just super slow and compute-intensive.



