Side question: what are the current go-to open models for image captioning and for building image embedding DBs, with somewhat reasonable hardware requirements?
For pure image embedding, I find DINOv3 to be quite good. For multimodal embedding, maybe RzenEmbed. For captioning I would use a regular multimodal LLM, Qwen 3 or Gemma 3 or something, if your compute budget allows.
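For the embedding DB part, here is a rough sketch of what the lookup side could look like, whichever model produces the vectors. It is only an illustration: `EmbeddingIndex` and `normalize` are hypothetical names, the 3-dimensional vectors are hand-written stand-ins for real model outputs, and for more than a few thousand images you would probably want a proper vector store rather than a Python list.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class EmbeddingIndex:
    """Tiny in-memory index (hypothetical helper, brute-force search)."""

    def __init__(self):
        self.items = []  # list of (image_id, unit_vector)

    def add(self, image_id, vector):
        # Store the normalized vector once, so search only needs dot products.
        self.items.append((image_id, normalize(vector)))

    def search(self, query, k=3):
        # Score every stored vector against the query and return the top k.
        q = normalize(query)
        scored = [
            (image_id, sum(a * b for a, b in zip(vec, q)))
            for image_id, vec in self.items
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Stand-in embeddings; in practice these would come from the image model.
index = EmbeddingIndex()
index.add("cat.jpg", [0.9, 0.1, 0.0])
index.add("dog.jpg", [0.7, 0.3, 0.1])
index.add("car.jpg", [0.0, 0.2, 0.9])

# Returns (image_id, cosine_similarity) pairs, best match first.
print(index.search([1.0, 0.0, 0.0], k=2))
```

With these stand-in vectors, `cat.jpg` ranks first for the query. Normalizing at insert time is a design choice that keeps the search loop down to plain dot products.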
I would suggest YOLO. Depending on your domain, you might also fine-tune these models. It's relatively easy, since they are not big LLMs but image classification or bounding-box detection models.