
I would suggest YOLO. Depending on your domain, you might also fine-tune these models. It's relatively easy, as they are not big LLMs but either image classification or bounding box models.

I would recommend bounding boxes.



What do you mean "bounding boxes"? They were talking about captions and embeddings, so a vision language model is required.


I suggested YOLO and non-llm-vl models as a much faster alternative.

Of course, CLIP would otherwise be the other option besides a big llm-vl one.


Which YOLO?


Any current one. They are easy to use and you can just benchmark them yourself.

I'm using small and medium.

Also, the code for using it is very short and easy to use. You can also use ChatGPT to generate small experiments to see what fits your case better.
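As a rough illustration of "just benchmark them yourself", here is a minimal timing sketch. It assumes each candidate is wrapped in a plain callable that takes one image. The names `benchmark`, `fake_small` and `fake_medium` are hypothetical, and the two stand-ins only sleep to simulate inference; in practice each callable would wrap a loaded model's own inference call.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(models: Dict[str, Callable[[object], object]],
              inputs: List[object],
              warmup: int = 1) -> Dict[str, float]:
    """Return the mean seconds per input for each candidate model."""
    if not inputs:
        raise ValueError("need at least one input to benchmark")
    results: Dict[str, float] = {}
    for name, predict in models.items():
        # Warm-up calls so one-time setup cost does not skew the timing.
        for item in inputs[:warmup]:
            predict(item)
        timings = []
        for item in inputs:
            start = time.perf_counter()
            predict(item)
            timings.append(time.perf_counter() - start)
        results[name] = mean(timings)
    return results


# Hypothetical stand-ins for a small and a medium model (assumption:
# the medium variant is slower). Replace with real inference calls.
def fake_small(image: object) -> None:
    time.sleep(0.001)


def fake_medium(image: object) -> None:
    time.sleep(0.005)


if __name__ == "__main__":
    images = [f"image_{i}.jpg" for i in range(5)]  # placeholder inputs
    for name, secs in benchmark({"small": fake_small,
                                 "medium": fake_medium}, images).items():
        print(f"{name}: {secs * 1000:.2f} ms per image")
```

Running the same images through each variant gives a per-image latency to weigh against accuracy on your own data.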


There aren’t any YOLO models for captioning and the other models aren’t robust enough to make for good embedding models.


You can get labels out of the classifier and bounding box models.

They are super fast.

It's just an alternative I'm mentioning. I would assume a person who knows a little bit about that domain.

Otherwise the first option would be CLIP, I assume. llm-vl is just super slow and compute intensive.
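The point about getting labels out of classifier and bounding box models could look roughly like this. It is only a sketch: it assumes the detector output has already been reduced to `(class name, confidence)` pairs, and `detections_to_labels` and the `min_conf` threshold are hypothetical names, not part of any YOLO library. The result is a list of coarse tags, not a caption.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed shape of one detection: (class name, confidence score).
Detection = Tuple[str, float]


def detections_to_labels(detections: Iterable[Detection],
                         min_conf: float = 0.5) -> List[str]:
    """Collapse detections into label tags, most frequent class first."""
    # Count only the detections at or above the confidence threshold.
    counts = Counter(name for name, conf in detections if conf >= min_conf)
    # most_common() orders by count; ties keep first-seen order.
    return [f"{name} x{n}" if n > 1 else name
            for name, n in counts.most_common()]


if __name__ == "__main__":
    example = [("dog", 0.9), ("person", 0.8), ("dog", 0.7), ("cat", 0.3)]
    print(detections_to_labels(example))
```

The tags can then be stored or indexed as plain text alongside each image.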



