I would recommend bounding boxes.
Of course, CLIP would be the other option besides a big llm-vl model.
I'm using small and medium.
Also, the code for using it is very short and easy. You can also use ChatGPT to generate small experiments to see what fits your case better.
They are super fast.
It's just an alternative I'm mentioning. I would assume the person knows a little bit about that domain.
Otherwise the first option would be CLIP, I assume. llm-vl is just super slow and compute-intensive.