They are super fast.
It's just an alternative I'm mentioning. I'm assuming a person who knows a little bit about that domain.
Otherwise the first option would be CLIP, I assume; llm-vl is just super slow and compute-intensive.