I know it may be not what you are looking for, but most of such models generate multiple-scale image features through an image encoder, and those can be very easily fine-tuned for a particular task, like some polygon prediction for your use case. I understand the main benefit of such promptable models to reduce/remove this kind of work in the first place, but could be worth and much more accurate if you have a specific high-load task !