I didn't see where you got those numbers, but surely that's just a matter of throwing more compute at it? From the blog post:
> This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU.
For the first SAM model, you needed to encode the input image, which took about 2 seconds (on a consumer GPU), but any detection you then ran against the encoded image was on the order of milliseconds. The blog post isn't entirely clear on this, but I'm assuming the 30ms covers the encoder plus 100 runs of the detector.
Even if it were 4s per frame, you could still process a stream in "realtime" by parallelizing the frames across GPUs; the output latency would just be 4s. At 4s per frame that means a cluster of 120 GPUs for 30fps or 240 for 60fps, but at 30ms per image you only need 2 GPUs to keep up with a 60fps video stream.