I didn't see where you got those numbers, but surely that's just a matter of throwing more compute at it? From the blog post:
> This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU.
For the first SAM model, you needed to encode the input image, which took about 2 seconds (on a consumer GPU), but any detection you then ran against the encoded image was on the order of milliseconds. The blog post isn't entirely clear on this, but I'm assuming the 30ms covers the encoder plus 100 runs of the detector.
Even if it were 4s per frame, you could still process a stream in "realtime" by parallelizing the frames across GPUs; the output latency would just be 4s. At 4s per frame that means a cluster of 120 GPUs for 30fps or 240 for 60fps, but at 30ms per image you only need 2 GPUs to keep up with a 60fps video stream.