This is really awesome. A question for someone who knows more about this:
How much harder would it be to make this work using any number of photos? I'm assuming this is the end goal for a model like this.
Imagine being able to create an accurate enough 3D rendering of any interior with just a bunch of snapshots anyone can take with their phone.
Probably not much harder, but you wouldn't get the same massive jump in quality that you get going from 1 image to 2. What you're describing is NeRF/Gaussian Splatting in general, but from the looks of it, this just does it in a single forward pass rather than optimising the Gaussian/network weights per scene.
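To make that distinction concrete, here's a toy sketch, not the model from the post. It fits one 1D Gaussian two ways: iteratively optimising its parameters against a single target (the per-scene route), and mapping the observation to parameters in one shot (the feed-forward route). All names here (`render`, `fit_by_optimisation`, `predict_in_one_pass`) are made up for illustration, and the one-shot function uses weighted moments as a stand-in for what would be a trained network in a real system.

```python
import numpy as np

# Toy 1D "scene": one Gaussian blob sampled on a fixed grid.
xs = np.linspace(-5.0, 5.0, 201)

def render(mu, sigma):
    """Render a single 1D Gaussian with centre mu and width sigma."""
    return np.exp(-0.5 * ((xs - mu) / sigma) ** 2)

def fit_by_optimisation(target, steps=1000, lr=0.5):
    """Per-scene route: gradient descent on (mu, sigma) against one target."""
    mu, sigma = 0.0, 2.0  # arbitrary starting guess
    for _ in range(steps):
        pred = render(mu, sigma)
        err = pred - target
        z = (xs - mu) / sigma
        # Analytic gradients of the mean squared error.
        grad_mu = np.mean(2.0 * err * pred * z / sigma)
        grad_sigma = np.mean(2.0 * err * pred * z ** 2 / sigma)
        mu -= lr * grad_mu
        sigma -= lr * grad_sigma
    return mu, sigma

def predict_in_one_pass(target):
    """Feed-forward route: map the observation to parameters with no loop.

    Stand-in only: a real system would use a trained network here;
    this uses weighted moments so the example stays self-contained.
    """
    w = target / target.sum()
    mu = float(np.sum(w * xs))
    sigma = float(np.sqrt(np.sum(w * (xs - mu) ** 2)))
    return mu, sigma

target = render(1.0, 1.0)
print("optimised:", fit_by_optimisation(target))
print("one pass: ", predict_in_one_pass(target))
```

Both routes should land near the same parameters; the difference is that the first pays for a loop on every new scene, while the second pays up front (in training, for a real network) and is then a single evaluation.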
A lot of splat systems do work on uncalibrated images, so that's not novel either. They all just do a camera solve, which arguably isn't terrible for a stereo pair with low divergence.