
U-Net is a brilliant architecture, and it still seems to beat this model at scaling the segmentation mask up from 256x256 to the full image resolution. I also don't think U-Net really benefits from the massive internal feature size produced by the vision transformer used for image encoding.

But I'm impressed by this model's ability to create an image encoding that is independent of the prompt. I feel like there may be lessons in the training approach that could be carried over to U-Net for a more valuable encoding.
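To make the appeal concrete: a prompt-independent encoding means the expensive image encoder runs once per image, and only a lightweight prompt-conditioned decoder runs per prompt. Here's a toy sketch of that decoupling (all function names and the "features" are hypothetical stand-ins, not the real model's API):

```python
def encode_image(image):
    # Stand-in for the heavy ViT image encoder: runs ONCE per image,
    # knows nothing about any prompt. Here we just reduce each row
    # to a toy feature value.
    return [sum(row) % 256 for row in image]

def decode_mask(embedding, prompt_xy):
    # Stand-in for the lightweight prompt-conditioned mask decoder:
    # cheap, so it can be re-run for every new prompt against the
    # same cached embedding.
    x, y = prompt_xy
    return [[1 if (f + x + y) % 2 == 0 else 0 for f in embedding]
            for _ in range(len(embedding))]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
embedding = encode_image(image)  # expensive step, done once

# Many prompts reuse the same embedding -- no re-encoding needed.
masks = [decode_mask(embedding, p) for p in [(0, 0), (1, 2), (2, 1)]]
```

A plain encoder-decoder U-Net entangles the two stages, so there's no cached intermediate you can cheaply re-query, which is presumably what a carried-over training approach would have to change.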


