Haven't read the paper yet, but curious how ControlNet differs from Text2LIVE ([1], [2]). Seems like it's solving the same problem, including temporal consistency, no?
No, ControlNet wasn't made to solve temporal consistency; it was made to add more control (hence the name) to image models. I'm using it in a way the authors may not have anticipated, since the paper doesn't mention video editing.
[1] https://www.youtube.com/watch?v=8U9o5aZ2y5w
[2] https://text2live.github.io/