The camera motions do not need to be the same. Gaussian splatting reconstructs the scene in 3D, and you can then render the scene from arbitrary angles, so they just gave it a random camera motion to show off the 3D-ness of it.
Haven't read the paper yet, but curious how different ControlNet is from Text2LIVE ([1], [2]). Seems it's solving the same problem with temporal consistency, no?
No, ControlNet wasn't made to solve temporal consistency, it was made to add more control (hence the name) to image models. I am using it in a way that the authors may not have thought of, because the paper doesn't mention video editing.
So, the video was generated by applying ControlNet to the input video frame by frame. Every inference setting is the same for every frame -- seed, prompt, CFG, steps, and sampler. The only thing that changes from frame to frame is that the pose changes slightly. So actually, if SD were well behaved, you would expect the difference between adjacent frames to be small, because the change in the input is small. But SD is quite sensitive to small perturbations, so you get this amount of flicker from even small changes in input.
I also had to specify what the outfit should be (when I didn't, the outfit changed from frame to frame, which caused a lot more discrepancies). You can see that the outfit changes color in the second version; I bet you can get that to be even more consistent if you specify the color in the prompt too.
If you create a dreambooth model of a character, you can probably also get consistency of the face that way. In this case I didn't need to do this because I didn't care who I got, I just asked for an "average woman".
The flickering comes from the fundamental nature of the denoising mechanism in the diffusion model. The ability to create multiple novel images for the same input comes from adding noise with a random seed. Currently this is more or less done every frame, which is why you get the flickering. Keeping the same seed wouldn't be helpful if you want the image to move.
What could be of use here is a noise transformation layer that can use the same noise for every frame but transformed to match desired motion. For video conversion you could possibly extract motion vectors from successive frames to warp the noise.
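A minimal sketch of that idea, assuming one global integer motion vector per frame (a real implementation would use per-pixel optical flow and subpixel resampling):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed: same base noise for every frame
h, w = 64, 64
base_noise = rng.standard_normal((h, w))

def warp_noise(noise, dx, dy):
    """Translate the noise field by integer motion vectors (dx, dy).

    Simplified sketch: np.roll wraps at the edges, and the motion is one
    global vector instead of a per-pixel flow field.
    """
    return np.roll(np.roll(noise, dy, axis=0), dx, axis=1)

# Frame t+1 reuses frame t's noise, translated to follow the motion,
# so the structure baked into the noise moves with the subject
# instead of staying pinned to the image plane.
frame0_noise = base_noise
frame1_noise = warp_noise(base_noise, dx=2, dy=0)
```

With per-pixel motion vectors you would instead resample the noise at the warped coordinates, but the fixed-noise-moved-with-the-scene idea is the same.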
"The flickering comes from the fundamental nature of the de-noising mechanism involved in the diffusion model." -- agreed
"Keeping the same seed wouldn't be helpful if you want the image to move." -- No, I'm using the same seed (and prompt). The image moves because ControlNet opens up another channel of input, in this case the pose data.
Yes, but that still produces temporal aliasing because the unmoving noise is battling the moving ControlNet input. I can't find it right now, but there was a good example showing a gallery of one-word prompts with the same seed. While the images were of different subjects, you could clearly see the impact of the noise controlling the layout. What was a capital letter A in one image was a person's legs in another, but that same overall structure was visible in the same place in 90% of the images.
I wonder if putting an adversary network on top would reduce the flickering: a mechanism that only accepts a frame if it is detected to be the next frame in a video of the same person, and otherwise regenerates.
that’s really really really good, I have an overtrained Dreambooth model I was using with controlnet and even mine was flickering in the face more than this
Proof of concept for video to video editing. An input video is transformed to an output that matches the pose but with an AI generated character via a prompt. There is some flicker, but the result is much more consistent than existing methods (e.g., image2image).
I wouldn't say that "the opportunity all seems to be on the front end". Specifically for stable diffusion, there are a lot of different ways to use the model. I think we're just starting to scratch the surface of what SD can do, so there is some value in tinkering with different ways to use and apply the model.
Example 2: you can merge different dreambooth models together with varying degrees of success (the idea being, you train model A on subject A, model B on subject B, and now you want to generate pictures of A and B together). My understanding is that this doesn't work too well at the moment, but it's possible that a different interpolation algorithm could yield better results.
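For illustration, the simplest merge is a linear interpolation of the two checkpoints' weights. This is a toy sketch: `merge_models` is a hypothetical helper (not a library function), and plain floats stand in for the tensors a real SD state dict would hold.

```python
def merge_models(state_a, state_b, alpha=0.5):
    """Linearly blend two state dicts: alpha=0 gives model A, alpha=1 gives model B."""
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}

# Toy "checkpoints" with one layer's worth of scalar weights.
model_a = {"layer.weight": 1.0, "layer.bias": -2.0}
model_b = {"layer.weight": 3.0, "layer.bias": 6.0}
merged = merge_models(model_a, model_b, alpha=0.5)
```

Naive averaging like this tends to blur both subjects together, which is presumably why a smarter interpolation scheme (per-layer alphas, merging only attention weights, etc.) could do better.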
I do agree with the general sentiment that you wouldn't necessarily be training your own models or creating your own architecture, just want to provide the perspective that understanding the AI side is valuable because it can lead to different capabilities and products.
I'd say the "opportunity" is primarily in creating business models with these new AIs. The AI field is going to keep innovating on them regardless of what any individual chooses to do. Discussion of viable applications of these new capabilities is scant; beyond reducing existing technical-artist head counts, I see little to no discussion of new capabilities and new applications that were not possible before. Sure, we're developers, but we're also supposed to be entrepreneurs, and this lack of creative discussion about what can be done that was not possible before is curious in itself.
Guess based on historical anecdata: I think that is because it looks like the current gen of AIs will help automate a lot of stuff but won't enable anything new. I would say we are at the point analogous to when "computer" no longer meant a chain of humans with calculators. First the automation needs to become entrenched; then new innovations can emerge.
Can I quote your "current gen AIs will help automate a lot of stuff, but won't enable anything new" in the future, when that "something new" tears a new hole in our global economy?
Personally, I see this tech as capable of destroying and recreating the advertising industry completely by inserting everyday consumers into ad media, depicting them as happy consumers of a product they've not used yet, while celebrity spokespeople, appearing as their personal friends, tell them how much the celebrity idolizes them for using said product. This is an obvious, unsubtle application. There will be many, many more.
Current gen is not ready for that, but it will be soon enough, for sure. When I said "gen" I meant "not currently", but with this speed of development I'm not ready to bet on whether the scenario you described is 3 months or 5 years away.
I agree that the images currently produced by, say, SD are more of a curiosity than true art. Give it a year or so, I would say, and we will change our minds. Remember the early VR models? Not comparable to the quality you have now in real time. And AI seems to be increasing in output quality at a much faster pace.
Hi, I run http://synapticpaint.com/ and using AI image generation for graphic novels/comics is one of the directions I'm exploring. If you're interested in collaborating to make your graphic novel a reality (I'll provide the tooling in return for product feedback), please email me at the email address in my profile! (I poked around on your site but couldn't find an email address.) Thanks!
On AWS, a g5 instance costs $1/hr. I can generate roughly 10 images per minute (should be able to improve this with some optimization), so 600 images per hour, which puts the cost per image at 1/6 of a cent, before adding overhead (idle time, start up/shut down).
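The arithmetic, spelled out:

```python
# Per-image cost for Stable Diffusion inference on an AWS g5 instance,
# using the figures quoted above.
cost_per_hour = 1.00                      # USD per hour for the g5 instance
images_per_minute = 10
images_per_hour = images_per_minute * 60  # 600 images/hour
cost_per_image_usd = cost_per_hour / images_per_hour
cost_per_image_cents = cost_per_image_usd * 100  # ~0.167, i.e. 1/6 of a cent
```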
I also offer dreambooth model training for around $2-$4 / model, as well as inference on custom dreambooth models. Inference on custom models is where things get a little tricky: if users are using different models and you're loading up new models all the time just to generate 6 images, then that quickly becomes the majority of the workload, drastically pushing up the inference cost. I haven't solved this problem yet. If you have any great ideas, feel free to email me (email in profile)!
"Finally, a vital question is, how will this affect today’s working artists? Here the answer is not so optimistic."
I have a different take on this. I think this technology will allow more people, not fewer, to make a living (so, professionally) in a visual-arts-related industry. So I'm broadening the field to include not just "artists" but "commercial art" as well (designers, commercial illustrators, video/film post-production, etc.).
The reason is that it lowers the barrier to entry for these fields and automates away a lot of the labor-intensive work, thereby lowering the cost of production.
Whenever something becomes cheaper (in this case, labor for art), its consumption increases. So in the future, because producing commercial art is so much cheaper, it will be consumed a lot more.
At the same time, we're not at the point where we can actually remove humans entirely from the process. AI generated art is a different process and requires a different skillset, but it still requires skill and learning to do well.
The analogy would be something like a word processor reducing the number of secretaries needed in the workforce, but increasing the number of office workers. People no longer need someone to take notes / dictation, but all kinds of new workflows emerged on top of the technology, and almost all office workers need to know how to use something like a word processor.
Therefore, the opportunity here is to build tooling that makes it easier and more accessible for more people to work with AI image generation.
Disclaimer: I'm doing exactly that (building tooling to make content generation easier and more accessible) with https://synapticpaint.com/
> Whenever something becomes cheaper (in this case, labor for art), its consumption increases. So in the future, because producing commercial art is so much cheaper, it will be consumed a lot more.
I'm not sure how that would apply here. There's never been a shortage of art. Art has always had more supply than demand, and now we've just added even more supply to saturate the market. I was previously a more likely client for an artist than I am now, when I can get my computer to spit out any image I want in like 30 seconds. But I have no more desire for art than I did before.
> Whenever something becomes cheaper (in this case, labor for art), its consumption increases. So in the future, because producing commercial art is so much cheaper, it will be consumed a lot more.
I have the opposite view. With a lower barrier to entry, the market will get oversaturated and overproduced, and consumers will suffer from content fatigue, leading to less interest in AI-generated media as a whole.
An analogy is luxury goods: reducing the price of luxury goods decreases demand for them.
It'd be great if that were made easier so that more folks could participate/make a living. Then again, I think that every time that was made simpler (e.g. Flash, Dreamweaver's export to HTML, jQuery, ...) it has resulted in a slew of crap.
So: the lower the barrier to entry, the more actual skill/artistry becomes important for a high quality result.
Phrased differently: once the drudge work becomes mechanised, the concept of quality is lifted to a new level. This highlights aspects that used to be stuck in the mud of the drudge work, enabling a more profound understanding... by those with the necessary skills to do so.
> I think this technology will allow more people, not less, to make money as a living (so, professionally) in a visual arts related industry.
> ...
> Whenever something becomes cheaper (in this case, labor for art), its consumption increases.
But not its price, and definitely not the compensation for the labour to produce it.
Making a living as a mediocre-to-good artist is already incredibly difficult; increasing the supply of poor-to-good artists through AI-assistance isn't going to make it any easier.
> The analogy would be something like a word processor reducing the number of secretaries needed in the workforce, but increasing the number of office workers.
Only if the word processor wrote documents without the assistance of a typist, or an author.
NFTs notwithstanding, people paying big money for art aren't usually just "buying" .JPG files. There's often some physical component, sometimes quite large, heavy and difficult to reproduce. Oil paintings, castings, mixed-media pieces larger than a car, etc.
These sorts of things are less likely to be subject to someone typing a command into an AI generator program. AI stuff will have an impact on grunt work, and will be used to come up with crazy ideas when artists are stuck, and there will be some people who make a lot of money with a line of text and some software. But overall, the art world values uniqueness, the human touch, or at least the human conceit that goes into "computer art" or "machine art".
Art is only worth what people (real live humans) will pay for it. The NFT market has finally started to crater as people have realized there's nothing there there.
"Working artists" (lol) are gonna be just as subservient with or without 2D typewriters, because the "creative industries" are to art basically what porn is to sex. (Anglophones have this linguistic quirk where half of the time "art" is synonymous with "graphics", which to other languages sounds like trying to tie your shoelaces with one hand fused to your face.)
Truth be told, I'm yet to see an AI-generated or AI-assisted work that provokes any internal experience other than thinking "hey, cool pixels/sounds/sentences/whatever". Not even the usual "somebody paid a lot of money to make this thing, so I better pay attention" which is the official function of commercial art.
It's certainly a very interesting academic exercise, and possibly a lucrative line of business. But if the value proposition is supposed to be "now it's easier for more people to create more complex stuff" (i.e. operate at a higher level of abstraction), why do I keep finding many "manually produced" works that speak to me on some level, even among the shitfountain of mass culture - while no AI-made thing has yet got me thinking anything other than "hey, cool tech?"
I'll start worrying when an AI chooses to ignore its incentives and, instead of doing something more rewarding, goes on to create a work of art just because. I'll also start worrying when people start becoming unreceptive to non-AI art, which I think is more likely to happen within our lifetimes.
Because what I see here is some cool tech for creating some ersatz sensory stimuli. And obviously, with enough compute you can make this advanced enough to confidently supplant the previous generation of tech for creating ersatz sensory stimuli. Somewhat less obviously, this reduces the risk of a spontaneous transcendental experience when perceiving AI art to a safe margin. Which is... probably good for business somehow?
At the end of the day, the human sensory system has a finite complexity, so you can create more and more compelling simulacra. Have at it. What'll happen is the next generation of humans will grow up awfully prone to Wile E. Coyote moments - tragic as well as comic.
>I'm yet to see an AI-generated or AI-assisted work that provokes any internal experience other than thinking "hey, cool pixels/sounds/sentences/whatever"
Maybe because you knew it was AI generated before you looked at it?
Or because I knew it was AI generated when I looked at it. Either way, that shouldn't be the differentiator.
An artwork is judged by how it exists within the totality of the context. We can consider "art" in the broadest sense, as "artifice" - and from that standpoint the algos themselves can be seen as staggering works of art in their own right. But on the other hand if we view the algos strictly as tools, and only consider individual pieces of content that are created through them, so far it's been only "meh" with a very occasional "hmm".
A similar example is the audio gear scene. There are many electromusical devices that are true works of art, worthy of historical study even -- then someone picks them up and starts making bleeps and bloops that may have some novelty value but don't create a lasting impression.
I try to keep an open mind - even Facebook can be seen as AI-facilitated performance art evoking a profound feeling of dread, but that's a much broader context.
I don't think the profile pic/avatar space is going to be very profitable. As you can see from this post, there is quite a bit of competition already. Because this is not that hard to set up, there is nothing stopping someone from charging just the cost of production, which is essentially what I'm doing (I added a small amount of overhead as a buffer for non-GPU compute, and in case I messed up my math -- math is hard). That's why I'm focused on building for other use cases instead (making content creation more accessible).