
> Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected

To temper this a bit, you may want to pay close attention to the demo videos. The player rarely backtracks, and for good reason - the few times the character does turn around and look back at something a second time, it has changed significantly (the most noticeable I think is the room with the grey wall and triangle sign).

This falls in line with how we'd expect a diffusion model to behave - it's trained on many billions of frames of gameplay, so it's very good at generating a plausible -next- frame of gameplay based on some previous frames. But it doesn't deeply understand logical gameplay constraints, like remembering level geometry.
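
To make that memory limit concrete, here is a rough sketch of the kind of autoregressive sampling loop being described - the names (sample_next_frame, the window size) are invented for illustration, not taken from the paper:

    # Rough illustration only: the model conditions on a short rolling window of
    # past frames plus the current action. Anything that scrolls out of the
    # window has to be re-imagined, which is why revisited geometry drifts.
    from collections import deque

    HISTORY = 64  # hypothetical context length in frames

    def play(model, seed_frames, get_action, steps=1000):
        frames = deque(seed_frames, maxlen=HISTORY)   # rolling context window
        for _ in range(steps):
            action = get_action()                     # player input this tick
            nxt = model.sample_next_frame(list(frames), action)
            frames.append(nxt)
            yield nxt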



Great observation. And not entirely unlike normal human visual perception which is notoriously vulnerable to missing highly salient information; I'm reminded of the "gorillas in our midst" work by Dan Simons and Christopher Chabris [0].

[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...


It reminds me of dreaming. When you do something and turn back to check, it has turned into something completely different.

edit: someone should train it on MyHouse.wad


Not noticing a gorilla that ‘shouldn’t’ be there is not the same thing as object permanence. Even quite young babies are surprised by objects that go missing.


That's absolutely true. It's also well-established by Simons et al. and others that healthy normal adults maintain only a very sparse visual representation of their surroundings, anchored but not perfectly predicted by attention, and this drives the unattended gorilla phenomenon (along with many others). I don't work in this domain, but I would suggest that object permanence probably starts with attending and perceiving an object, whereas the inattentional or change blindness phenomena mostly (but not exclusively) occur when an object is not attended (or only briefly attended) or attention is divided by some competing task.


Are you saying if I turn around, I’ll be surprised at what I find ? I don’t feel like this is accurate at all.


Not exactly, but our representation of what's behind us is a lot more sparse than we would assume. That is, I might not be surprised by what I see when I turn around, but it could have changed pretty radically since I last looked, and I might not notice. In fact, an observer might be quite surprised that I missed the change.

Objectively, Simons and Chabris (and many others) have a lot of data to support these ideas. Subjectively, I can say that these types of tasks (inattentional blindness, change blindness, etc.) are humbling.


Well, it's a bit of a spoiler to encounter this video in this context, but this is a very good video: https://www.youtube.com/watch?v=LRFMuGBP15U

Even having a clue why I'm linking this, I virtually guarantee you won't catch everything.

And even if you do catch everything... the real thing to notice is that you had to look. Your brain does not flag these things naturally. Dreams are notorious for this sort of thing, but even in the waking world your model of the world is much less rich than you think. Magic tricks like to hide in this space, for instance.


Yup, great example! Simons's lab has done some things along exactly these lines [0], too.

[0]: https://www.youtube.com/watch?v=wBoMjORwA-4


The opposite - if you turn around and there's something that wasn't there the last time - you'll likely not notice if it's not out of place. You'll just assume it was there and you weren't paying attention.

We don't memorize things that the environment remembers for us if they aren't relevant for other reasons.


We also don't just work on images. We work on a lot of sensory data. So I think images of the environment are just one part of it.


If a generic human glances at an unfamiliar screen/wall/room, can they accurately, pixel-perfectly reconstruct every single element of it? Can they do it for every single screen they have seen in their entire lives?


I never said pixel perfect, but I would be surprised if whole objects, like flaming lanterns, suddenly appeared.

What this demo demonstrates to me is how incredibly willing we are to accept what seems familiar to us as accurate.

I bet if you look closely and objectively you will see even more anomalies. But at first watch, I didn’t see most errors because I think accepting something is more efficient for the brain.


You'd likely be surprised by a flaming lantern unless you were in Flaming Lanterns 'R Us, but if you were watching a video of a card trick and the two participants changed clothes while the camera wasn't focused on them, you may well miss that and the other five changes that came with that.


Work which exaggerates the blindness.

The people were told to focus very deeply on a certain aspect of the scene. Maintaining that focus means explicitly blocking things not related to that focus. Also, there is social pressure at the end to have performed well at the task; evaluating them on a task which is intentionally completely different from the one explicitly given is going to bias people away from reporting gorillas.

And also, "notice anything unusual" is a pretty vague prompt. No-one in the video thought the gorillas were unusual, so if the PEOPLE IN THE SCENE thought gorillas were normal, why would I think they were strange? Look at any TV show, they are all full of things which are pretty crazy unusual in normal life, yet not unusual in terms of the plot.

Why would you think the gorillas were unusual?


I understand what you mean. I believe that the authors would contend that what you're describing is a typical attentional state for an awake/aware human: focused mostly on one thing, and with surprisingly little awareness of most other things (until/unless they are in turn attended).

Furthermore, even what we attend to isn't always represented with all that much detail. Simons has a whole series of cool demonstration experiments where they show that they can swap out someone you're speaking with (an unfamiliar conversational partner like a store clerk or someone asking for directions), and you may not even notice [0]. It's rather eerie.

[0]: https://www.youtube.com/watch?v=FWSxSQsspiQ&t=5s


Does that work on autistic people? Having no filters, or fewer filters, should allow them to be more efficient "on guard duty", looking for unexpected things.


I saw a longer video of this that Ethan Mollick posted, and in that one the sequences are longer and they do appear to demonstrate a fair amount of consistency. The clips don't backtrack in the summary video on the paper's home page because they're showing a number of distinct environments, but you only get a few seconds of each.

If I studied the longer one more closely, I'm sure inconsistencies would be seen, but it seemed able to recall the presence/absence of destroyed items, dead monsters, etc. on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.


Yes it definitely is very good for simulating gameplay footage, don't get me wrong. Its input for predicting the next frame is not just the previous frame, it has access to a whole sequence of prior frames.

But to say the model is simulating actual gameplay (i.e. that a person could actually play Doom in this) is far-fetched. It's definitely great that the model was able to remember that the gray wall was still there after we turned around, but it's untenable for actual gameplay that the wall completely changed location and orientation.


> it's untenable for actual gameplay that the wall completely changed location and orientation.

It would in an SCP-themed game. Or a dreamscape/Inception-themed one.

Hell, "you're trapped in a Doom-like dreamscape, escape before you lose your mind" is a very interesting pitch for a game. Basically take this Doom thing and make walking through a specific, unique-looking doorway from the original game the victory condition - the player's job would be to coerce the model into generating it, while also not dying in the Doom fever dream game itself. I'd play the hell out of this.

(Implementation-wise, just loop in a simple recognition model to continuously evaluate the victory condition from the last few frames, and some OCR to detect when the player's hit-point indicator on the HUD drops to zero.)
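
A minimal sketch of that outer loop, assuming a hypothetical doorway recognizer and an OCR helper for the HUD health counter (the names and the crop coordinates are made up):

    # Everything here is hypothetical: recognizer, read_health_ocr, and the
    # HUD crop coordinates are stand-ins, not a real API.
    from collections import deque

    RECENT = 8                               # frames the recognizer looks at
    HEALTH_BOX = (52, 186, 92, 198)          # made-up crop of the HUD health counter

    def referee(frame_stream, recognizer, read_health_ocr):
        recent = deque(maxlen=RECENT)
        for frame in frame_stream:
            recent.append(frame)
            if recognizer.sees_exit_doorway(list(recent)):
                return "victory"             # player coerced the dream into the doorway
            health = read_health_ocr(frame.crop(HEALTH_BOX))
            if health is not None and health <= 0:
                return "defeat"              # HUD hit points dropped to zero
        return "timeout"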

(I'll happily pay $100 this year to the first project that gets this to work. I bet I'm not the only one. Doesn't have to be Doom specifically, just has to be interesting.)


Check out the actual modern DOOM WAD MyHouse which implements these ideas. It totally breaks our preconceptions of what the DOOM engine is capable of.

https://en.wikipedia.org/wiki/MyHouse.wad


MyHouse is excellent, but it mostly breaks our perception of what the Doom engine is capable of by not really using the Doom engine. It leans heavily on engine features which were embellishments by the GZDoom project, and never existed in the original Doom codebase.


To be honest, I agree! That would be an interesting gameplay concept for sure.

Mainly just wanted to temper expectations I'm seeing throughout this thread that the model is actually simulating Doom. I don't know what will be required to get from here to there, but we're definitely not there yet.


Or what about training the model on many FPS games? Surviving in one nightmare that morphs into another, into another, into another ...


What you're pointing at mirrors the same kind of limitation in using LLMs for role-play/interactive fictions.


Maybe a hybrid approach would work. Certain things like inventory being stored as variables, lists etc.

Wouldn't be as pure though.
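
As a very rough sketch of how the hybrid could look - the explicit state lives in ordinary variables and is passed to the image model as extra conditioning; GameState and sample_frame are assumptions, not a real interface:

    # "Hard" state (inventory, health, keys) is tracked outside the network and
    # fed in as conditioning, so the generated pixels can't silently contradict it.
    from dataclasses import dataclass, field

    @dataclass
    class GameState:
        health: int = 100
        ammo: int = 50
        keys: list = field(default_factory=list)

    def step(model, frames, action, state: GameState):
        if action == "pickup_blue_key":       # game rules applied in plain code
            state.keys.append("blue")
        # Render conditioned on both recent pixels and the authoritative state.
        return model.sample_frame(frames, action, state_vector=vars(state))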


Give it state by having a rendered-but-offscreen pixel area that's fed back in as byte data for the next frame.


Huh.

Fun variant: give it hidden state by doing the offscreen scratch pixel buffer thing, but not grading its content in training. Train the model as before, grading on the "onscreen" output, and let it keep the side channel to do what it wants with. It'd be interesting to see what way it would use it, what data it would store, and how it would be encoded.
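
Sketched out, that variant could look roughly like this; the shapes and the model call are assumptions, not anything from the paper:

    # The side channel ("scratch") is produced and consumed by the model but is
    # deliberately never graded, so training only constrains the visible frame.
    import torch

    def training_step(model, past_frames, past_scratch, action, target_frame):
        visible, scratch = model(past_frames, past_scratch, action)
        loss = torch.nn.functional.mse_loss(visible, target_frame)
        return loss, scratch.detach()   # recycled as input on the next step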


It's an empirical question, right? But they didn't do it...


But does it need to be frame-based?

What if you combined this with an engine running in parallel that provides all geometry, including characters and objects with their respective behaviors, recording the changes made through the interactions the other model generates, and talking back to it?

A dialogue between two parties with different functionality so to speak.

(Non technical person here - just fantasizing)


In that scheme what is the NN providing that a classical renderer would not? DOOM ran great on an Intel 486, which is not a lot of computer.


> DOOM ran great on an Intel 486

It always blew my mind how well it worked on a 33 MHz 486. I'm fairly sure it ran at 30 fps in 320x200. That gives it just over 17 clock cycles per pixel, and that doesn't even include time for game logic.

My memory could be wrong, but even if it required a 66 MHz chip to reach 30 fps, that's still only 34 clocks per pixel on an architecture that required multiple clocks for a simple integer add instruction.
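
The back-of-the-envelope math, for anyone who wants to check it:

    pixels_per_frame = 320 * 200                    # 64,000 pixels
    fps = 30
    print(33_000_000 / (fps * pixels_per_frame))    # ~17.2 clocks/pixel at 33 MHz
    print(66_000_000 / (fps * pixels_per_frame))    # ~34.4 clocks/pixel at 66 MHz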


An experience that isn’t asset- but rule-based.


In that case, the title of the article wouldn’t be true anymore. It seems like a better plan, though.


What would the model provide if not what we see on the screen?


The environment and everything in it.

“Everything” would mean all objects and the elements they’re made of, their rules on how they interact and decay.

A modularized ecosystem, I guess, comprised of “sub-systems” of sorts.

The other model, that provides all interaction (cause for effect) could either be run artificially or be used interactively by a human - opening up the possibility for being a tree : )

This all would need an interfacing agent that in principle would be an engine simulating the second law of thermodynamics and at the same time recording every state that has changed and diverged off the driving actor’s vector in time.

Basically the “effects” model keeping track of everyone's history.

In the end, a system with an “everything” model (that can grow over time), a “cause” model messing with it, brought together and documented by the “effect” model.

(Again … non technical person, just fantasizing) : )


For instance, for a generated real world RPG, one process could create the planet, one could create the city where the player starts, one could create the NPCs, one could then model the relationships of the npcs with each other. Each one building off of the other so that the whole thing feels nuanced and more real.

Repeat for quest lines, new cities, etc, with the npcs having real time dialogue and interactions that happen entirely off screen, no guarantee of there being a massive quest objective, and some sort of recorder of events that keeps a running tally of everything that goes on so that as the PCs interact with it they are never repeating the same dreary thing.

If this were a MMORPG it would require so much processing and architecting, but it would have the potential to be the greatest game in human history.


What you’re asking for doesn’t make sense.


So you're basically just talking about upgrading "enemy AI" to a more complex form of AI :)


That is kind of cool though, I would play like being lost in a dream.

If on the backend you could record the level layouts in memory you could have exploration teams that try to find new areas to explore.


It would be cool for dream sequences in games to feel more like dreams. This is probably an expensive way to do it, but it would be neat!


Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.


It is like some kind of weird dream Doom.


Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.


So for the next iteration, they should add a minimap overlay (perhaps on a side channel) - it should help the model give more consistent output in any given location. Right now, the game is very much like a lucid dream - the universe makes sense from moment to moment, but without an outside reference, everything that falls out of short-term memory (a few frames here) gets reimagined.
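
As a rough sketch of what that side channel could be - an explicit occupancy map maintained outside the model and fed in as an extra conditioning input each frame (every name here is hypothetical):

    # The map is ordinary state the network can't forget; a crop around the
    # player's estimated position rides along as an extra conditioning channel.
    import numpy as np

    MAP_SIZE = 256

    class Minimap:
        def __init__(self):
            self.grid = np.zeros((MAP_SIZE, MAP_SIZE), dtype=np.uint8)

        def mark(self, x, y):
            self.grid[y % MAP_SIZE, x % MAP_SIZE] = 255   # "we've been here"

        def crop(self, x, y, r=32):
            return self.grid[max(0, y - r):y + r, max(0, x - r):x + r]

    def step(model, frames, action, minimap, player_xy):
        minimap.mark(*player_xy)
        side = minimap.crop(*player_xy)
        return model.sample_next_frame(frames, action, minimap=side)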


There's an example right at the beginning too - the ammo drop on the right changes to something green (I think that's a body?)


I don't see this as something that would be hard to overcome. Sora for instance has already shown the ability for a diffusion model to maintain object permanence. Flux recently too has shown the ability to render the same person in many different poses or images.


Where does a sora video turn around backwards? I can’t maintain such consistency in my own dreams.


I don't know of an example (not to say it doesn't exist) but the problem is fundamentally the same as things moving out of sight/out of frame and coming back again.


> the problem is fundamentally the same as things moving out of sight/out of frame and coming back again

Maybe it is, but doing that with the entire scene instead of just a small part of it makes the problem massively harder, as the model needs to grow exponentially to remember more things. It isn't something that we will manage anytime soon, maybe 10-20 years with current architecture and same compute progress.

Then you make that even harder by remembering a whole game level? No, ain't gonna happen in our lifetimes without massive changes to the architecture. They would need to make a different model keep track of level state etc, not just an image to image model.


10 to 20 years sounds wildly pessimistic

In this Sora video the dragon covers half the scene, and it's basically identical when it is revealed again ~5 seconds later, or about 150 frames later. There is lots of evidence (and some studies) that these models are in fact building internal world models.

https://www.youtube.com/watch?v=LXJ-yLiktDU

Buckle in, the train is moving way faster. I don't think there would be much surprise if this is solved in the next few generations of video generators. The first generation is already doing very well.


Did you watch the video? It is completely different after the dragon goes past. There's still a flag there, but everything else changed. Even the stores in the background changed, and the mass of people is completely different, with no hint of anyone having moved there, etc.

You always get this from AI enthusiasts: they come and post "proof" that disproves their own point.


I'm not GP, but going over that video I'm actually having a hard time finding any detail that is present before the dragon obscures it which doesn't either exit frame right when the camera pans slightly left near the end, or reappear with reasonably crisp detail after the dragon gets out of the way.

Most of the mob of people are indistinct, but there is a woman in a lime green coat who is visible, is then obstructed by the dragon twice (beard and ribbon), and reappears fine. Unfortunately, by the time the dragon fully moves past she has been lost to frame right.

There is another person in black holding a red satchel which is visible both before and after the dragon has passed.

Nothing about the storefronts appears to change. The complex sign full of Chinese text (which might be gibberish: it's highly stylized and I don't know Chinese) appears to survive the dragon passing without even any changes to the individual ideograms.

There is also a red box shaped like a Chinese paper lantern, with a single gold ideogram on it, at the store entrance; it spends most of the video obscured by the dragon and is still in the same location after the dragon passes (and though video artifacting makes it more challenging to verify that the ideogram is unchanged, it certainly does not appear substantially different).

What detail are you seeing that is different before and after the obstruction?


> What detail are you seeing that is different before and after the obstruction?

In the first frame, there's a guy in a blue hat next to a flag. That flag and the guy are then gone afterwards.

The two flags near the wall are gone; there is something triangular there, but there were two flags before the dragon went past.

Then there's the crowd: it is 6 people deep after the dragon went past, while it was just 4 people deep before - it is way more crowded.

Instead of the flag that was there before the dragon, it put in two more flags afterwards, far more to the left.

Around the third second, a guy was out of frame for a few frames and suddenly gained a blue scarf. After the dragon went by, he turned into a woman. Next to that person was a guy with a blue cap; he completely disappears.

> Most of the mob of people are indistinct

No they aren't; they are mostly distinct, and basically all of them change. If you ignore that the entire mob totally changes in number, appearance, and position, sure, it is pretty good - except it forgot the flags. But how can you ignore the mob when we talk about the model remembering details? The wall is much less information dense than the mob, so it is much easier for the model to remember; the difficulty is in the mob.

> but there is a woman in a lime green coat who is visible,

She was just out of frame for a fraction of a second, not for the big stretch where the dragon moves past. The guy in a blue jacket and blue cap behind her disappears, though - or merges with another person and becomes a woman with a muffler after the dragon moved past.

So, in the end, some big strokes were kept, but that was a very tiny part of the image that was there both before and after the dragon moved past, so it was far from a whole image with full details. Almost all details are wrong.

Maybe he meant that the house looked mostly the same; I agree the upper parts do, but I looked at the windows and they were completely different - they are full of people's heads after the dragon moves past, while before there were just clean walls.


We are looking at first generation tech and pretty much every human would recognize the "before dragon" scene as being the same place as the "after dragon" scene. The prominent features are present. The model clearly shows the ability to go beyond "image-to-image" rendering.

If you want to be right because you can find any difference. Sure. You win. But also completely missed the point.


> pretty much every human would recognize the "before dragon" scene as being the same place as the "after dragon" scene

Not in a game where those were enemies: it completely changed what they are and how many there are. People would notice such a massive change instantly if they looked away and suddenly there were 50% more enemies.

> The model clearly shows the ability to go beyond "image-to-image" rendering.

I never argued against that. Adding a third dimension (time) makes generating a video the same kind of problem as generating an image: it is not harder to draw a straight pencil with something covering part of it than it is to draw a scene with something covering it for a while.

But still, even though it is that simple, these models are really bad at it, because it requires very large models and a lot of compute. So I just extrapolated from their current abilities, which you demonstrated there, to say roughly how long it will be until we can even have consistent short videos.

Note that video won't have the same progression as images: the early image models were very small and we quickly scaled them up, while for video we are starting with really scaled-up models and have to wait until compute gets cheaper/faster the slow way.

> But also completely missed the point.

You completely missed my point, or you changed your point afterwards. My point was that current models can only remember little bits under such circumstances, and to remember a whole scene they need to be massively larger. Almost all details in the scene you showed were missed; the large strokes are there, but to keep the details around you need an exponentially larger model.


Where does a sora video turn around backwards? I don’t even maintain such consistency in my dreams.


You can also notice in the first part of the video the ammo numbers fluctuate a bit randomly.


is that something that can be solved with more memory/attention/context?

or do we believe it's an inherent limitation in the approach?


I think the real question is does the player get shot from behind?


great question

tangentially related but Grand Theft Auto speedrunners often point the camera behind them while driving so cars don't spawn "behind" them (aka in front of the car)



