
> Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected

To temper this a bit, you may want to pay close attention to the demo videos. The player rarely backtracks, and for good reason - the few times the character does turn around and look back at something a second time, it has changed significantly (the most noticeable I think is the room with the grey wall and triangle sign).

This falls in line with how we'd expect a diffusion model to behave - it's trained on many billions of frames of gameplay, so it's very good at generating a plausible -next- frame of gameplay based on some previous frames. But it doesn't deeply understand logical gameplay constraints, like remembering level geometry.
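
To make that memory limit concrete, here is a rough sketch of the kind of autoregressive sampling loop being described - the names (sample_next_frame, the window size) are invented for illustration, not taken from the paper:

    # Rough illustration only: the model conditions on a short rolling window of
    # past frames plus the current action. Anything that scrolls out of the
    # window has to be re-imagined, which is why revisited geometry drifts.
    from collections import deque

    HISTORY = 64  # hypothetical context length in frames

    def play(model, seed_frames, get_action, steps=1000):
        frames = deque(seed_frames, maxlen=HISTORY)   # rolling context window
        for _ in range(steps):
            action = get_action()                     # player input this tick
            nxt = model.sample_next_frame(list(frames), action)
            frames.append(nxt)
            yield nxt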



Great observation. And not entirely unlike normal human visual perception which is notoriously vulnerable to missing highly salient information; I'm reminded of the "gorillas in our midst" work by Dan Simons and Christopher Chabris [0].

[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...


It reminds me of dreaming. When you do something and turn back to check, it has turned into something completely different.

edit: someone should train it on MyHouse.wad


Not noticing a gorilla that ‘shouldn’t’ be there is not the same thing as object permanence. Even quite young babies are surprised by objects that go missing.


That's absolutely true. It's also well-established by Simons et al. and others that healthy normal adults maintain only a very sparse visual representation of their surroundings, anchored but not perfectly predicted by attention, and this drives the unattended gorilla phenomenon (along with many others). I don't work in this domain, but I would suggest that object permanence probably starts with attending and perceiving an object, whereas the inattentional or change blindness phenomena mostly (but not exclusively) occur when an object is not attended (or only briefly attended) or attention is divided by some competing task.


Are you saying if I turn around, I’ll be surprised at what I find ? I don’t feel like this is accurate at all.


Not exactly, but our representation of what's behind us is a lot more sparse than we would assume. That is, I might not be surprised by what I see when I turn around, but it could have changed pretty radically since I last looked, and I might not notice. In fact, an observer might be quite surprised that I missed the change.

Objectively, Simons and Chabris (and many others) have a lot of data to support these ideas. Subjectively, I can say that these types of tasks (inattentional blindness, change blindness, etc.) are humbling.


Well, it's a bit of a spoiler to encounter this video in this context, but this is a very good video: https://www.youtube.com/watch?v=LRFMuGBP15U

Even having a clue why I'm linking this, I virtually guarantee you won't catch everything.

And even if you do catch everything... the real thing to notice is that you had to look. Your brain does not flag these things naturally. Dreams are notorious for this sort of thing, but even in the waking world your model of the world is much less rich than you think. Magic tricks like to hide in this space, for instance.


Yup, great example! Simons's lab has done some things along exactly these lines [0], too.

[0]: https://www.youtube.com/watch?v=wBoMjORwA-4


The opposite - if you turn around and there's something that wasn't there the last time - you'll likely not notice if it's not out of place. You'll just assume it was there and you weren't paying attention.

We don't memorize things that the environment remembers for us if they aren't relevant for other reasons.


We also don't just work on images. We work on a lot of sensory data. So I think images of the environment are just one part of it.


If a generic human glances at an unfamiliar screen/wall/room, can they accurately, pixel-perfectly reconstruct every single element of it? Can they do it for every single screen they have seen in their entire lives?


I never said pixel perfect, but I would be surprised if whole objects, like flaming lanterns, suddenly appeared.

What this demo demonstrates to me is how incredibly willing we are to accept what seems familiar to us as accurate.

I bet if you look closely and objectively you will see even more anomalies. But at first watch, I didn’t see most errors because I think accepting something is more efficient for the brain.


You'd likely be surprised by a flaming lantern unless you were in Flaming Lanterns 'R Us, but if you were watching a video of a card trick and the two participants changed clothes while the camera wasn't focused on them, you may well miss that and the other five changes that came with that.


Work which exaggerates the blindness.

The people were told to focus very deeply on a certain aspect of the scene. Maintaining that focus means explicitly blocking things not related to that focus. Also, there is social pressure at the end to have performed well at the task; evaluating them on a task which is intentionally completely different from the one explicitly given is going to bias people away from reporting gorillas.

And also, "notice anything unusual" is a pretty vague prompt. No-one in the video thought the gorillas were unusual, so if the PEOPLE IN THE SCENE thought gorillas were normal, why would I think they were strange? Look at any TV show, they are all full of things which are pretty crazy unusual in normal life, yet not unusual in terms of the plot.

Why would you think the gorillas were unusual?


I understand what you mean. I believe that the authors would contend that what you're describing is a typical attentional state for an awake/aware human: focused mostly on one thing, and with surprisingly little awareness of most other things (until/unless they are in turn attended).

Furthermore, even what we attend to isn't always represented with all that much detail. Simons has a whole series of cool demonstration experiments where they show that they can swap out someone you're speaking with (an unfamiliar conversational partner like a store clerk or someone asking for directions), and you may not even notice [0]. It's rather eerie.

[0]: https://www.youtube.com/watch?v=FWSxSQsspiQ&t=5s


Does that work on autistic people? Having no filters, or fewer filters, should allow them to be more efficient "on guard duty", looking for unexpected things.


I saw a longer video of this that Ethan Mollick posted, and in that one the sequences are longer and they do appear to demonstrate a fair amount of consistency. The clips don't backtrack in the summary video on the paper's home page because they're showing a number of distinct environments, but you only get a few seconds of each.

If I studied the longer one more closely, I'm sure inconsistencies would be seen, but it seemed able to recall the presence/absence of destroyed items, dead monsters, etc. on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.


Yes it definitely is very good for simulating gameplay footage, don't get me wrong. Its input for predicting the next frame is not just the previous frame, it has access to a whole sequence of prior frames.

But to say the model is simulating actual gameplay (i.e. that a person could actually play Doom in this) is far-fetched. It's definitely great that the model was able to remember that the gray wall was still there after we turned around, but it's untenable for actual gameplay that the wall completely changed location and orientation.


> it's untenable for actual gameplay that the wall completely changed location and orientation.

It would in an SCP-themed game. Or a dreamscape/Inception-themed one.

Hell, "you're trapped in a Doom-like dreamscape, escape before you lose your mind" is a very interesting pitch for a game. Basically take this Doom thing and make walking through a specific, unique-looking doorway from the original game the victory condition - the player's job would be to coerce the model into generating it, while also not dying in the Doom fever dream game itself. I'd play the hell out of this.

(Implementation-wise, just loop in a simple recognition model to continuously evaluate the victory condition from the last few frames, and some OCR to detect when the player's hit-point indicator on the HUD drops to zero.)
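
A minimal sketch of that outer loop, assuming a hypothetical doorway recognizer and an OCR helper for the HUD health counter (the names and the crop coordinates are made up):

    # Everything here is hypothetical: recognizer, read_health_ocr, and the
    # HUD crop coordinates are stand-ins, not a real API.
    from collections import deque

    RECENT = 8                               # frames the recognizer looks at
    HEALTH_BOX = (52, 186, 92, 198)          # made-up crop of the HUD health counter

    def referee(frame_stream, recognizer, read_health_ocr):
        recent = deque(maxlen=RECENT)
        for frame in frame_stream:
            recent.append(frame)
            if recognizer.sees_exit_doorway(list(recent)):
                return "victory"             # player coerced the dream into the doorway
            health = read_health_ocr(frame.crop(HEALTH_BOX))
            if health is not None and health <= 0:
                return "defeat"              # HUD hit points dropped to zero
        return "timeout"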

(I'll happily pay $100 this year to the first project that gets this to work. I bet I'm not the only one. Doesn't have to be Doom specifically, just has to be interesting.)


Check out the actual modern DOOM WAD MyHouse which implements these ideas. It totally breaks our preconceptions of what the DOOM engine is capable of.

https://en.wikipedia.org/wiki/MyHouse.wad


MyHouse is excellent, but it mostly breaks our perception of what the Doom engine is capable of by not really using the Doom engine. It leans heavily on engine features which were embellishments by the GZDoom project, and never existed in the original Doom codebase.


To be honest, I agree! That would be an interesting gameplay concept for sure.

Mainly just wanted to temper expectations I'm seeing throughout this thread that the model is actually simulating Doom. I don't know what will be required to get from here to there, but we're definitely not there yet.


Or what about training the model on many FPS games? Surviving in one nightmare that morphs into another, into another, into another ...


What you're pointing at mirrors the same kind of limitation in using LLMs for role-play/interactive fictions.


Maybe a hybrid approach would work. Certain things like inventory being stored as variables, lists etc.

Wouldn't be as pure though.
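
As a very rough sketch of how the hybrid could look - the explicit state lives in ordinary variables and is passed to the image model as extra conditioning; GameState and sample_frame are assumptions, not a real interface:

    # "Hard" state (inventory, health, keys) is tracked outside the network and
    # fed in as conditioning, so the generated pixels can't silently contradict it.
    from dataclasses import dataclass, field

    @dataclass
    class GameState:
        health: int = 100
        ammo: int = 50
        keys: list = field(default_factory=list)

    def step(model, frames, action, state: GameState):
        if action == "pickup_blue_key":       # game rules applied in plain code
            state.keys.append("blue")
        # Render conditioned on both recent pixels and the authoritative state.
        return model.sample_frame(frames, action, state_vector=vars(state))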


Give it state by having a rendered-but-offscreen pixel area that's fed back in as byte data for the next frame.


Huh.

Fun variant: give it hidden state by doing the offscreen scratch pixel buffer thing, but not grading its content in training. Train the model as before, grading on the "onscreen" output, and let it keep the side channel to do what it wants with. It'd be interesting to see what way it would use it, what data it would store, and how it would be encoded.
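
Sketched out, that variant could look roughly like this; the shapes and the model call are assumptions, not anything from the paper:

    # The side channel ("scratch") is produced and consumed by the model but is
    # deliberately never graded, so training only constrains the visible frame.
    import torch

    def training_step(model, past_frames, past_scratch, action, target_frame):
        visible, scratch = model(past_frames, past_scratch, action)
        loss = torch.nn.functional.mse_loss(visible, target_frame)
        return loss, scratch.detach()   # recycled as input on the next step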


It's an empirical question, right? But they didn't do it...


But does it need to be frame-based?

What if you combined this with an engine running in parallel that provides all geometry, including characters and objects with their respective behaviors, recording the changes made through the interactions the other model generates, and talking back to it?

A dialogue between two parties with different functionality so to speak.

(Non technical person here - just fantasizing)


In that scheme what is the NN providing that a classical renderer would not? DOOM ran great on an Intel 486, which is not a lot of computer.


> DOOM ran great on an Intel 486

It always blew my mind how well it worked on a 33 MHz 486. I'm fairly sure it ran at 30 fps in 320x200. That gives it just over 17 clock cycles per pixel, and that doesn't even include time for game logic.

My memory could be wrong, but even if it required a 66 MHz chip to reach 30 fps, that's still only 34 clocks per pixel on an architecture that required multiple clocks for a simple integer add instruction.
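
The back-of-the-envelope math, for anyone who wants to check it:

    pixels_per_frame = 320 * 200                    # 64,000 pixels
    fps = 30
    print(33_000_000 / (fps * pixels_per_frame))    # ~17.2 clocks/pixel at 33 MHz
    print(66_000_000 / (fps * pixels_per_frame))    # ~34.4 clocks/pixel at 66 MHz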


An experience that isn’t asset- but rule-based.


In that case, the title of the article wouldn’t be true anymore. It seems like a better plan, though.


What would the model provide if not what we see on the screen?


The environment and everything in it.

“Everything” would mean all objects and the elements they’re made of, their rules on how they interact and decay.

A modularized ecosystem, I guess, comprised of “sub-systems” of sorts.

The other model, that provides all interaction (cause for effect) could either be run artificially or be used interactively by a human - opening up the possibility for being a tree : )

This all would need an interfacing agent that in principle would be an engine simulating the second law of thermodynamics and at the same time recording every state that has changed and diverged off the driving actor’s vector in time.

Basically the “effects” model keeping track of everyone's history.

In the end, a system with an “everything” model (that can grow over time), a “cause” model messing with it, brought together and documented by the “effect” model.

(Again … non technical person, just fantasizing) : )


For instance, for a generated real world RPG, one process could create the planet, one could create the city where the player starts, one could create the NPCs, one could then model the relationships of the npcs with each other. Each one building off of the other so that the whole thing feels nuanced and more real.

Repeat for quest lines, new cities, etc, with the npcs having real time dialogue and interactions that happen entirely off screen, no guarantee of there being a massive quest objective, and some sort of recorder of events that keeps a running tally of everything that goes on so that as the PCs interact with it they are never repeating the same dreary thing.

If this were a MMORPG it would require so much processing and architecting, but it would have the potential to be the greatest game in human history.


What you’re asking for doesn’t make sense.


So you're basically just talking about upgrading "enemy AI" to a more complex form of AI :)


That is kind of cool though, I would play like being lost in a dream.

If on the backend you could record the level layouts in memory you could have exploration teams that try to find new areas to explore.


It would be cool for dream sequences in games to feel more like dreams. This is probably an expensive way to do it, but it would be neat!


Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.


It is like some kind of weird dream Doom.


Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.


So for the next iteration, they should add a minimap overlay (perhaps on a side channel) - it should help the model give more consistent output in any given location. Right now, the game is very much like a lucid dream - the universe makes sense from moment to moment, but without an outside reference, everything that falls out of short-term memory (a few frames here) gets reimagined.
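
As a rough sketch of what that side channel could be - an explicit occupancy map maintained outside the model and fed in as an extra conditioning input each frame (every name here is hypothetical):

    # The map is ordinary state the network can't forget; a crop around the
    # player's estimated position rides along as an extra conditioning channel.
    import numpy as np

    MAP_SIZE = 256

    class Minimap:
        def __init__(self):
            self.grid = np.zeros((MAP_SIZE, MAP_SIZE), dtype=np.uint8)

        def mark(self, x, y):
            self.grid[y % MAP_SIZE, x % MAP_SIZE] = 255   # "we've been here"

        def crop(self, x, y, r=32):
            return self.grid[max(0, y - r):y + r, max(0, x - r):x + r]

    def step(model, frames, action, minimap, player_xy):
        minimap.mark(*player_xy)
        side = minimap.crop(*player_xy)
        return model.sample_next_frame(frames, action, minimap=side)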


There's an example right at the beginning too - the ammo drop on the right changes to something green (I think that's a body?)


I don't see this as something that would be hard to overcome. Sora for instance has already shown the ability for a diffusion model to maintain object permanence. Flux recently too has shown the ability to render the same person in many different poses or images.


Where does a sora video turn around backwards? I can’t maintain such consistency in my own dreams.


I don't know of an example (not to say it doesn't exist) but the problem is fundamentally the same as things moving out of sight/out of frame and coming back again.


> the problem is fundamentally the same as things moving out of sight/out of frame and coming back again

Maybe it is, but doing that with the entire scene instead of just a small part of it makes the problem massively harder, as the model needs to grow exponentially to remember more things. It isn't something that we will manage anytime soon, maybe 10-20 years with current architecture and same compute progress.

Then you make that even harder by remembering a whole game level? No, ain't gonna happen in our lifetimes without massive changes to the architecture. They would need to make a different model keep track of level state etc, not just an image to image model.


10 to 20 years sounds wildly pessimistic

In this Sora video the dragon covers half the scene, and it's basically identical when it is revealed again ~5 seconds later, or about 150 frames later. There is lots of evidence (and some studies) that these models are in fact building internal world models.

https://www.youtube.com/watch?v=LXJ-yLiktDU

Buckle in, the train is moving way faster. I don't think there would be much surprise if this is solved in the next few generations of video generators. The first generation is already doing very well.


Did you watch the video? It is completely different after the dragon goes past. There's still a flag there, but everything else changed. Even the stores in the background changed, and the mass of people is completely different, with no hint of anyone having moved there, etc.

You always get this from AI enthusiasts: they come and post "proof" that disproves their own point.


I'm not GP, but going over that video I'm actually having a hard time finding any detail that is present before the dragon obscures it which doesn't either exit frame right when the camera pans slightly left near the end, or reappear with reasonably crisp detail after the dragon gets out of the way.

Most of the mob of people are indistinct, but there is a woman in a lime green coat who is visible, is then obstructed by the dragon twice (beard and ribbon), and reappears fine. Unfortunately, by the time the dragon fully moves past she has been lost to frame right.

There is another person in black holding a red satchel which is visible both before and after the dragon has passed.

Nothing about the storefronts appears to change. The complex sign full of Chinese text (which might be gibberish: it's highly stylized and I don't know Chinese) appears to survive the dragon passing without even any changes to the individual ideograms.

There is also a red box shaped like a Chinese paper lantern, with a single gold ideogram on it, at the store entrance; it spends most of the video obscured by the dragon and is still in the same location after the dragon passes (and though video artifacting makes it more challenging to verify that the ideogram is unchanged, it certainly does not appear substantially different).

What detail are you seeing that is different before and after the obstruction?


> What detail are you seeing that is different before and after the obstruction?

In the first frame, there's a guy in a blue hat next to a flag. That flag and the guy are then gone afterwards.

The two flags near the wall are gone; there is something triangular there, but there were two flags before the dragon went past.

Then there's the crowd: it is 6 people deep after the dragon went past, while it was just 4 people deep before - it is way more crowded.

Instead of the flag that was there before the dragon, it put in two more flags afterwards, far more to the left.

Around the third second, a guy was out of frame for a few frames and suddenly gained a blue scarf. After the dragon went by, he turned into a woman. Next to that person was a guy with a blue cap; he completely disappears.

> Most of the mob of people are indistinct

No they aren't; they are mostly distinct, and basically all of them change. If you ignore that the entire mob totally changes in number, appearance, and position, sure, it is pretty good - except it forgot the flags. But how can you ignore the mob when we talk about the model remembering details? The wall is much less information dense than the mob, so it is much easier for the model to remember; the difficulty is in the mob.

> but there is a woman in a lime green coat who is visible,

She was just out of frame for a fraction of a second, not for the big stretch where the dragon moves past. The guy in a blue jacket and blue cap behind her disappears, though - or merges with another person and becomes a woman with a muffler after the dragon moved past.

So, in the end, some big strokes were kept, but that was a very tiny part of the image that was there both before and after the dragon moved past, so it was far from a whole image with full details. Almost all details are wrong.

Maybe he meant that the house looked mostly the same; I agree the upper parts do, but I looked at the windows and they were completely different - they are full of people's heads after the dragon moves past, while before there were just clean walls.


We are looking at first generation tech and pretty much every human would recognize the "before dragon" scene as being the same place as the "after dragon" scene. The prominent features are present. The model clearly shows the ability to go beyond "image-to-image" rendering.

If you want to be right because you can find any difference. Sure. You win. But also completely missed the point.


> pretty much every human would recognize the "before dragon" scene as being the same place as the "after dragon" scene

Not in a game where those were enemies: it completely changed what they are and how many there are. People would notice such a massive change instantly if they looked away and suddenly there were 50% more enemies.

> The model clearly shows the ability to go beyond "image-to-image" rendering.

I never argued against that. Adding a third dimension (time) makes generating a video the same kind of problem as generating an image: it is not harder to draw a straight pencil with something covering part of it than it is to draw a scene with something covering it for a while.

But still, even though it is that simple, these models are really bad at it, because it requires very large models and a lot of compute. So I just extrapolated from their current abilities, which you demonstrated there, to say roughly how long it will be until we can even have consistent short videos.

Note that video won't have the same progression as images: the early image models were very small and we quickly scaled them up, while for video we are starting with really scaled-up models and have to wait until compute gets cheaper/faster the slow way.

> But also completely missed the point.

You completely missed my point, or you changed your point afterwards. My point was that current models can only remember little bits under such circumstances, and to remember a whole scene they need to be massively larger. Almost all details in the scene you showed were missed; the large strokes are there, but to keep the details around you need an exponentially larger model.


Where does a sora video turn around backwards? I don’t even maintain such consistency in my dreams.


You can also notice in the first part of the video the ammo numbers fluctuate a bit randomly.


is that something that can be solved with more memory/attention/context?

or do we believe it's an inherent limitation in the approach?


I think the real question is does the player get shot from behind?


great question

tangentially related but Grand Theft Auto speedrunners often point the camera behind them while driving so cars don't spawn "behind" them (aka in front of the car)



