
> Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.

I can hardly believe this claim. Anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.



This. Watching the generated clips feels uncomfortable, like a nightmare. Geometry is "swimming" with camera movement, objects randomly appear and disappear, and damage is inconsistent.

The entire thing would probably crash and burn if you did something just slightly unusual compared to the training data, too. People talking about 'generated' games often seem to fantasize about an AI that will make up new outcomes for players that go off the beaten path, but a large part of the fun of real games is figuring out what you can do within the predetermined constraints set by the game's code. (Pen-and-paper RPGs are highly open-ended, but even a Game Master sometimes needs to protect the players from themselves, whereas the current generation of AI is famously incapable of saying no.)


I also noticed that they played AI DOOM very slowly: in an actual game you are running around like a madman, but in the video clips the player is moving in a very careful, halting manner. In particular, the player only moves in straight lines or turns while stationary; they almost never turn while running. I also didn't see much strafing.

I suspect there is a reason for this: running while turning doesn't work properly and makes it very obvious that the system doesn't have a consistent internal 3D view of the world. I'm already getting motion sickness from the inconsistencies in straight-line movement, I can't imagine turning is any better.


It made me laugh. Maybe they pulled random people from the hallway who had never seen the original Doom (or any FPS), or maybe only selected people who wore glasses and forgot them at their desk.


It's telling IMO that they only want people's opinions based on our notoriously faulty memories rather than sitting comparable situations next to one another in the game and simulation then analyzing them. Several things jump out watching the example video.


>rather than sitting comparable situations next to one another in the game and simulation then analyzing them.

That's literally how the human rating was set up, if you read the paper.


I think you misunderstand me. I don't mean a snap evaluation and deciding between two very-short competing videos which is what the participants were doing. I mean doing an actual analysis of how well the simulation matches the ground truth of the game.

What I'd posit is that it's not actually a very good replication of the game, but very good at replicating short clips that almost look like the game, and that the short time horizons are deliberately chosen because the authors know the model lacks coherence beyond that.


>I mean doing an actual analysis of how well the simulation matches the ground truth of the game.

Do you mean the PSNR and LPIPS metrics used in the paper?
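
For anyone unfamiliar with it, PSNR is just a log-scaled mean squared error between two frames. A minimal sketch of the standard definition, assuming 8-bit pixel values passed as flat lists (the function name and frame representation are mine, not taken from the paper):

```python
import math

def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat sequences of pixel values (an assumption made for
    this sketch; real code would usually work on image arrays).
    """
    if len(frame_a) != len(frame_b) or len(frame_a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Note that this only compares pixels frame by frame. It says nothing about whether game state such as health or enemy positions stays consistent over time.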


No, I think I've been pretty clear that I'm interested in how mechanically sound the simulation is. Also those measures are over an even shorter duration so even less relevant to how coherent it is at real game scales.


How should this be concretely evaluated and measured? A vibe check?


I think the study's evaluation using very short videos and humans is much more of a vibe check than what I've suggested.

Off the top of my head: DOOM is open source, so it should be reasonable to set up repeatable scenarios and use some frames from the game to create an identical starting scenario for the simulation. Then the input from the player of the game could be used to drive the simulated version. You could go further and instrument events occurring in the game for direct comparison to the simulation. I'd be interested in setting a baseline for playtime of the level in question and using sessions of around that length as an ultimate test.

There are some obvious mechanical deficiencies seen in the videos they've published. One that really stood out to me was the damage taken when in the radioactive slime. So I don't think the analysis would need to be particularly deep to find differences.
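
As a rough sketch of the comparison I have in mind: assume you can log the player's health once per tick from the instrumented game, and recover the same value from the simulation (say, by reading the HUD in the generated frames; how you do that is an open question). Then finding where the two diverge is trivial. Everything here, including the function names and the trace values, is hypothetical and only meant to illustrate the idea:

```python
from dataclasses import dataclass

@dataclass
class Divergence:
    tick: int
    game_health: int
    sim_health: int

def first_divergence(game_health, sim_health, tolerance=0):
    """Return the first tick where the traces differ by more than
    `tolerance`, or None if they agree over the shared length."""
    for tick, (g, s) in enumerate(zip(game_health, sim_health)):
        if abs(g - s) > tolerance:
            return Divergence(tick, g, s)
    return None

def damage_events(health):
    """Return (tick, amount) for every tick where health dropped."""
    pairs = zip(health, health[1:])
    return [(t, prev - cur)
            for t, (prev, cur) in enumerate(pairs, start=1)
            if cur < prev]

# Made-up traces for the same replayed inputs (not measured data).
game_trace = [100, 100, 95, 95, 90]
sim_trace = [100, 100, 95, 100, 100]

print(first_divergence(game_trace, sim_trace))
print(damage_events(game_trace), damage_events(sim_trace))
```

The same pattern would work for any instrumented event (ammo, kills, door states), and running it over a full level's worth of ticks is what would show whether the simulation holds up at real game scales.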



