So, any given sequence of inputs is rendered into a corresponding image, twenty times per second. I wonder how separate the game logic and the generated graphics are in the fully trained model.
Given a clean enough separation between the two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash for each specific combination of inputs, and then treat the resulting mapping as a representation of that game's inner workings.
To make it less abstract, you could save some small enough snapshot of the game engine's state for every given input sequence. That would make it much less dependent on what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics in a separate step.
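Roughly what I mean, as a toy Python sketch (everything here is made up for illustration; `step` just stands in for whatever the real engine does):

    import hashlib
    import json

    def hash_inputs(inputs):
        # Stable key for a specific combination of inputs.
        return hashlib.sha256(json.dumps(list(inputs)).encode()).hexdigest()

    def step(state, action):
        # Stand-in for the real game logic: a toy grid mover.
        dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[action]
        x, y = state["player"]
        return {**state, "player": (x + dx, y + dy)}

    def build_template(initial_state, input_sequences):
        # The "abstract game template": input-sequence hash -> engine-state snapshot.
        template = {}
        for inputs in input_sequences:
            state = initial_state
            for action in inputs:
                state = step(state, action)
            template[hash_inputs(inputs)] = state
        return template

    def render(state):
        # Separate step: map objects in a snapshot to graphics (here, just text).
        return f"player sprite at {state['player']}"

    template = build_template({"player": (0, 0)}, [["right"], ["right", "down"]])
    print(render(template[hash_inputs(["right", "down"])]))

The point being that the template (inputs to state) and the rendering (state to pixels) never have to know about each other.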
I imagine this whole system would work especially well for games that only update when player input is given: games like Myst, Sokoban, etc.