
Isn't this the attention mechanism, i.e. the reason we're using transformers for these things in the first place? Maybe not greater resolution per se, but focusing on a region with greater neural connectivity.


Ah, good point!

But the model sits downstream of the "patch" tokenization, so the reduction in resolution (the compression) of the image has already happened by the point where the model can direct greater "attention".
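
For concreteness, here's a minimal sketch of what that patchify step usually looks like (ViT-style, in PyTorch; the 16x16 patches, 224x224 input and 768-dim embedding are illustrative defaults, not taken from any particular model). The strided convolution is where each 16x16 block of pixels gets mixed down into a single token, before any attention has run:

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        """ViT-style patchify: each 16x16 pixel block becomes one token."""
        def __init__(self, patch=16, in_ch=3, dim=768):
            super().__init__()
            # Kernel size == stride: non-overlapping patches, so spatial
            # resolution is divided by `patch` in each dimension right here,
            # before any transformer layer gets to attend to anything.
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

        def forward(self, x):                    # x: (B, 3, H, W)
            x = self.proj(x)                     # (B, dim, H/16, W/16)
            return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

    # A 224x224 image becomes 14*14 = 196 tokens; the per-pixel detail
    # inside each patch has already been mixed into one 768-d vector.
    tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
    print(tokens.shape)  # torch.Size([1, 196, 768])

So by the time attention gets to "focus" anywhere, the finest unit it can focus on is one of those 196 patch tokens.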

I think the synthesis is that I'm proposing a per-pixel tokenization with a transformer block whose purpose is to output information at a compression level "equivalent" to that of the patch tokens (is this what an autoencoder is?), but where the attention vector is a function of the full state of the LLM (i.e., inclusive of the text surrounding the image).
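
A rough sketch of what I have in mind, borrowing the cross-attention "resampler" pattern (à la Perceiver/Flamingo/Q-Former) but deriving the queries from the LLM's text state rather than from fixed learned latents. Everything here (the names, the shapes, the single pooled llm_state vector) is a hypothetical simplification of the idea, not an existing design:

    import torch
    import torch.nn as nn

    class ContextConditionedCompressor(nn.Module):
        """Hypothetical: compress per-pixel tokens down to a fixed number of
        image tokens, steering the compression with the LLM's text state."""
        def __init__(self, dim=768, n_out_tokens=196, n_heads=8):
            super().__init__()
            self.pixel_proj = nn.Linear(3, dim)   # per-pixel "tokenization"
            self.queries = nn.Parameter(torch.randn(n_out_tokens, dim))
            self.ctx_proj = nn.Linear(dim, dim)   # feedback from the LLM state
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

        def forward(self, pixels, llm_state):
            # pixels:    (B, H*W, 3) flattened raw pixel values
            # llm_state: (B, dim) summary of the LLM context around the image
            kv = self.pixel_proj(pixels)                        # (B, H*W, dim)
            q = self.queries.unsqueeze(0) + self.ctx_proj(llm_state).unsqueeze(1)
            # Queries are modulated by the text context, so *what* survives the
            # compression depends on the surrounding prompt, not a fixed grid.
            out, _ = self.attn(q, kv, kv)                       # (B, n_out_tokens, dim)
            return out

(Existing resamplers keep the queries purely learned, I believe precisely so the image tokens can be computed once, independently of the prompt; making them prompt-dependent is the part that changes the economics.)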

Naïvely, I'd think a layer like this that is agnostic to the LLM state needn't be any more computationally costly than the patching computation (both are big honks of linear algebra?), but idk how expensive the "full context attention" feedback is...
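
A back-of-envelope with the same illustrative numbers as above (224x224 RGB image, 16x16 patches, dim 768) suggests the LLM-agnostic projection really is in the same ballpark as patching, and that the expensive part is attending the context-derived queries over ~50k per-pixel keys/values (counting multiply-accumulates only, so rough scaling, not wall-clock):

    H = W = 224
    patch, dim = 16, 768

    pixels = H * W                           # 50,176 per-pixel tokens
    patches = (H // patch) * (W // patch)    # 196 patch tokens

    # Patch route: one dim x (patch*patch*3) projection per patch token.
    patch_macs = patches * dim * (patch * patch * 3)   # ~0.12 GMAC

    # Per-pixel route: project every pixel to dim, then cross-attend
    # 196 context-derived queries against all ~50k pixel keys/values.
    pixel_proj_macs = pixels * 3 * dim                 # ~0.12 GMAC (same ballpark)
    kv_proj_macs = 2 * pixels * dim * dim              # ~59 GMAC
    attn_macs = 2 * patches * pixels * dim             # ~15 GMAC (QK^T + attn*V)

    for name, macs in [("patch embed", patch_macs), ("pixel proj", pixel_proj_macs),
                       ("k/v proj", kv_proj_macs), ("cross-attn", attn_macs)]:
        print(f"{name:12s} {macs / 1e9:6.2f} GMAC")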

(I apologize to anyone who actually understands transformers for my gratuitous (ab|mis)use of terminology)



