I don't understand either. The general idea seems to be that they compute attention twice, and because of random initialization the two maps end up slightly different. I'd have thought that what the two attention maps have in common would be the signal and where they differ would be the noise, so rather than subtracting them (leaving you with mostly noise?!) what you really want is to add them (so the shared signal gets reinforced) and normalize.
I think there might be some commonality with control systems engineering, where you subtract the plant's output from the reference input to get an error signal that steers the plant toward the target values. I too fail to see how that is supposed to work here in practice.
The values of the two groups are also going to diverge during training due to the structure of the DiffAttn equation.
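For context, here's a minimal sketch of what I understand the DiffAttn computation to be: two attention maps from separate query/key projections, with the second one scaled by a learnable lambda and subtracted before the values are applied. The names, shapes, and single-head setup here are just illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def diff_attn(x, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Sketch of differential attention: compute two softmax attention maps
    from separate projections, subtract the second (scaled by lambda),
    then apply the result to the values."""
    d = Wq1.shape[1]
    q1, k1 = x @ Wq1, x @ Wk1          # first attention group
    q2, k2 = x @ Wq2, x @ Wk2          # second attention group
    v = x @ Wv
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d ** 0.5, dim=-1)
    # The subtraction is what lets entries go negative ("negative attention").
    # Because each group has its own weights and lambda is learned, the two
    # maps receive different gradients and drift apart during training.
    return (a1 - lam * a2) @ v

# Toy usage with random weights (hypothetical sizes)
torch.manual_seed(0)
n, d_model, d_head = 4, 8, 8
x = torch.randn(n, d_model)
Wq1, Wk1, Wq2, Wk2, Wv = (torch.randn(d_model, d_head) for _ in range(5))
lam = torch.tensor(0.5)
print(diff_attn(x, Wq1, Wk1, Wq2, Wk2, Wv, lam).shape)  # torch.Size([4, 8])
```

With both projection sets initialized randomly, a1 and a2 start out as two slightly different noisy maps, which is what the earlier comment about "adding instead of subtracting" is reacting to.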
The analogy I can think of is when you're paying attention to a variety of things and you actively avoid concentrating on something because it will distract you. You don't give it zero attention, you give it negative attention.