I don't understand either. The general idea seems to be that they compute attention twice, and because of random initialization the two maps end up slightly different. I'd have thought that what the two attention maps have in common would be the signal and where they differ would be the noise, so rather than subtracting them (leaving you with mostly noise?!) what you really want is to add them (so the shared signal gets reinforced) and normalize.
I think there might be some commonality with control systems engineering, where you subtract the plant's output from the reference input to get an error signal that steers the plant toward the target values. I too fail to see how that is supposed to work here in practice.
The values of the two groups are also going to diverge during training due to the structure of the DiffAttn equation.
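For context, here's a minimal sketch of what I understand the DiffAttn computation to be: two attention maps from separate query/key projections, with the second one scaled by a learnable lambda and subtracted before the values are applied. The names, shapes, and single-head setup here are just illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def diff_attn(x, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Sketch of differential attention: compute two softmax attention maps
    from separate projections, subtract the second (scaled by lambda),
    then apply the result to the values."""
    d = Wq1.shape[1]
    q1, k1 = x @ Wq1, x @ Wk1          # first attention group
    q2, k2 = x @ Wq2, x @ Wk2          # second attention group
    v = x @ Wv
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d ** 0.5, dim=-1)
    # The subtraction is what lets entries go negative ("negative attention").
    # Because each group has its own weights and lambda is learned, the two
    # maps receive different gradients and drift apart during training.
    return (a1 - lam * a2) @ v

# Toy usage with random weights (hypothetical sizes)
torch.manual_seed(0)
n, d_model, d_head = 4, 8, 8
x = torch.randn(n, d_model)
Wq1, Wk1, Wq2, Wk2, Wv = (torch.randn(d_model, d_head) for _ in range(5))
lam = torch.tensor(0.5)
print(diff_attn(x, Wq1, Wk1, Wq2, Wk2, Wv, lam).shape)  # torch.Size([4, 8])
```

With both projection sets initialized randomly, a1 and a2 start out as two slightly different noisy maps, which is what the earlier comment about "adding instead of subtracting" is reacting to.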
The analogy I can think of is when you're paying attention to a variety of things and you actively avoid concentrating on something because it will distract you. You don't give it zero attention, you give it negative attention.