a) a linear SSM (a form of RNN?) is equivalent to Attention without the scaling and softmax; and
b) Attention is "all you need" and the thing that made Transformers radically outperform all the previous architectures like LSTMs that used to dominate NLP;
does that imply c) that the scaling and softmax parts of the attention equation, in particular, are the magic touch that makes Transformers work so well?
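The equivalence in (a) can be checked numerically. Below is a minimal sketch (names and shapes are illustrative, not from any particular paper): causal attention without scaling and softmax, computed once in the parallel "attention" form and once as a recurrence over a fixed-size state, the way a linear SSM would compute it.

```python
import numpy as np

# Parallel form: out_t = sum_{s<=t} (q_t . k_s) v_s  -- attention minus softmax.
# Recurrent form: S_t = S_{t-1} + k_t v_t^T ; out_t = q_t^T S_t -- an RNN-style
# update over a fixed (d x d) state. The two should match exactly.

rng = np.random.default_rng(0)
T, d = 5, 4                          # sequence length, head dimension
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Parallel form with a causal (lower-triangular) mask
scores = np.tril(Q @ K.T)            # (T, T) raw dot products, future zeroed
out_parallel = scores @ V

# Recurrent form: accumulate key-value outer products left to right
S = np.zeros((d, d))
out_recurrent = np.empty((T, d))
for t in range(T):
    S += np.outer(K[t], V[t])        # state update, size independent of T
    out_recurrent[t] = Q[t] @ S      # readout against current state

assert np.allclose(out_parallel, out_recurrent)
```

Note the state `S` stays `(d, d)` no matter how long the sequence gets, which is exactly the fixed-size-state property discussed below; applying a softmax across the scores breaks this rewriting, since the normalizer couples all positions.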
The major difference is that Transformer state grows as the sequence gets longer, while recurrent models use a fixed-size state. So presumably, once sequence length T exceeds the state size N, the Transformer will be better on certain tasks, particularly those that require the model to select information from the beginning of the sequence conditioned on something at the end. Transformers can refocus at any time, while SSMs must guess from the start what to keep and what to drop. SSMs could use the old trick of repeating the input twice, so that the end can condition on the beginning as well.
An important role is played by the softmax function, which normalizes the attention scores, allowing the model to weigh different parts of the input sequence dynamically. This means that, unlike RNNs, which process inputs sequentially and update a state, Transformers can directly access and prioritize information from any part of the sequence, and they are not slower for T < N.
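To make the softmax normalization concrete, here is a minimal sketch of causal scaled dot-product attention (names are illustrative): after masking, each position's weights form a probability distribution over all earlier positions, which is what lets the model re-weight the whole prefix at every step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 4                                  # sequence length, head dimension
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)                # scaling keeps logits well-ranged
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[future] = -np.inf                     # causal mask: no peeking ahead
weights = softmax(scores)                    # each row sums to 1
out = weights @ V                            # convex combination of values

assert np.allclose(weights.sum(axis=1), 1.0)
```

Because the normalizer at position t depends on every score in that row, this computation cannot be folded into a single fixed-size recurrent state the way the unnormalized version can.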