Hacker News

Thank you for sharing this. I've added it to my reading list!


The last figure in this paper is a huge disappointment when you consider how reality meets such theoretical arguments. A typical Mamba model has around 100 layers (maybe 60 for smaller ones). That figure only scales up to 4 layers, and since these are sufficient for the problem they consider, the argument goes that another RNN would have needed only one layer for it.


Why bother with more layers? The conventional transformer's lack of recurrent layers is mostly a weakness. The final layers of an LLM do very little (except the last one) and most of the time just let the token pass through from the layer that actually determined it. A recurrent architecture could dynamically perform as many passes as it needs to produce the next token. This would give a speedup on easy tokens and a slowdown on hard tokens, compared with the fixed "you must go through all layers regardless" depth of classical LLMs.
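A minimal sketch of the idea, assuming a single shared recurrent block and a hypothetical convergence test as the stopping rule (everything here is illustrative, not any model's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def step(h, W):
    """One pass of a shared recurrent block (a stand-in for a transformer layer)."""
    return np.tanh(W @ h)

def adaptive_depth_forward(h, W, max_steps=100, tol=1e-3):
    """Iterate the shared block until the hidden state stops changing:
    few passes for 'easy' tokens, more for 'hard' ones."""
    for n in range(1, max_steps + 1):
        h_next = step(h, W)
        if np.linalg.norm(h_next - h) < tol:  # hypothetical stopping rule
            return h_next, n
        h = h_next
    return h, max_steps

d = 8
W = 0.2 * rng.standard_normal((d, d)) / np.sqrt(d)  # contractive, so iteration converges
h0 = rng.standard_normal(d)
h, n_passes = adaptive_depth_forward(h0, W)
```

The same loop with a fixed iteration count is exactly the "go through all layers regardless" behaviour; the early exit is what a recurrent architecture could buy.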


Mamba is technically a recurrent neural network (and typically decodes as one), just with a constrained architecture that, among other things, keeps the norms of its matrices finite independently of gating. I think the paper above overloaded the word "state" to score some cheap points that may have hit the social networks at the time, but it didn't demonstrate a practical benefit of earlier RNNs over Mamba, and the example looks a bit silly. Using a ResNet with fewer than 4 layers to make a point about vision would be the equivalent of this paper. It might have been stronger if the authors had cared more, but we will not find out.
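To make the "Mamba is an RNN" point concrete: a heavily simplified, purely illustrative decode step of a selective SSM, with a diagonal transition whose negative poles keep the state bounded no matter what the input-dependent gate does (names and the gating distribution here are my own assumptions, not Mamba's actual parameterization):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
a = -np.exp(rng.standard_normal(d))  # negative real poles: state decays, never blows up
b = rng.standard_normal(d)

def selective_ssm_step(h, x, delta):
    """One recurrent decode step of a simplified selective SSM (illustration only).
    The input-dependent step size `delta` gates how much old state is kept."""
    a_bar = np.exp(delta * a)        # discretized transition, always in (0, 1)
    return a_bar * h + delta * b * x

# Decoding really is just a recurrence over the state, token by token.
h = np.zeros(d)
for t in range(1000):
    x = rng.standard_normal()
    delta = np.exp(rng.standard_normal())  # hypothetical positive gate
    h = selective_ssm_step(h, x, delta)
```

Because `a_bar` stays strictly inside (0, 1) for any positive gate value, the recurrence cannot diverge, which is one way of reading "norms stay finite independently of gating".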


(Disclaimer: I haven't looked at Mamba specifically yet.) State spaces as used currently differ from traditional RNNs in a very simple way: the state is accumulated linearly (often associatively), whereas in an RNN the state is passed and accumulated through a non-linear function.
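The distinction can be shown in a few lines: a linear recurrence unrolls into a sum of independent terms (which is what makes associative/parallel scans possible), while the nonlinearity in a classic RNN forces strictly sequential computation. A sketch with fixed matrices of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4
xs = rng.standard_normal((T, d))
A = 0.9 * np.eye(d)                        # hypothetical fixed transition
B = rng.standard_normal((d, d))            # hypothetical input matrix

# Linear SSM: h_t = A h_{t-1} + B x_t  -> state is a linear accumulation.
def ssm_sequential(xs):
    h = np.zeros(d)
    for x in xs:
        h = A @ h + B @ x
    return h

# The same state in closed form: h_T = sum_t A^(T-1-t) B x_t.
# Each term is independent of the others, which is what a parallel scan exploits.
def ssm_closed_form(xs):
    return sum(np.linalg.matrix_power(A, T - 1 - t) @ (B @ xs[t]) for t in range(T))

# Classic RNN: h_t = tanh(A h_{t-1} + B x_t) -> the nonlinearity wraps the state
# at every step, so no such decomposition exists and the loop is inherently serial.
def rnn_sequential(xs):
    h = np.zeros(d)
    for x in xs:
        h = np.tanh(A @ h + B @ x)
    return h
```

The two SSM functions compute the same state; only the tanh version has no closed form.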



