I recently discovered Mamba and SSMs because my professor suggested I explore them. For context, I’m a master’s student who has just started my research journey. Initially, I wanted to focus on transformer language models, like many of my classmates. However, someone mentioned that taking my professor’s suggestion might push me toward a topic that hasn’t been explored much yet, making my research more challenging than necessary and possibly resulting in mediocre outcomes. What are your thoughts on this? Thank you.
Mamba is an SSM-based architecture that borrows design ideas from transformer-style blocks, and hybrid SSM/transformer models are actively being explored. Many private labs are experimenting with it; for instance, Mistral has released a model based on it.
If you have a solid understanding of transformers and other models, it could be beneficial depending on your career goals.
I also believe xLSTM is worth exploring.
From a purely academic standpoint, transformer language model research is something that almost everyone is currently engaged in. The chances of you creating something unique and advancing the field of transformers are statistically much lower than if you focus on a more niche area.
If your goal is to enter the industry within 1-2 years, pursuing transformer language models is a good choice. I believe your professor sees you as a curious and creative individual and thinks you would benefit from gaining a more mathematical perspective.
Tri Dao, who co-created the Mamba models, has a detailed step-by-step breakdown of the concept, architecture, proofs, and more on his website. He also developed FlashAttention for transformers and believes that SSMs, or some form of recurrent models, are more likely to lead to the next significant advancement in language model performance.
If you enjoy research and are considering pursuing a PhD, studying this subject could be worthwhile. I believe it remains a relatively unexplored but highly active research area. If you’re concerned about the theory behind SSMs (like HiPPO, etc.), I recommend starting with Linear Attention. The paper “Transformers are RNNs” presents attention as an infinite-dimensional recursion that can be approximated by a finite-dimensional version. Recent proposals for SSMs and Linear RNNs build on this concept.
From there, you could explore “Gated Linear Attention” and “Parallelizing Linear Transformers with the Delta Rule over Sequence Length,” which also build on this idea. Gated Linear Attention is very similar to Mamba/Mamba2. All of this work avoids SSM theory, but I believe that for “selective” models like Mamba applied to language modeling, this theory may not be as crucial.
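Concretely, the “attention as a recurrence” view from “Transformers are RNNs” amounts to replacing softmax(QKᵀ)V with a feature map φ and a running sum of outer products, so each step updates only a fixed-size state instead of attending over the whole prefix. Here’s a minimal NumPy sketch of that causal recurrence (my own toy code, not the papers’ implementations; the ELU+1 feature map and the tiny dimensions are just illustrative assumptions):

```python
import numpy as np

def elu_plus_one(x):
    # Feature map phi(x) = elu(x) + 1, which keeps features positive
    # (the choice used in "Transformers are RNNs", Katharopoulos et al., 2020).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention computed as an RNN-style recurrence.

    Q, K: (T, d_k) queries/keys; V: (T, d_v) values.
    Instead of a T x T attention matrix, we carry a fixed-size state:
      S_t = S_{t-1} + phi(k_t) v_t^T   # (d_k, d_v) "memory" matrix
      z_t = z_{t-1} + phi(k_t)         # (d_k,) normalizer
      o_t = phi(q_t)^T S_t / (phi(q_t)^T z_t)
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    z = np.zeros(d_k)
    out = np.zeros((T, d_v))
    for t in range(T):
        q, k, v = elu_plus_one(Q[t]), elu_plus_one(K[t]), V[t]
        S += np.outer(k, v)              # accumulate key-value outer products
        z += k                           # accumulate keys for normalization
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

# Tiny usage example with random data (illustrative only).
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
Q, K, V = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)), rng.normal(size=(T, d_v))
print(linear_attention_recurrent(Q, K, V).shape)  # (8, 4)
```

Gated Linear Attention and the Delta Rule work modify how that state S is updated (e.g. decaying or partially overwriting it), which is why they end up so close to Mamba-style selective SSMs.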
Most of the time, research is about trying new things that probably won’t work out.
Transformers are old news. Mamba and SSM architectures are new. Your PI is right to push you there.
There are always opportunities and risks when a new subject comes along.