What if self-attention isn’t the end-all be-all?

This is an interesting alternative, motivated by the idea that information is lost in transformers. I’d love to hear what you think!

Masked Mixers for Language Generation and Retrieval
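
For anyone who wants the gist before clicking through: as I understand it, the attention block is swapped for a learned token-mixing map that is masked to stay causal. A rough sketch of that idea in PyTorch (my own simplified naming, not the paper’s actual code):

```python
import torch
import torch.nn as nn

class CausalTokenMixer(nn.Module):
    """MLP-mixer-style block whose token-mixing weights are masked to be
    lower-triangular, so position t only sees positions <= t (causal)."""
    def __init__(self, seq_len: int, d_model: int):
        super().__init__()
        self.mix = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len ** 0.5)
        self.register_buffer("mask", torch.tril(torch.ones(seq_len, seq_len)))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        w = self.mix * self.mask                # zero out future positions
        x = x + torch.einsum("ts,bsd->btd", w, self.norm1(x))  # token mixing
        x = x + self.ff(self.norm2(x))                         # channel mixing
        return x
```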

12 Likes

I really believe that self-attention is both unnecessary and overused when it comes to vision. There should be more convolutions, and then self-attention over tokens that aren’t tied to any single patch.
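
Something along these lines: a convolutional stem for local structure, then self-attention over a handful of pooled tokens that each summarize a region rather than a single patch. A rough sketch of what I mean, not a reference implementation:

```python
import torch
import torch.nn as nn

class ConvThenAttention(nn.Module):
    """Conv stem extracts local features; self-attention then runs over a
    small set of pooled tokens, each summarizing a region, not one patch."""
    def __init__(self, d_model=256, pooled=4, n_heads=8):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.GELU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(pooled)       # -> pooled x pooled tokens
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, img):                            # img: (B, 3, H, W)
        feats = self.pool(self.stem(img))              # (B, d_model, pooled, pooled)
        tokens = feats.flatten(2).transpose(1, 2)      # (B, pooled*pooled, d_model)
        out, _ = self.attn(tokens, tokens, tokens)     # self-attention over region tokens
        return out
```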

11 Likes

Something more advanced will eventually replace Transformers. A small group of companies is focused on exactly that. It’s just a matter of time.

10 Likes

Any resources on this?

9 Likes

If you were hoping for something technical, sorry, I haven’t seen anything solid waiting in the wings. That said, I find Yann’s world-model idea appealing, and it’s probably headed in the right direction.

7 Likes

Transformers may not be the be-all and end-all, but they did win the technology race (so far). That won’t always be the case, but since I’m not working directly on hardware or software alternatives, I’ll wait and see what comes out on top.

7 Likes

It depends on the domain. Unless you’re talking about LLMs, self-attention may not even appear in state-of-the-art models for a given field.

6 Likes

In figure 9, the GPT’s training loss is substantially worse than its evaluation loss. Is that common?

5 Likes

For some definitions of “substantially” and “common”, yes. It happens more often when you train on very large distributions, and it can be a sign that the model isn’t memorizing or overfitting during training.
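
One mundane mechanism that can produce exactly this gap, separate from the large-distribution point: regularization like dropout is active in training mode but switched off at evaluation. A toy illustration (made-up model, just to show the effect):

```python
import torch
import torch.nn as nn

# Dropout is on in train mode and off in eval mode, so the same data can
# score a higher loss under model.train() than under model.eval().
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
x, y = torch.randn(256, 32), torch.randn(256, 1)

with torch.no_grad():
    model.train()
    train_loss = loss_fn(model(x), y)   # dropout on: noisier forward pass, typically higher loss
    model.eval()
    eval_loss = loss_fn(model(x), y)    # dropout off: same data, usually lower loss

print(train_loss.item(), eval_loss.item())
```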

4 Likes

It already isn’t.

There’s also cross-attention.

There are also the Mambas (state-space models). And xLSTM is drawing attention away too.

Then there are GCNs and GNNs, which work like transformers but with more specific connectivity. Message-passing neural networks are becoming popular again.

Sometimes these do work better than transformers. The catch is that transformers beat many of these approaches and don’t require as much “human” engineering: most of the time you can write down the architecture, throw enough data and compute at it (which the big players have), and get something useful out.
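
To make the GCN/transformer comparison concrete: a graph-conv layer is roughly attention where the connectivity is fixed and sparse instead of learned and dense per input. A toy message-passing sketch (my own naming):

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """Minimal message passing: each node averages its neighbours' transformed
    features. Self-attention is the fully connected case where the 'adjacency'
    is computed per input instead of being fixed."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, adj):              # x: (N, d_in), adj: (N, N) 0/1 with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        messages = adj @ self.lin(x)        # sum messages from neighbours
        return torch.relu(messages / deg)   # normalize by degree

# toy usage: 4 nodes on a chain graph
x = torch.randn(4, 8)
adj = torch.eye(4) + torch.diag(torch.ones(3), 1) + torch.diag(torch.ones(3), -1)
out = SimpleGraphConv(8, 16)(x, adj)
```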

The post you linked is really cool. Masked denoising autoencoders and BERT are deeply connected; I’d be curious whether a similar connection exists between masked mixers and some kind of autoencoder, GCN, or attention.

3 Likes