An Intuitive Explanation of How LLMs Work

Hey

I wrote a blog post that makes it easy to understand how LLMs work.

We begin with a broad view of LLMs as personal assistants, then get progressively more specific, covering ideas like tokenization, sampling, and embeddings.

I’ve added some illustrations to make the ideas easier to grasp. I also cover some limitations of current LLMs, like why they miscount the Rs in “strawberry” and fail to reverse the string “copenhagen”.
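To see why those examples trip models up: the model never receives individual letters, only token IDs for multi-character chunks. Here’s a quick sketch using OpenAI’s tiktoken (my choice for illustration; the post may use a different tokenizer):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4-era tokenizer

for word in ["strawberry", "copenhagen"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", ids, pieces)

# The words come back as a few multi-character chunks (e.g. "str|aw|berry"),
# so counting Rs or reversing the string has to be inferred from whole-chunk
# statistics rather than read off letter by letter.
```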

I hope it helps you!

Tell me what you think or if you have any questions.

7 Likes

Because you asked, here is some critical feedback:

It reads as “sparse”: mostly light text and funny gifs. Visualizations and animations would be great here.

The progression of “probabilities, not words,” “tokens, not words,” and “token IDs, not words” doesn’t flow evenly, and it dips into technical details that read more like trivia. Meanwhile, important pieces are missing, so it doesn’t come together as a whole for a new learner.

Points 1 and 2 don’t seem to lead into point 7.

Point 7 has the most detail, but it’s essentially someone else’s well-known architecture diagram.

I volunteer teaching AI with Python to students ages 12 to 17. This is my usual curriculum:

1. Very simple neural networks; we get into the math, but leave calculus and linear algebra out for now.

2. Training and inference with simple NNs (see the first sketch after this list).

3. Introduce feature extraction and use more complex architectures like ConvNets.

4. Go deeper into feature extraction and representation learning with Sentence Transformer embeddings.

5. Use fine-tuning and transfer learning to show how extracted features and representation learning help solve problems, and how an NN architecture can be reused.

6. Explore supervised, unsupervised, and semi-supervised learning with what we already “know how to do” (word embeddings, UMAP, and clustering work well here; second sketch below).

7. Introduce the transformer architecture, but don’t go into too much detail. A few “the tokens pay attention to each other” remarks are fine (third sketch below), but remind them how powerful ConvNets are at feature extraction.

8. Fill in some gaps about how tokens are learned through BPE (fourth sketch below).

9. Use everything we’ve learned so far to walk through the full architecture of an LLM.

10. Look at how LLMs store “facts” in the fully connected layers.

11. Learn how instruction tuning works and what it does to a “fancy auto-complete”.

12. Cover gotchas, ethics, safety, and alignment throughout.
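To make step 2 concrete, here is roughly what the first hands-on exercise looks like: a from-scratch NumPy network learning XOR. This is a sketch of the kind of code I use, not the exact classroom material; hyperparameters are picked by eye.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the classic toy problem that needs a hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5

for _ in range(5000):
    # Forward pass (training and inference share this code path).
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: hand-derived gradients of the squared error.
    dp = (p - y) * p * (1 - p)
    dh = (dp @ W2.T) * (1 - h ** 2)
    W2 -= lr * h.T @ dp; b2 -= lr * dp.sum(0)
    W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(0)

print(p.round(2))  # inference: should approach [[0], [1], [1], [0]]
```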
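For step 6, a minimal sketch of the embeddings-UMAP-clustering exercise (the model name and cluster count are just illustrative choices):

```python
# pip install sentence-transformers umap-learn scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import umap

sentences = [
    "The cat sat on the mat.",
    "A kitten napped on the rug.",
    "My dog loves chasing the ball.",
    "The stock market fell sharply today.",
    "Investors sold shares amid the downturn.",
    "The central bank raised interest rates.",
]

# Representation learning: turn sentences into dense vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")  # small, commonly used model
embeddings = model.encode(sentences)             # shape: (6, 384)

# Dimensionality reduction, e.g. for a 2-D scatter plot.
coords = umap.UMAP(n_components=2, n_neighbors=3, random_state=0).fit_transform(embeddings)

# Unsupervised learning: group the sentences without labels.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(list(zip(sentences, labels)))  # pets vs. finance, ideally
```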
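For step 7, the “tokens pay attention to each other” idea boils down to a few lines of NumPy. This is plain scaled dot-product attention, with no multi-head machinery or masking:

```python
import numpy as np

def attention(Q, K, V):
    """Each query mixes the values, weighted by query-key similarity."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (seq, seq) similarities
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                  # softmax over the keys
    return w @ V, w                                # weighted sum of values

# Three "tokens" with 4-dimensional representations; Q = K = V for simplicity.
x = np.random.default_rng(0).normal(size=(3, 4))
out, attn = attention(x, x, x)
print(attn.round(2))  # row i = how much token i attends to each token
```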
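And for step 8, the core of BPE fits on a slide: count adjacent symbol pairs, merge the most frequent pair, repeat. A toy character-level version (real tokenizers work on bytes and weight by word frequency):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in corpus:               # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == (a, b):
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus

merges, corpus = train_bpe(["low", "lower", "lowest", "newer", "wider"], 5)
print(merges)  # learned merges, e.g. ('l', 'o'), ('lo', 'w'), ...
print(corpus)  # the words re-segmented into learned subword units
```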

4 Likes

How are LLMs and Transformers different? The seven points you made in your post don’t seem to be unique to LLMs.

4 Likes

Most LLMs are transformers; decoder-only transformers, in particular.

The original transformer had both an encoder and a decoder and was used for translation. GPT is decoder-only, while BERT is encoder-only.

A language model is anything that can predict the next word from its context. I wouldn’t call BERT an LLM in the “predicting the next token” sense, because it’s trained with masked language modeling, which doesn’t do that.
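A quick way to feel that difference is to put the two objectives side by side with Hugging Face pipelines (the model choices here are just the usual small defaults, not anything from the post):

```python
# pip install transformers torch
from transformers import pipeline

# Causal LM: predicts the NEXT token, left to right (GPT-style).
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of Denmark is", max_new_tokens=5)[0]["generated_text"])

# Masked LM: fills a blank using context on BOTH sides (BERT-style).
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of Denmark is [MASK].")[0]["token_str"])
```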

3 Likes

LLMs come in different architectures. Transformers are one, but so are Mamba and RNNs (which were all the rage until 2018).

Since my post is mostly about modern transformer-based LLMs, many of the points I make also hold for transformers in general.

2 Likes

What would you say are the differences between a Transformer (the original encoder-decoder one from 2017), a “smaller” model like BERT or GPT-2, and an LLM?

This post is meant to give a basic intuition for how LLMs work without getting too complicated, steering clear of neural-network internals and the math behind the transformer architecture.

It’s mostly for three types of people:

People who are new to LLMs or chatbots and don’t have much machine-learning background

ML engineers who don’t work in NLP and come from a different domain, such as time series

ML engineers who work with LLMs every day and get bogged down in the specifics of learning rates, loss curves, epochs, and so on

That’s me, number 3, and I found it very helpful to re-read this whenever a model of mine came out wrong. The cause was usually that I used the wrong prompt format, accidentally added the BOS token twice, or forgot to append the stop token while preparing the data, not that the learning rate or scheduler was off.
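Those bugs are cheap to catch with a sanity check on the tokenized examples before training. A sketch, assuming a Hugging Face tokenizer for a Llama-style model (the checkpoint name is a placeholder):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute whatever model you are fine-tuning.
tok = AutoTokenizer.from_pretrained("my-org/my-llama-checkpoint")

def check_example(text):
    ids = tok(text).input_ids
    # Classic bug: prepending tok.bos_token to text the tokenizer already
    # prefixes with BOS yields a doubled BOS.
    assert ids.count(tok.bos_token_id) <= 1, "double BOS token"
    # Classic bug: forgetting the stop token, so the model never learns
    # to end its completions.
    assert ids[-1] == tok.eos_token_id, "missing stop token"
    return ids

check_example("### Instruction:\nSay hi.\n### Response:\nHi!" + tok.eos_token)
```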

1. It’s meant to be an introduction to the transformer architecture, so it doesn’t go into specifics about what a feed-forward layer is or how attention works. That was just a “lead” for the reader to follow.

2. The picture is from the GPT-2 paper and is pretty well-known, so I didn’t think it needed credit. I will add a reference to the post, though.

Yeah, I think my criticism still stands. If that was your objective, it seems even more aimless. I wasn’t critiquing the use of a well-known image of that architecture, I was just typing on mobile and that was the reference that came out.