GPT-2 Circuits - Mapping the Inner Workings of Simple LLMs

I created an app that extracts interpretable “circuits” from models based on the GPT-2 architecture. While many tutorials show hypothetical scenarios of how layers in an LLM make predictions, this app offers real examples of information flow within the system. For instance, you can observe how features that identify simple grammatical patterns are built from more basic features. If you’re interested in interpretability, I invite you to check it out! I would appreciate your feedback and would like to connect with others who can assist.
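For anyone who wants to poke at the raw activations that circuit analyses like this are built on, here is a minimal, generic sketch (not the app's actual code) that captures GPT-2's per-layer hidden states with the Hugging Face transformers library:

```python
# Minimal sketch (illustrative only, not the app's code): capture GPT-2's
# per-layer hidden states. These activations are the raw material that
# circuit analyses trace across layers.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one tensor per layer,
# each of shape (batch, sequence_length, hidden_size).
for layer_idx, h in enumerate(outputs.hidden_states):
    print(f"layer {layer_idx}: {tuple(h.shape)}")
```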

4 Likes

Very cool, and the graphs and visuals really get your point across… or they would, if you explained the numbers and colors. You also seem to use two different numbering schemes. The first is 4-5 digits long in your feature dependency graphs; sometimes the numbers are colored differently, and there are often multiple numbers per flowchart box. What do the numbers and colors represent? I could not find an explanation on your page. The second is a digit, followed by a dot, followed by 4-5 more digits. Is this supposed to represent the GPT-2 layer number plus the feature number for that layer?

3 Likes

That’s super cool! Congratulations!

2 Likes

This looks interesting, but I’m not knowledgeable enough to know for sure.

1 Like

Hey, I wanted to shout out that this is really impressive work, especially if it’s your first project in mechanistic interpretability! This definitely seems to be at the level of a workshop-ready paper, and it could easily be turned into a full paper with some additional work.