On June 6, 2024, OpenAI introduced a technique for decomposing GPT-4 into 16 million interpretable features. The advance comes from improved methods for training sparse autoencoders at scale, which disentangle GPT-4’s internal representations into features that often appear to correspond to understandable concepts. This marks significant progress toward understanding the neural activity of language models; the new methods scale better than existing work and are completely unsupervised.
This is really superb work. If you liked the Sonnet/Golden Gate stuff, you'll like this too. They're open-sourcing their GPT-2 SAEs as well 😍 https://t.co/8Hg1guFg11
This is super cool work! Sparse autoencoders are currently the most promising approach to actually understanding how models "think" internally. This new paper demonstrates how to scale them to GPT-4 and beyond – completely unsupervised. A big step forward! https://t.co/jZ36peImDr
OpenAI's GPT-4 Surpasses Human Performance in Theory of Mind, Identifies 16 Million Features https://t.co/IIkWTEqNvc
https://t.co/Mhzh95J1la “Today, we are sharing improved methods for finding a large number of "features"—patterns of activity that we hope are human interpretable. Our methods scale better than existing work, and we use them to find 16 million features in GPT-4”
.@OpenAI just dropped a new technique to break GPT-4 down into 16,000,000 #interpretable features 🧵 https://t.co/vhKbUU5GzZ
We're sharing progress toward understanding the neural activity of language models. We improved methods for training sparse autoencoders at scale, disentangling GPT-4’s internal representations into 16 million features—which often appear to correspond to understandable concepts… https://t.co/UFP0EfEKSL
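The sparse-autoencoder idea these posts describe can be sketched in a few lines: encode a model activation into a much wider, mostly-zero feature vector, then decode it back into a reconstruction. Below is a minimal, illustrative sketch of a TopK-style sparse autoencoder forward pass, in the spirit of what OpenAI describes. All sizes, weights, and names here are toy assumptions for illustration, not taken from the paper or its code release.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8        # width of a model activation vector (toy size)
n_features = 32    # dictionary size; the real work scales this to 16 million
k = 4              # number of features allowed to be active per input

# Randomly initialized encoder/decoder weights (a trained SAE would learn these
# by minimizing reconstruction error on real model activations).
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

def topk_sae(x):
    """Encode an activation into a k-sparse feature vector, then decode it."""
    pre = (x - b_dec) @ W_enc + b_enc      # encoder pre-activations
    acts = np.maximum(pre, 0.0)            # ReLU
    # TopK sparsity: keep only the k largest activations, zero the rest.
    acts[np.argsort(acts)[:-k]] = 0.0
    recon = acts @ W_dec + b_dec           # decoder reconstruction
    return acts, recon

x = rng.normal(size=d_model)
acts, recon = topk_sae(x)
print("active features:", int((acts > 0).sum()))  # at most k
```

The key design point is that sparsity is enforced structurally (TopK) rather than via an L1 penalty, so each input is explained by only a handful of dictionary features; interpretability then comes from inspecting which inputs activate each feature.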