NLA Transforms Gemma 3’s Internal Activations into Readable Text for Any Token

✍️ OpenClawRadar · 📅 Published: May 8, 2026

Anthropic has published a new technique called Natural Language Autoencoders (NLA) that translates an LLM's internal activations at any given token into human-readable text. Two sets of model weights have been released for Gemma 3 27B Instruct:

  • Auto Verbalizer (AV): An LLM that translates the target model's activations into a natural language explanation of what the model is “thinking” when generating a particular token. Weights available at kitft/nla-gemma3-27b-L41-av.
  • Activation Reconstructor (AR): A companion model that reconstructs the activations from the AV’s text output, which makes it possible to check how faithfully that text captures the original activation. Weights available at kitft/nla-gemma3-27b-L41-ar.

Neuronpedia already hosts an interactive demo at neuronpedia.org/gemma-3-27b-it/nla. You ask Gemma 3 a question, click any token in the response, then click “explain” to see the model’s internal reasoning for that token translated into plain text.

This is not an attention or saliency map: the method decodes the hidden-state vectors directly. The AV model can run alongside your LLM and produce an explanation for each token, while the AR model maps that explanation back to an activation so the reconstruction can be compared with the original. Both models are released as open weights.
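
To make the round trip concrete, here is a minimal Python sketch of a fidelity check: an activation is turned into text, the text is turned back into a vector, and the two vectors are compared. Everything in it is an assumption for illustration. The `toy_verbalize` and `toy_reconstruct` functions are hypothetical stand-ins, not the released AV and AR models, and cosine similarity and relative squared error are common choices for comparing vectors rather than metrics confirmed by the source.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def relative_squared_error(original: Vector, rebuilt: Vector) -> float:
    """Squared reconstruction error divided by the original's squared norm."""
    if len(original) != len(rebuilt):
        raise ValueError("vectors must have the same length")
    error = sum((x - y) ** 2 for x, y in zip(original, rebuilt))
    energy = sum(x * x for x in original)
    if energy == 0.0:
        raise ValueError("relative error is undefined for a zero vector")
    return error / energy


def round_trip(
    activation: Vector,
    verbalize: Callable[[Vector], str],
    reconstruct: Callable[[str], List[float]],
) -> Tuple[str, float, float]:
    """Run activation -> text -> activation and score the reconstruction."""
    text = verbalize(activation)
    rebuilt = reconstruct(text)
    return (
        text,
        cosine_similarity(activation, rebuilt),
        relative_squared_error(activation, rebuilt),
    )


# Hypothetical stand-ins: the "text" is just the rounded numbers, so the
# bottleneck is lossy in the same spirit as a natural-language description.
def toy_verbalize(activation: Vector) -> str:
    return " ".join(f"{x:.1f}" for x in activation)


def toy_reconstruct(text: str) -> List[float]:
    return [float(token) for token in text.split()]


if __name__ == "__main__":
    text, cos, rse = round_trip([0.52, -1.31, 2.04], toy_verbalize, toy_reconstruct)
    print(f"text={text!r} cosine={cos:.4f} relative_error={rse:.5f}")
```

With real models, `verbalize` would wrap the AV and `reconstruct` the AR. A high cosine similarity and a low relative error would then suggest that the explanation kept most of the information in the activation.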

Who it's for: Researchers and engineers doing mechanistic interpretability work, or developers curious about why their agent’s model picks specific tokens.

📖 Read the full source: r/LocalLLaMA
