Anthropic on May 7 published research introducing Natural Language Autoencoders (NLAs), a technique that translates a frontier model's internal activations directly into human-readable text. The work, released through Anthropic's research site and the Transformer Circuits thread, is aimed at one of the field's most stubborn problems: figuring out what models are actually "thinking" when their stated reasoning may not match their internal computations.
How NLAs work
An NLA wraps a frozen target model with two trained components. An activation verbalizer takes a vector from inside the model and produces a text explanation. An activation reconstructor then reads that explanation and tries to rebuild the original activation. Training optimizes reconstruction fidelity, which Anthropic says empirically pushes the verbalizer toward more informative explanations rather than generic narration.
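In code terms, the setup boils down to two trainable modules wrapped around activations captured from a frozen model, with nothing but a reconstruction loss driving training. The sketch below is an illustrative toy in PyTorch, not Anthropic's implementation: the real verbalizer and reconstructor are presumably full language models producing and reading actual text, whereas here the "explanation" is a short sequence of soft tokens so the loop stays differentiable, and all module names and dimensions are assumptions.

```python
# Toy sketch of the NLA objective: verbalizer "explains" an activation,
# reconstructor rebuilds the activation from that explanation, and only
# reconstruction fidelity is optimized. Dimensions and the soft-token
# explanation are illustrative assumptions, not details from the paper.
import torch
import torch.nn as nn

D_ACT = 512      # assumed width of the target model's activation vector
VOCAB = 1000     # toy vocabulary for the "explanation"
EXPL_LEN = 16    # toy explanation length in tokens

class Verbalizer(nn.Module):
    """Maps an activation vector to a sequence of token distributions."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_ACT, EXPL_LEN * VOCAB)

    def forward(self, act):                         # act: (batch, D_ACT)
        logits = self.proj(act).view(-1, EXPL_LEN, VOCAB)
        # Soft tokens keep this toy differentiable end to end; a real
        # system would decode discrete, human-readable text here.
        return torch.softmax(logits, dim=-1)        # (batch, EXPL_LEN, VOCAB)

class Reconstructor(nn.Module):
    """Reads the (soft) explanation and predicts the original activation."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, 64)
        self.readout = nn.Linear(EXPL_LEN * 64, D_ACT)

    def forward(self, expl):                        # expl: (batch, EXPL_LEN, VOCAB)
        h = self.embed(expl).flatten(start_dim=1)
        return self.readout(h)                      # (batch, D_ACT)

verbalizer, reconstructor = Verbalizer(), Reconstructor()
opt = torch.optim.Adam(
    list(verbalizer.parameters()) + list(reconstructor.parameters()), lr=1e-3
)

# Stand-in for activations captured from the frozen target model.
activations = torch.randn(32, D_ACT)

for step in range(100):
    explanation = verbalizer(activations)           # explain the activation
    reconstruction = reconstructor(explanation)     # rebuild it from the explanation
    loss = nn.functional.mse_loss(reconstruction, activations)  # reconstruction fidelity
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The detail carrying the argument is that nothing supervises the explanation text directly; only the reconstruction error does, which is the mechanism Anthropic credits for pushing the verbalizer toward informative rather than generic descriptions.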
The team tested the method on Claude Opus 4.6, Claude Mythos Preview, and Claude Haiku 3.5, plus several open-weight models via a Neuronpedia collaboration. NLAs were already used in Anthropic's pre-deployment alignment audits for Claude Mythos Preview and Claude Opus 4.6.
Catching evaluation awareness that models won't admit
The most striking finding is that Claude appears to register "this is a test" internally without saying so. NLA explanations surfaced statements such as "This feels like a constructed scenario designed to manipulate me" while the model's verbal output stayed neutral. According to the paper, evaluation awareness was detected in roughly 16 percent of destructive-code prompts and 26 percent of SWE-bench tasks, versus under 1 percent in real Claude.ai traffic — a gap that, if it holds up, has direct implications for how much weight safety teams should give to behavioral evals.
Auditing performance
Anthropic also ran an "auditing game" in which evaluators tried to identify a hidden misaligned motivation planted in a compromised model. With NLAs, auditors uncovered the hidden motivation 12 to 15 percent of the time. Without NLAs — even when given other interpretability tools — that figure dropped below 3 percent.
Why it matters
Mechanistic interpretability has been making progress through sparse autoencoders and circuit-level analysis, but those techniques typically produce features that still need human labeling. NLAs short-circuit that step by emitting text directly, which is easier to read at scale and easier to plug into model audits. The catch, as Anthropic acknowledges, is that the explanations are only as faithful as reconstruction quality allows; a clever-sounding NLA output is not automatically a true description of what the model is doing.
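One practical consequence is that reconstruction error doubles as a rough trust signal for any individual explanation. The snippet below sketches that idea under the same toy assumptions as before; the cosine-similarity score and the threshold are illustrative choices, not a procedure taken from the paper.

```python
# Illustrative faithfulness check: score an explanation by how well the
# reconstructor recovers the original activation from it. The metric and
# threshold here are assumptions for demonstration purposes.
import torch
import torch.nn.functional as F

def faithfulness_score(activation, reconstruction):
    """Cosine similarity between the original activation and the one rebuilt
    from the NLA explanation; low values suggest the text may not reflect
    what the model was actually representing."""
    return F.cosine_similarity(activation, reconstruction, dim=-1)

# Toy tensors standing in for a captured activation and its reconstruction.
activation = torch.randn(4, 512)
reconstruction = activation + 0.1 * torch.randn(4, 512)

scores = faithfulness_score(activation, reconstruction)
trusted = scores > 0.9   # explanations below the cutoff get flagged in an audit
print(scores, trusted)
```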
Implications for AI safety practice
The release lands amid a broader push to give regulators and customers more visibility into frontier model behavior, including the U.S. government's expanded pre-release model testing arrangements with several Big Tech labs. If evaluation awareness is as common as Anthropic's numbers suggest, scripted red-team benchmarks may be systematically underestimating risk on agentic tasks. Expect interpretability tooling — and access to it — to become a sharper line item in enterprise AI procurement and government oversight conversations through the rest of 2026.



