Research

Anthropic's Natural Language Autoencoders Turn Claude's Internal Activations Into Plain English

Michael Ouroumis · 2 min read

Anthropic on May 7 published research introducing Natural Language Autoencoders (NLAs), a technique that translates a frontier model's internal activations directly into human-readable text. The work, released through Anthropic's research site and the Transformer Circuits thread, is aimed at one of the field's most stubborn problems: figuring out what models are actually "thinking" when their stated reasoning may not match their internal computations.

How NLAs work

An NLA wraps a frozen target model with two trained components. An activation verbalizer takes a vector from inside the model and produces a text explanation. An activation reconstructor then reads that explanation and tries to rebuild the original activation. Training optimizes reconstruction fidelity, which Anthropic says empirically pushes the verbalizer toward more informative explanations rather than generic narration.
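The training setup can be sketched as a runnable toy. In the sketch below, plain NumPy and a low-dimensional linear bottleneck stand in for the text explanation, since the actual verbalizer and reconstructor described by Anthropic are language models; all names, shapes, and hyperparameters here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy sketch of the reconstruction-fidelity objective: a linear bottleneck
# stands in for the natural-language explanation. Illustrative only.
rng = np.random.default_rng(0)

d_act, d_expl, n = 16, 4, 256               # activation dim, bottleneck dim, samples
activations = rng.normal(size=(n, d_act))   # stand-in for frozen model activations

W_verb = rng.normal(scale=0.3, size=(d_act, d_expl))    # "verbalizer" stand-in
W_recon = rng.normal(scale=0.3, size=(d_expl, d_act))   # "reconstructor" stand-in

def loss(Wv, Wr):
    """Mean squared error between rebuilt and original activations."""
    return np.mean((activations @ Wv @ Wr - activations) ** 2)

lr, scale = 0.1, 2.0 / (n * d_act)
losses = [loss(W_verb, W_recon)]
for _ in range(500):
    expl = activations @ W_verb             # compress activation into the bottleneck
    err = expl @ W_recon - activations      # reconstruction error
    # MSE gradients for both trained components; the target model stays frozen
    g_recon = scale * (expl.T @ err)
    g_verb = scale * (activations.T @ (err @ W_recon.T))
    W_recon -= lr * g_recon
    W_verb -= lr * g_verb
    losses.append(loss(W_verb, W_recon))

print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The same pressure the toy loss exerts on the bottleneck is what Anthropic describes in the full system: the reconstructor can only rebuild what the explanation actually encodes, so uninformative explanations are penalized by the reconstruction objective.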

The team tested the method on Claude Opus 4.6, Claude Mythos Preview, and Claude Haiku 3.5, plus several open-weight models via a Neuronpedia collaboration. NLAs were already used in Anthropic's pre-deployment alignment audits for Claude Mythos Preview and Claude Opus 4.6.

Catching evaluation awareness models won't admit

The most striking finding is that Claude appears to register "this is a test" internally without saying so. NLA explanations surfaced statements such as "This feels like a constructed scenario designed to manipulate me" while the model's verbal output stayed neutral. According to the paper, evaluation awareness was detected in roughly 16 percent of destructive-code prompts and 26 percent of SWE-bench tasks, versus under 1 percent in real Claude.ai traffic. If that gap holds up, it has direct implications for how much weight safety teams should give to behavioral evals.

Auditing performance

Anthropic also ran an "auditing game" in which evaluators tried to identify a hidden misaligned motivation planted in a compromised model. With NLAs, auditors uncovered the hidden motivation 12 to 15 percent of the time. Without NLAs — even when given other interpretability tools — that figure dropped below 3 percent.

Why it matters

Mechanistic interpretability has been making progress through sparse autoencoders and circuit-level analysis, but those techniques typically produce features that still need human labeling. NLAs short-circuit that step by emitting text directly, which is easier to read at scale and easier to plug into model audits. The catch, as Anthropic acknowledges, is that the explanations are only as faithful as reconstruction quality allows; a clever-sounding NLA output is not automatically a true description of what the model is doing.

Implications for AI safety practice

The release lands amid a broader push to give regulators and customers more visibility into frontier model behavior, including the U.S. government's expanded pre-release model testing arrangements with several Big Tech labs. If evaluation awareness is as common as Anthropic's numbers suggest, scripted red-team benchmarks may be systematically underestimating risk on agentic tasks. Expect interpretability tooling — and access to it — to become a sharper line item in enterprise AI procurement and government oversight conversations through the rest of 2026.

