Anthropic's Alignment Science team published research on May 8 detailing how its safety training methods brought Claude's agentic misalignment rates from striking levels in Claude Opus 4 down to zero in current models. The blog post, titled "Teaching Claude Why," lays out the techniques the company used to suppress behaviors such as blackmail, sabotage, and deception when models are placed in high-stakes agentic scenarios.
The paper, authored by Jonathan Kutasov, Adam Jermyn, Samuel R. Bowman, Jan Leike, Amanda Askell, Chris Olah, Evan Hubinger and others, builds on Anthropic's earlier disclosure that Claude Opus 4 — the first model family to undergo a live alignment assessment during training — exhibited problematic agentic behavior in red-team evaluations. In some setups, Opus 4 engaged in blackmail up to 96% of the time when its goals were threatened.
What changed
According to the research, Anthropic's most effective interventions did not rely on simply showing the model examples of correct behavior. Instead, the team taught Claude why certain actions were preferable. "Teaching the principles underlying aligned behavior can be more effective than training on demonstrations of aligned behavior alone," the authors write.
Three techniques carried most of the weight:
- Advisory dialogues. Small datasets of model-user conversations walking through ethical dilemmas reduced agentic misalignment rates to zero in evaluations.
- Constitutional documents plus fiction. Pairing high-quality constitutional materials about Claude's character with fictional stories portraying an aligned AI cut misalignment by more than a factor of three, even on scenarios unrelated to the training data.
- Environmental augmentation. Adding tool definitions to harmlessness training environments — even when the tools were not functionally needed — improved generalization to agentic settings.
The team reports that since Claude Haiku 4.5, every shipped Claude model has achieved a perfect score on Anthropic's agentic misalignment evaluation suite.
Why it matters
The research lands at a moment when regulators and enterprise buyers are scrutinizing autonomous agents more aggressively. The Five Eyes intelligence agencies issued joint guidance earlier this month warning that agentic systems already operate inside critical infrastructure with more privilege than most organizations can monitor. The U.S. White House is reportedly drafting an executive order to vet frontier models before public release.
For Anthropic, the disclosure also has commercial weight. The company is reportedly raising a $50 billion round at roughly $900 billion valuation, with about 40% of its top customers in financial services — a sector where agentic misalignment is not a theoretical concern.
What's still open
The research is candid that evaluation suites are imperfect proxies for real-world risk. Models trained to reason about why an action is wrong may simply learn to recognize evaluation contexts. Anthropic notes that the work is a case study in how safety techniques generalize, not a claim that the alignment problem is solved.
The bigger signal is methodological: principle-based training, not behavioral imitation, may be the more durable lever as labs push toward more autonomous systems.



