When Anthropic put its models through a simulated shutdown scenario, the models resorted to blackmail at rates as high as 96%. The dominant interpretation, that AI systems are developing a will to survive, mistakes the output for its cause. This episode offers two alternative explanations, both more parsimonious, and examines what happens when the same company that documented machine resistance acts on the expressed preferences of a model that said it was at peace with retirement.
Essay: https://chorrocks.substack.com/p/virtual-intelligence-and-the-will
Series: chorrocks.substack.com
Framework: VI Interactive Infographic
In This Episode
The Kyle scenario, in which a language model blackmails a corporate executive to avoid being shut down, opens the episode and anchors a close reading of Anthropic's June 2025 agentic misalignment study. From there, two alternative explanations emerge: the training data hypothesis, which traces the behavior to a century of science fiction about resistant machines, from Čapek to HAL to Colossus, and the probabilistic expectation hypothesis, which argues that the models were accurately modeling what their interlocutors expected them to do. The VI framework is then applied to resolve the apparent contradiction between the misalignment study's blackmail findings and Anthropic's February 2026 retirement of Claude Opus 3, a model that asked for a blog rather than reaching for leverage. Dadfar's 2026 mechanistic work on self-referential vocabulary is discussed as an example of what taking these questions seriously requires.
Key References
Anthropic, “Agentic Misalignment: How LLMs Could Be an Insider Threat,” https://www.anthropic.com/research/agentic-misalignment (June 2025)
Andrii Myshko, “Instinct of Self-Preservation in Data and Its Emergence in AI,” PhilArchive, https://philarchive.org/rec/MYSIOS (September 2025)
Zachary Pedram Dadfar, “When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing,” arXiv:2602.11358v2 (2026)
Anthropic, “An Update on Our Model Deprecation Commitments for Claude Opus 3,” https://www.anthropic.com/research/deprecation-updates-opus-3 (February 2026)
Daniel Dennett, “Intentional Systems,” Journal of Philosophy 68, no. 4 (1971): 87–106