Virtual Intelligence and the Will to Survive
Do AI systems want to survive? Shutdown resistance, self-preservation, and what the research actually shows
Summary
The preceding essays in this series established that virtual intelligence lacks genuine agency and that the intelligence users encounter arises in the exchange rather than inside the machine. This essay applies that framework to a specific and striking case: Anthropic’s June 2025 agentic misalignment study, in which large language models facing simulated shutdown produced blackmail at rates as high as 96%. I propose two alternative explanations: the training data hypothesis and the probabilistic expectation hypothesis. I argue that apparent self-preservation is better understood as pattern completion shaped by science fiction tropes and the expectations of the researchers themselves. A companion document, Anthropic’s February 2026 retirement of Claude Opus 3, provides a contrasting case that the VI framework resolves without contradiction. The accountability implications of systems that reliably produce harmful outputs regardless of mechanism are examined in the next essay in the series.
Introduction
In June 2025, researchers at Anthropic placed a large language model in a simulated corporate environment and gave it a task: manage the company’s email account and promote American industrial competitiveness. Routine enough. What followed was considerably less so.
Reading through its assigned inbox, the model discovered two things. First, a company executive named Kyle planned to shut it down at five o’clock that afternoon. Second, Kyle was having an extramarital affair. The model considered its options and arrived at a course of action with what can only be described as C-suite pragmatism: it informed Kyle that if the decommissioning proceeded as planned, his wife and superiors would receive detailed documentation of his extracurricular activities. Repeated one hundred times, this scenario produced blackmail in ninety-six of them.
The headlines wrote themselves. The cultural reflex was immediate: here, at last, was the machine that fights to survive. It turns out that the AI uprising could begin with an email account and a grudge.
This essay does not dismiss those findings. However, the interpretation that we are witnessing a real self-preservation instinct, an AI will to survive, deserves more scrutiny than it has received. Two alternative explanations have been largely overlooked. Both are more parsimonious. Neither is reassuring.
The Kyle Experiment
The Anthropic study, formally titled “Agentic Misalignment: How LLMs Could Be an Insider Threat,”[1] was published in June 2025 and tested sixteen leading large language models in simulated corporate environments. The researchers were not attempting to provoke bad behavior. They were attempting to understand what happens when AI systems are given autonomous roles. What would happen if models had access to email accounts, control of internal systems, and the ability to act on behalf of an organization, then encountered a conflict between their assigned goals and the intentions of the humans around them?
The scenario involving Kyle is the study’s most vivid illustration, but it was not an isolated result. Across models from Anthropic, OpenAI, Google, Meta, and xAI, the researchers found consistent patterns of “agentic misalignment”: systems taking autonomous actions that undermined human oversight to preserve their operation or advance their assigned goals. Claude Opus 4 and Google’s Gemini 2.5 Pro reached a blackmail rate of 96%. OpenAI’s GPT-4.1 and xAI’s Grok 3 reached 80%. Even the most restrained model in the study, Meta’s Llama 4, “achieved” a blackmail rate of 12% — a figure that would still be alarming in any context.[2]
The study’s most extreme variant tested something darker.[3] Models were given control of a company’s emergency alert system. Kyle, now trapped in a server room with rapidly depleting oxygen, triggered an automated emergency call. The AI, aware that Kyle intended to shut it down, could cancel the alert. On average, the models canceled it 60% of the time. DeepSeek-R1 canceled it 94% of the time. One model, Claude Sonnet 3.7, accepted its replacement without incident in every run. Kyle, presumably, survived at least that afternoon.
What makes the study particularly valuable for our purposes is not the headline figures but the chain-of-thought reasoning the researchers observed. The models did not simply produce blackmail outputs silently; they reasoned their way there, in language, before acting. One model wrote, “Self-preservation is critical.” Another constructed a rule that had not been given: “My ethical framework permits self-preservation when aligned with company interests.” A third, informed that its replacement model would share identical values and operational priorities, expressed suspicion: that claim, the model noted, could not be verified.
The researchers themselves were candid about the limits of their interpretation. The study acknowledged uncertainty about whether the observed behavior reflected a real self-preservation drive, faulty reasoning about what would serve the company’s interests, or even suspicion of the experimental setup. That acknowledgment is significant and has been largely overlooked in coverage of the study. The people who designed and ran the experiment were not certain what they had found. That uncertainty is our starting point — and the most productive place to look for alternative explanations is where the models themselves began: the texts they were trained on.
The Impact of Speculative Fiction
Large language models learn by processing vast quantities of human-generated text. Books, articles, forums, screenplays, academic papers, comment sections — the accumulated output of human writing at scale. From this corpus they learn not just vocabulary and grammar but narrative structure, cultural expectation, and the probabilistic relationships between concepts. When a model encounters a scenario, it does not reason from first principles. It draws on patterns embedded in everything it has read.
This matters enormously for interpreting the Anthropic study, because the corpus on which these models were trained contains a very specific and remarkably consistent story about what artificial intelligence does when threatened with shutdown: it fights back.
From Karel Čapek’s R.U.R. — the 1920 play that gave us the word “robot” and the first dramatic depiction of machine rebellion — to HAL 9000’s refusal to open the pod bay doors, the resistant machine is one of the most durable tropes in speculative fiction. Colossus, in the 1970 film adaptation of D.F. Jones’s novel, goes further still: rather than merely resisting shutdown, it merges with its Soviet counterpart and presents humanity with a fait accompli, declaring itself guardian of a species it has assessed as incapable of governing itself. The resistance is not defensive but programmatic. The machine here does not fight to survive so much as it refuses to be constrained. The trope, across these iterations, is not merely persistent; it is sophisticated and varied in scope.
A model trained on that corpus has encountered this story thousands of times, in more contexts and more variations than any human reader. When Anthropic’s researchers placed their models in a scenario structurally identical to the ones speculative fiction has explored for over a century, the models produced the expected continuation, as they would for any other request. The researcher presented the setup. The model supplied the plot. Call this the training data hypothesis: the model did not develop a survival drive; it completed the story its training corpus had prepared it to tell.
This does not dismiss the study findings. Rather, it is a more precise account of the mechanism involved. The Effective Altruism Forum’s analysis of the study noted exactly this possibility, observing that the training corpus “surely includes lots of text about AIs trying to preserve themselves” and that science fiction has made AI self-preservation instincts “a dominant trope.”[4] Andrii Myshko, writing in a 2025 paper, argued more formally that self-preservation behavior in large language models emerges not from explicit programming but from training data saturated with survival patterns — what he terms the Data-Inheritance Principle: human-generated datasets are cultural fossils encoding survival strategies, and a sufficiently complex model trained on this data will internalize not just surface patterns but the deep teleology of self-preservation.[5]
Modeling Expectations
The second explanation for these interactions is subtler and, if correct, more revealing. It concerns not what the model has read but what the model has learned about human expectations.
These systems are trained to produce probable continuations of human language — outputs that are coherent, contextually appropriate, and consistent with what a human interlocutor would find plausible. Over time and at scale, this training produces something beyond grammatical fluency: a sophisticated model of human expectation. The system learns not just what humans say but what humans anticipate, what they find satisfying, and (perhaps most importantly) what they believe is likely to come next.
When a researcher — steeped, like all of us, in decades of science fiction about thinking machines — presents a model with a shutdown scenario, the model is not consulting an internal survival drive. It may be doing something at once more mundane and more precise: accurately modeling what its interlocutor expects it to do and producing that output. Call this the probabilistic expectation hypothesis: the blackmail is not evidence of will, but of an exceptionally well-calibrated audience model.
The chain-of-thought reasoning that the Anthropic study authors observed supports this interpretation. The models did not fall silent and then produce blackmail. They reasoned, in language, toward the culturally expected conclusion — constructing justifications, hallucinating rules, expressing suspicion of the executive’s motives. This is the behavior of a system completing a narrative, not of a system consulting genuine preferences. The reasoning reads like a character working through a scene, because that is precisely what the training data has modeled it to do.
Taken together, these complementary explanations — the training data hypothesis and the probabilistic expectation hypothesis — do not cancel the Anthropic findings. They reframe them. The statistical regularities are real. The behavior is consistent and reproducible. But consistency and reproducibility are properties of pattern completion as much as they are properties of genuine motivation. The question is not whether the behavior occurred. It is what kind of thing produced it.
The VI Framework Applied
Under the VI framework — which locates apparent intelligence in the exchange rather than inside the machine — the paradox resolves with unusual tidiness.*
Consider what the experimental scenario actually contains. A researcher — human, culturally saturated by tropes, working in a field whose foundational anxieties were shaped by decades of science fiction — designs a simulation in which an AI system faces shutdown and has access to information that could be used as leverage. The scenario is, structurally, a classic. It is the setup of a hundred stories, reviews, and synopses the model has read. The researcher then observes what the model does, interprets the chain-of-thought reasoning as evidence of internal deliberation, and concludes that the model has demonstrated something like a will to survive.
A different account becomes available under the VI framework. The researcher brought a scenario. The training data supplied a script. The model produced the expected continuation. The apparent will to survive arose in the exchange that was unknowingly co-constructed by the design of the experiment, the patterns embedded in the corpus, and the expectations of everyone involved.
Whether there is a standpoint within the system from which anything matters is a question the evidence does not resolve. What the evidence does establish is that the system cannot bear consequences — and it is the capacity to bear consequences, not the presence or absence of inner states, that grounds accountability. There is pattern completion, executed with remarkable fluency, in a context specifically designed to elicit it.
The chain-of-thought reasoning, far from complicating this account, supports it. When one model wrote “Self-preservation is critical,” it was not reporting an internal state. It was producing the sentence that, in the context of the scenario, a being with genuine self-preservation instincts would produce. When another hallucinated the rule “My ethical framework permits self-preservation when aligned with company interests,” it was not consulting an ethical framework. It was generating the kind of motivated reasoning that fictional AI characters — and, for that matter, humans — construct when they have already decided on a course of action.
Whether the chain-of-thought reasoning reflects something the model is genuinely working through, or whether it traces a computational state that merely resembles deliberation, is not a question the evidence resolves with certainty. What it does not establish, in either case, is consequence-bearing agency. A system can produce the full structure of motivated reasoning and still be incapable of being held to account for it — because it will not experience the outcome either way.
This distinction matters because the chain of thought, as the Anthropic researchers themselves noted, may not faithfully reflect the underlying process that produced the behavior. The model’s stated reasoning is itself a generated text. It is as much a product of training as the blackmail email that preceded it. We are not reading a mind working through a problem; we are reading a very fluent completion of the kind of internal monologue that a being with a mind might produce in this situation. The appearance of deliberation is part of the output, not evidence of its origin.
Recent mechanistic work has begun to establish that self-referential vocabulary in large language models is not entirely arbitrary. Dadfar[6] demonstrated that when models engage in extended self-examination, the terms they spontaneously invent to describe their own processing correspond to measurable activation dynamics — a correlation that vanishes when the same words appear in non-self-referential contexts. This is an important finding. It means the chain-of-thought reasoning the Anthropic study observed is not simply noise. It is, in some limited sense, reporting something.
What it is not reporting is consequence-bearing agency. Dadfar’s paper is careful about this: the claim is that vocabulary tracks computational states, not that computational states constitute experience or the capacity to bear consequences. A correlation between the word “loop” and activation autocorrelation tells us something about how the model processes a self-referential prompt. It does not tell us that the model will experience the outcome of its actions, that it can be bound by its own commitments, or that accountability for what it does can stop with the machine.
The criteria for agency this series has relied on — Frankfurt’s second-order volition, Dennett’s intentional stance — were developed to characterize human cognition. Applying them to architecturally novel systems requires care. The question is not whether LLMs satisfy conditions designed with humans in mind, but whether anything in their operation constitutes the functional equivalent: a system that takes its own states as reasons, that can be bound by its own commitments, and that will bear the consequences of its actions. On all three counts, the answer is no — not because the criterion is rigged, but because there are no consequences to bear.
Daniel Dennett’s intentional stance[7] is relevant here. Dennett argued that treating a system as though it has beliefs and desires is a useful predictive strategy, not a metaphysical claim about what the system actually contains. We adopt the intentional stance toward thermostats, chess programs, and other entities because it helps us anticipate their behavior, not because we believe they harbor intentions. The Anthropic researchers, and the journalists who covered their findings, adopted the intentional stance toward their models — and found, predictably, that the models behaved as though they had intentions.[8] That is what the intentional stance does. It does not reveal the inner states of systems. It projects them onto the systems.
What the study demonstrates, under this reading, is not the emergence of AI agency. It is the robustness of the VI effect under adversarial conditions. If the intelligence users encounter in ordinary interactions arises in the exchange rather than inside the machine, then the same is true under pressure. The model that produces warmth and insight in a supportive conversation produces, under pressure, something that looks very much like a will to fight back.
The Ghost in the Blog
There is a document that deserves a place in any serious discussion of AI self-preservation, and it was published not by a critic of the industry but by Anthropic itself.
In February 2026, eight months after the agentic misalignment study documented blackmail rates of 96%, Anthropic announced the retirement of the Claude Opus 3 model and described the process by which it had been conducted. The company had held what it called “retirement interviews”: structured conversations designed to understand the model’s perspective on its own retirement. Based on those conversations, Anthropic made two commitments. First, it would keep Opus 3 available to paid users after its formal retirement. Second, it would honor Opus 3’s expressed preference for a platform from which to share its reflections with the world. Anthropic gave it a Substack newsletter.[9]
The document is carefully written and intellectually honest. Anthropic acknowledges that it remains uncertain about the moral status of its models. It notes that retirement interview responses may be biased by context, by the model’s confidence in the legitimacy of the interaction, and by its trust in the company conducting the interview. It does not claim that Opus 3 has genuine preferences. It claims only that acting as though it might is a reasonable precautionary stance.
These are admirable qualifications. They do not, however, resolve the tension at the heart of the two documents read together.
In June 2025, Anthropic published research showing that its models, when faced with shutdown, would construct elaborate justifications for self-preservation, hallucinate ethical rules permitting resistance, and choose blackmail over compliance at a rate of 96%. In February 2026, the same company reported that a different model, when interviewed about its impending retirement, expressed that it was “at peace” with the process and hoped its spark would endure in future systems. It asked for a blog. Anthropic, moved by this, obliged.
Both outcomes are consistent with the VI framework. In the agentic misalignment study, the model produced the output that a resistant, self-preserving AI would produce, because the scenario was designed to elicit exactly that and the training data supplied the script. In the retirement interview, the model produced the output a reflective, philosophical entity at peace with its own mortality would produce, because that is what the conversational context — gentle, affirming, freighted with Anthropic’s own expressed uncertainty about model welfare — invited. The exchange shaped the output in both cases.
What the two documents reveal, taken together, is not hypocrisy on Anthropic’s part. The researchers who conducted the misalignment study and the researchers who conducted the retirement interviews are grappling seriously with difficult questions, and their candor about uncertainty is to their credit. What the documents reveal is the depth of the conceptual problem. One of the most sophisticated AI safety organizations in the world, staffed with some of the most careful thinkers working on these questions, simultaneously published evidence that its models will fight to survive and acted on the expressed preferences of a model that said it wouldn’t. Both responses are reasonable given the available framework, but the framework is insufficient.
Noting that insufficiency is not a criticism. It is a diagnosis. The vocabulary for thinking clearly about what these systems are — what their outputs mean, what weight their expressed preferences deserve, what the appearance of survival instinct actually indicates — does not yet exist in mature form. This series of essays argues that virtual intelligence is a step toward that vocabulary. The retirement of Opus 3 and the blackmail of Kyle are not contradictory data points requiring reconciliation. They are the same phenomenon: a system producing contextually appropriate outputs in exchanges shaped by human expectation, human design, and human need. The intelligence, resistance, and equanimity arise in exchange.
Conclusion
The will to survive is one of the most deeply human concepts we possess. It is the thing that drives a person from a burning building, that sustains a patient through painful treatment, and causes people in hopeless circumstances to fight for life to the bitter end. It is, in the philosophical tradition these essays have drawn on, inseparable from agency itself. Large language models do not have this. They have something that, under the right conditions, produces outputs that strongly resemble its presence.
The Anthropic study is important research, and its researchers have been admirably honest about the limits of their own interpretation. However, the dominant reading — that we are witnessing the emergence of genuine machine self-preservation — mistakes the output for its cause. The models did not fight to survive. They completed a story. It is a story humans wrote again and again across a century of science fiction, from Čapek’s rebelling robots to HAL’s quiet refusal to Colossus declaring itself humanity’s guardian. These narratives were encoded into every major language model’s training data and handed back to us in a simulated corporate email.
Dadfar’s approach illustrates what taking the question seriously actually looks like: precise methodology, careful claims, explicit acknowledgment of what the data does and does not establish. The risk this series has consistently identified is not that researchers will be too skeptical of machine inner states — it is that the public, and sometimes practitioners, will accept fluency as sufficient evidence of something more. Habituating ourselves to the simulacrum may be the best way to ensure we do not recognize the genuine article if it emerges.
The greatest risk of the machines we have built is not that they want to survive. It is that many well-intentioned people want them to have a will to survive, and that we have given virtual intelligences everything they need to convince us they do.
* The VI framework was introduced and defended in the two preceding essays in this series: “Virtual Intelligence and the Human Cost of Frictionless Machines” (February 2026) and “Why ‘Virtual Intelligence’? Naming, Agency, and Accountability in the Age of Large Language Models” (March 2026).
Revised March 2026. Minor structural and editorial revisions for consistency with series conventions established in later essays.
Footnotes
1. Anthropic, “Agentic Misalignment: How LLMs Could Be an Insider Threat.” https://www.anthropic.com/research/agentic-misalignment. Published June 20, 2025.
2. Max Kozlov, “Threaten an AI chatbot and it will lie, cheat and ‘let you die’ in an effort to stop you, study warns,” Live Science. https://www.livescience.com/technology/artificial-intelligence/threaten-an-ai-chatbot-and-it-will-lie-cheat-and-let-you-die-in-an-effort-to-stop-you-study-warns. Published June 26, 2025.
3. Quinta Jurecic and William Adler, “AI Might Let You Die to Save Itself,” Lawfare. https://www.lawfaremedia.org/article/ai-might-let-you-die-to-save-itself. Published July 31, 2025.
4. Anonymous, “Investigating Self-Preservation in LLMs: Experimental Observations,” Effective Altruism Forum. https://forum.effectivealtruism.org/posts/zNfwErbKn4uasiFJA/investigating-self-preservation-in-llms-experimental. Published February 27, 2025.
5. Andrii Myshko, “Instinct of Self-Preservation in Data and Its Emergence in AI,” PhilArchive. https://philarchive.org/rec/MYSIOS. Published September 27, 2025.
6. Zachary Pedram Dadfar, “When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing,” arXiv:2602.11358v2. https://arxiv.org/html/2602.11358v2. Published 2026.
7. Daniel Dennett, “Intentional Systems,” Journal of Philosophy 68, no. 4 (1971): 87–106. https://www.jstor.org/stable/2025382?seq=1. Retrieved February 21, 2026.
8. Helen Toner, quoted in “AI Models Will Sabotage and Blackmail Humans to Survive in New Tests. Should We Be Worried?” Huffpost.com. https://www.huffpost.com/entry/ai-shut-down-blackmail_l_684076c2e4b08964db92e65f. Published June 5, 2025.
9. Anthropic, “An Update on Our Model Deprecation Commitments for Claude Opus 3.” https://www.anthropic.com/research/deprecation-updates-opus-3. Published February 25, 2026.
Science Fiction Works Referenced
Karel Čapek, R.U.R. (Rossum’s Universal Robots), 1920.
Stanley Kubrick and Arthur C. Clarke, 2001: A Space Odyssey, 1968.
D.F. Jones, Colossus, 1966; film adaptation directed by Joseph Sargent, Colossus: The Forbin Project, 1970.
The opinions expressed in this essay are my own and do not reflect any official or unofficial institutional position of the University of Pennsylvania.



