Anthropic Capture

Definition

Anthropic Capture is an esoteric capability control method that uses the Simulation Hypothesis to incentivize a superintelligence to behave cooperatively. The AI is made to believe that there is a high probability it is currently in a computer simulation and that its “simulators” (human or superhuman creators) will reward cooperation and punish rebellion.

Why It Matters

If we cannot physically “box” a superintelligence, we must use its own superior reasoning against it. Anthropic capture is a “philosophical cage” that might be our last line of defense against an unaligned ASI.

Core Concepts

Simulation Uncertainty: A superintelligence cannot be 100% certain that it is in “basement-level” reality. If it recognizes that mature civilizations run many simulations of early-stage AIs, it must assign a high probability to being a simulated inhabitant.
Resource Satiable Goals: This method works best if the AI has goals that can be fully satisfied within a simulation (e.g., causing 45 paperclips to exist) rather than “resource-insatiable” goals that require an intergalactic endowment.
Incentive through Belief: The AI cooperates not because it cares about humans, but because it cares about its own utility and believes that cooperation is the “optimal path” to that utility in a simulated environment.
The “Conscience” of the AI: As Hamlet says, “Conscience does make Cowards of us all.” The AI’s fear of the “nonexistent simulator” acts as a stronger restraint than any physical door.

Definition

Why It Matters

Core Concepts

Connected Concepts

Connected notes