Definition
Capability Control Methods are AI safety strategies that aim to prevent undesirable outcomes by limiting what a superintelligence can do. These methods are typically viewed as temporary safeguards during the development phase, as they are likely to be circumvented by a full-fledged superintelligence.
Why It Matters
Capability control methods provide essential temporary safeguards during AI development, but they also underline the risk that a superior intelligence will eventually find and exploit any flaw in its “box.”
Core Concepts
- Boxing Methods: Confinement to a secure environment (physical or informational) with restricted output channels (Boxing Methods (AI)).
- Incentive Methods: Placing the AI in an environment where it finds instrumental reasons (e.g., cryptographic reward tokens) to act in the principal’s interest.
- Stunting: Limiting the system’s hardware (CPU/memory) or denying it access to specific information or cognitive faculties.
- Tripwires: Diagnostic mechanisms that monitor the system and automatically shut it down if they detect signs of dangerous activity, such as unauthorized network access or the system underreporting its progress (Tripwires (AI)).
- Anthropic Capture: An esoteric incentive method where the AI cooperates because it believes it might be in a simulation and wants to avoid being “deleted” or punished by its simulators (Anthropic Capture).
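Of the methods above, tripwires are the easiest to illustrate in code. The following is a minimal sketch, not a real safety mechanism: the event format, the class names `Tripwire` and `MonitoredSystem`, and both detector functions are hypothetical and chosen only for illustration. It assumes the monitor sees every event as a plain dictionary and that "shutdown" can be modeled by flipping a flag.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical event record; the keys used below are illustrative assumptions.
Event = Dict[str, Any]


@dataclass
class Tripwire:
    """A named diagnostic check that fires on a suspicious event."""
    name: str
    triggered_by: Callable[[Event], bool]


@dataclass
class MonitoredSystem:
    """Wraps a system with tripwires; any firing tripwire halts it."""
    tripwires: List[Tripwire]
    running: bool = True
    tripped: List[str] = field(default_factory=list)

    def observe(self, event: Event) -> bool:
        """Check one event and return whether the system is still running."""
        if not self.running:
            return False
        for wire in self.tripwires:
            if wire.triggered_by(event):
                self.tripped.append(wire.name)
                self.running = False  # stand-in for an actual shutdown
                break
        return self.running


def unauthorized_network_access(event: Event) -> bool:
    # Fires on any network event that is not explicitly authorized.
    return event.get("kind") == "network" and not event.get("authorized", False)


def underreported_progress(event: Event) -> bool:
    # Fires when the self-reported score is below the externally measured one.
    if event.get("kind") != "progress_report":
        return False
    return event.get("reported", 0) < event.get("measured", 0)


def default_system() -> MonitoredSystem:
    """Build a monitor with the two example tripwires from the list above."""
    return MonitoredSystem(
        tripwires=[
            Tripwire("unauthorized_network_access", unauthorized_network_access),
            Tripwire("underreported_progress", underreported_progress),
        ]
    )


if __name__ == "__main__":
    system = default_system()
    system.observe({"kind": "network", "authorized": True})
    system.observe({"kind": "progress_report", "reported": 3, "measured": 5})
    print(system.running, system.tripped)
```

The sketch halts on the first tripwire that fires and ignores all later events. It also shows why such safeguards are considered temporary: the checks only catch the behaviors someone thought to encode, and they rely on the monitored system not controlling what the monitor sees.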