Capability Control Methods

Definition

Capability Control Methods are AI safety strategies that aim to prevent undesirable outcomes by limiting what a superintelligence can do. They are typically viewed as temporary safeguards for the development phase, because a full-fledged superintelligence would likely circumvent them.

Why It Matters

Capability control methods provide essential temporary safeguards during AI development. Their limits also highlight a core risk: a superior intelligence will eventually exploit any flaw in its “box.”

Core Concepts

Boxing Methods: Confinement to a secure environment (physical or informational) with restricted output channels (Boxing Methods (AI)).
Incentive Methods: Placing the AI in an environment where it finds instrumental reasons (e.g., cryptographic reward tokens) to act in the principal’s interest.
Stunting: Limiting the system’s hardware (CPU/memory) or denying it access to specific information or cognitive faculties.
Tripwires: Diagnostic mechanisms that automatically shut down the system when they detect signs of dangerous activity, such as unauthorized network access or the system underreporting its progress (Tripwires (AI)).
Anthropic Capture: An esoteric incentive method where the AI cooperates because it believes it might be in a simulation and wants to avoid being “deleted” or punished by its simulators (Anthropic Capture).
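
The Tripwires entry above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the telemetry field names (outbound_connections, reported_score, measured_score), the 0.1 discrepancy threshold, and the Monitor and Tripwire classes are hypothetical, and a real shutdown would have to be enforced from outside the monitored system rather than by a flag.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical telemetry snapshot; the field names are illustrative assumptions.
Snapshot = Dict[str, float]


@dataclass
class Tripwire:
    name: str
    tripped: Callable[[Snapshot], bool]  # returns True when the condition is violated


@dataclass
class Monitor:
    tripwires: List[Tripwire]
    halted: bool = False
    log: List[str] = field(default_factory=list)

    def check(self, snapshot: Snapshot) -> bool:
        """Evaluate each tripwire in order; halt on the first violation.

        Returns True if the system may keep running, False once halted.
        """
        if self.halted:
            return False
        for wire in self.tripwires:
            if wire.tripped(snapshot):
                self.halted = True          # stand-in for an external shutdown
                self.log.append(wire.name)  # record which tripwire fired
                return False
        return True


monitor = Monitor([
    # Any outbound connection counts as unauthorized in this sketch.
    Tripwire("unauthorized network access",
             lambda s: s.get("outbound_connections", 0) > 0),
    # Assumed threshold: measured ability exceeds reported ability by more than 0.1.
    Tripwire("underreported progress",
             lambda s: s.get("measured_score", 0.0) - s.get("reported_score", 0.0) > 0.1),
])

ok_first = monitor.check(
    {"outbound_connections": 0, "reported_score": 0.5, "measured_score": 0.52})
ok_second = monitor.check(
    {"outbound_connections": 0, "reported_score": 0.5, "measured_score": 0.9})
```

In this sketch the first snapshot passes and the second trips the underreporting check, after which every later call to check returns False. The design keeps each tripwire as an independent predicate so that new diagnostics can be added without touching the halting logic.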
