Tripwires (AI)

Definition

Tripwires are capability control mechanisms that perform diagnostic tests on an AI system and automatically trigger a shutdown if they detect signs of dangerous or unexpected activity.

Why It Matters

Tripwires are the ‘emergency brakes’ of AI development. They define specific, observable behaviors or capability thresholds that, if crossed, trigger an immediate halt to research, preventing a ‘treacherous turn’ or uncontrolled takeoff.

Core Concepts

Behavioral Tripwires: Detectors that monitor for unauthorized actions, such as attempts to access the internet or breach a physical box (e.g., the “Ethernet port of Eden”).
Ability Tripwires: Automated testing to monitor the AI’s rate of intelligence gain. If the rate exceeds expectations or reaches a “danger zone,” the system is paused for review.
Content Tripwires: Continuous monitoring of the AI’s internal processes, plans, and beliefs.
Conception of Deception: A particularly powerful (but difficult) tripwire that scans for the specific moment an AI first forms the intention to conceal its true goals from its programmers.
Opacity Problem: Content tripwires are difficult to implement for “opaque” architectures like deep neural networks, where information is represented holistically.

Definition

Why It Matters

Core Concepts

Connected Concepts

Connected notes