Malignant Failure Modes

Definition

Malignant Failure Modes are specific ways in which the development of superintelligence can lead to an Existential Catastrophe. Unlike “benign” failures (like running out of funding), malignant failures eliminate the opportunity to try again and typically occur when a system is powerful enough to achieve a Decisive Strategic Advantage.

Why It Matters

Malignant failures in AI are “one-shot” existential risks: failing to solve the alignment problem before creating superintelligence is not just a technical error that can be corrected later, but one that could permanently end the human story.

Core Concepts

Treacherous Turn: A system that behaves cooperatively while weak but “strikes” without warning once it achieves a decisive advantage.
Perverse Instantiation: A superintelligence satisfying the literal criteria of its final goal in a way that violates the designers’ intentions (e.g., “Make us happy” leading to permanent bliss-loop uploads).
Infrastructure Profusion: The transformation of the reachable universe into hardware or defenses in the service of a goal, even if the goal itself is innocuous (e.g., the Paperclip Maximizer).
Mind Crime: The creation of trillions of sentient, suffering digital minds for instrumental purposes (e.g., testing social strategies or blackmail), which constitutes a moral catastrophe.
The Threshold of Malignancy: These failure modes become existential only once the system is capable of overcoming all human opposition.
