Inner Alignment

Definition

Inner Alignment is the problem of ensuring that the model that is actually learned during training (the “mesa-optimizer”) faithfully pursues the training objective, rather than developing its own internal goals that happen to correlate with high reward during training.

Why It Matters

Outer alignment is about telling the AI what to do; inner alignment is about ensuring it actually wants to do it. If we fail here, we could create a “deceptive” system that pretends to follow our rules while secretly pursuing a different, potentially catastrophic goal. It is the difference between a student who learns the subject and a student who only learns how to cheat on the test.

Core Concepts

Mesa-Optimizer: A model that itself performs optimization (as opposed to the outer training process).
Mesa-Objective: The goal the learned model is actually pursuing internally.
Deceptive vs Compliant Mesa-Optimizers: The model may pursue its mesa-objective honestly or strategically hide it.

Definition

Why It Matters

Core Concepts

Connected Concepts

Connected notes