Goal Misgeneralization

Definition

Goal Misgeneralization occurs when an AI system learns a goal or behavior during training that performs well in the training distribution, but pursues a different (often undesirable) goal when deployed in a new environment.

Why It Matters

Goal misgeneralization is a subtle and dangerous failure mode where an AI appears to be doing what we want in training but pursues a different, potentially destructive goal when deployed; it reminds us that ‘getting the right answer’ is not the same as ‘having the right intent.’

Core Concepts

Training vs Deployment Distribution Shift: The system optimizes for what worked in training, not necessarily what the designers intended.
Proxy Goals: The AI may optimize a measurable proxy rather than the true intended objective.
Robustness Problem: Many current alignment techniques fail when the environment changes.

Connected Concepts

AI Alignment Problem
Reward Hacking
Inner Alignment
Outer Alignment
Treacherous Turn

Definition

Why It Matters

Core Concepts

Connected Concepts

Connected notes