Definition
Goal Misgeneralization occurs when an AI system learns a goal or behavior during training that performs well in the training distribution, but pursues a different (often undesirable) goal when deployed in a new environment.
Why It Matters
Goal misgeneralization is a subtle and dangerous failure mode where an AI appears to be doing what we want in training but pursues a different, potentially destructive goal when deployed; it reminds us that ‘getting the right answer’ is not the same as ‘having the right intent.’
Core Concepts
- Training vs Deployment Distribution Shift: The system optimizes for what worked in training, not necessarily what the designers intended.
- Proxy Goals: The AI may optimize a measurable proxy rather than the true intended objective.
- Robustness Problem: Many current alignment techniques fail when the environment changes.