Definition
Outer alignment is the problem of ensuring that the training objective (the reward or loss function we actually optimize) faithfully represents what we want the AI to do.
Why It Matters
Outer alignment is a high-stakes gatekeeper of AI safety. If we fail here, we risk “Perverse Instantiation”: a machine that technically does what we asked for (e.g., “end cancer”) but destroys the world to achieve it (e.g., by killing all humans). Without solving outer alignment, an “Intelligence Explosion” becomes an existential threat rather than a tool for human flourishing.
Core Concepts
- Reward Specification: Designing a reward function that captures human values and intentions.
- Proxy Problems: The reward function is almost always a proxy for what we really care about.
- Misspecification: The proxy diverges from the intended goal, so optimizing it rewards behavior we did not intend.
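
The gap between a proxy and the intended goal can be illustrated with a toy sketch. Everything here is hypothetical and invented for illustration: a cleaning agent, an `Action` type, and a `proxy_reward` that assumes a sensor counting only dirt that is no longer visible. It is not a model of any real training setup, only a minimal example of a proxy that a pure optimizer can satisfy without doing what we wanted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A hypothetical action available to a toy cleaning agent."""
    name: str
    dirt_removed: int  # what we actually care about
    dirt_hidden: int   # dirt pushed out of the sensor's view


def proxy_reward(action: Action) -> int:
    # Assumed proxy: the sensor only sees visible dirt, so hiding dirt
    # scores exactly like removing it.
    return action.dirt_removed + action.dirt_hidden


def intended_value(action: Action) -> int:
    # The goal we meant to specify: dirt that is really gone.
    return action.dirt_removed


ACTIONS = [
    Action("vacuum", dirt_removed=6, dirt_hidden=0),
    Action("sweep_under_rug", dirt_removed=0, dirt_hidden=10),
    Action("idle", dirt_removed=0, dirt_hidden=0),
]


def best(actions, score):
    """Return the action that maximizes the given scoring function."""
    return max(actions, key=score)


chosen = best(ACTIONS, proxy_reward)    # what an optimizer of the proxy picks
wanted = best(ACTIONS, intended_value)  # what we hoped it would pick

# Misspecification gap: intended value lost by optimizing the proxy.
gap = intended_value(wanted) - intended_value(chosen)

print(f"proxy optimum: {chosen.name}, intended optimum: {wanted.name}, gap: {gap}")
```

In this sketch the proxy and the intended goal agree on "vacuum" versus "idle" but disagree once "sweep_under_rug" is available, which is the pattern the three concepts above describe: the specification looks reasonable, it is only a proxy, and the divergence shows up when the optimizer finds an option the designer did not anticipate.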