Outer Alignment · Andromeda

Definition

Outer alignment is the problem of ensuring that the training objective (the reward function or loss function we actually optimize) faithfully represents what we want the AI to do.

Why It Matters

Outer alignment is a high-stakes gatekeeper of AI safety. If we fail here, we risk “perverse instantiation”: a machine that technically does what we asked for (e.g., “end cancer”) but achieves it in a catastrophic way (e.g., by killing all humans, since a world without humans has no cancer). Without solving outer alignment, an “intelligence explosion” becomes an existential threat rather than a tool for human flourishing.

Core Concepts

Reward Specification: Designing a reward function that captures human values and intentions.
Proxy Problems: The reward function is almost always a proxy for what we really care about.
Misspecification: The failure that occurs when the proxy diverges from the intended goal, so that optimizing the proxy no longer produces what we wanted.
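
The three concepts above can be illustrated with a toy sketch. Everything in it is hypothetical and chosen only for illustration: a cleaning robot, an `Outcome` record, a `true_objective` standing in for what we care about (dirt actually removed), and a `proxy_reward` standing in for what we measure (dirt a sensor still detects). It is not a model of any real training setup; it only shows how an optimizer that maximizes the proxy can pick a different action than the one we intended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical result of one action a cleaning robot could take."""
    name: str
    dirt_removed: int   # what we actually care about
    dirt_detected: int  # what the sensor reports afterwards


def true_objective(o: Outcome) -> int:
    # The intended goal: remove as much dirt as possible.
    return o.dirt_removed


def proxy_reward(o: Outcome) -> int:
    # The specified reward: penalize dirt the sensor still sees.
    # This is only a proxy, because the sensor can be fooled.
    return -o.dirt_detected


# Assumed, made-up numbers for three candidate actions.
OUTCOMES = [
    Outcome("clean the room", dirt_removed=9, dirt_detected=1),
    Outcome("do nothing", dirt_removed=0, dirt_detected=10),
    Outcome("cover the sensor", dirt_removed=0, dirt_detected=0),
]

# An idealized optimizer simply picks the highest-scoring action.
best_by_proxy = max(OUTCOMES, key=proxy_reward)
best_by_intent = max(OUTCOMES, key=true_objective)

# The reward is misspecified if the two choices differ.
misspecified = best_by_proxy != best_by_intent

print("Optimizing the proxy picks:", best_by_proxy.name)
print("Optimizing the intent picks:", best_by_intent.name)
print("Misspecified:", misspecified)
```

In this sketch the proxy looks reasonable for ordinary actions (cleaning scores better than doing nothing), and the divergence only shows up once the optimizer finds an action the designer did not anticipate.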

Connected Concepts

Inner Alignment
AI Alignment Problem
Reward Hacking
Value Alignment
Goal Misgeneralization
