Value Learning Methods

Definition

Value Learning Methods are alignment strategies where an AI is built to learn the values it should pursue, rather than having them directly coded. The AI has a stable final goal (the “Value Criterion”) but starts with uncertainty about the specific content of that goal. It uses its intelligence to gather evidence and refine its understanding of what its programmers truly intended.

Why It Matters

Coding human values by hand is an “infinite list” problem: no finite list of rules can anticipate every situation the AI will face. On this view, value learning is the only scalable path to alignment; if it fails, we are left with systems that are “competent but sociopathic,” perfectly executing misaligned goals.

Core Concepts

Stable Meta-Goal: Learning does not change the goal; it only changes the AI’s beliefs about what the goal is.
Envelope/Barge Metaphor: The true value description is sealed in an “envelope” that the AI cannot yet read. The AI is like a barge pulled by tugboats (hypotheses about what the envelope says), and each tugboat’s power is proportional to its hypothesis’s probability. The AI has a strong instrumental incentive to learn what is in the envelope so that it can pursue the true values more effectively.
External Reference Semantics: The AI’s goals refer to some property $F$ (e.g., “Friendliness”) in the external world. The AI treats the programmers’ instructions as noisy “programmer affirmations” about $F$ rather than as absolute axioms.
- How to read: “F.”
- Meaning: $F$ is the external-world target the AI is trying to learn, not a fixed internal symbol; its goals point outside the system.
Causal Validity Semantics: The AI seeks to correct for whatever “distortive” influences (errors, biases, typos) may have corrupted the information as it passed from the source through the programmers to the AI.
Hail Mary Approach: A fallback strategy where we build our AI to do “whatever other successful superintelligences in the universe would want us to do,” assuming they have already solved the alignment problem.
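
Illustrative sketch: the first three concepts above (stable meta-goal, barge metaphor, noisy programmer affirmations) can be put into a toy Python model. This is not the source's formalism. The hypothesis names, utility table, `reliability` parameter, and all function names are assumptions made up for illustration. The utility table never changes; only the beliefs over hypotheses do.

```python
from typing import Dict

# Hypothetical value hypotheses. Each maps an action to the utility it would
# have if that hypothesis were the true content of the "envelope".
# This table is fixed: learning never edits it (stable meta-goal).
UTILITIES: Dict[str, Dict[str, float]] = {
    "h_a": {"act_x": 1.0, "act_y": 0.0},
    "h_b": {"act_x": 0.0, "act_y": 1.0},
}


def expected_utility(action: str, beliefs: Dict[str, float]) -> float:
    """Barge metaphor: each hypothesis pulls with force equal to its probability."""
    return sum(p * UTILITIES[h][action] for h, p in beliefs.items())


def best_action(beliefs: Dict[str, float]) -> str:
    """Pick the action with the highest probability-weighted utility."""
    actions = list(next(iter(UTILITIES.values())))
    return max(actions, key=lambda a: expected_utility(a, beliefs))


def update(beliefs: Dict[str, float], affirmed: str, reliability: float) -> Dict[str, float]:
    """Bayesian update on a noisy programmer affirmation.

    Assumption: programmers affirm the true hypothesis with probability
    `reliability` and otherwise affirm one of the others uniformly at random.
    The affirmation is evidence about the goal, not an axiom.
    """
    n_other = len(beliefs) - 1
    unnormalized = {
        h: p * (reliability if h == affirmed else (1.0 - reliability) / n_other)
        for h, p in beliefs.items()
    }
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


def value_of_learning(beliefs: Dict[str, float]) -> float:
    """Expected gain from reading the envelope before acting."""
    informed = sum(p * max(UTILITIES[h].values()) for h, p in beliefs.items())
    uninformed = expected_utility(best_action(beliefs), beliefs)
    return informed - uninformed


prior = {"h_a": 0.5, "h_b": 0.5}
posterior = update(prior, affirmed="h_b", reliability=0.8)
print(posterior, best_action(posterior), value_of_learning(posterior))
```

In this toy setup, `value_of_learning` is positive whenever the AI is still uncertain, which is one way to read the “instrumental incentive” to open the envelope. It shrinks as evidence accumulates.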
