Definition
Indirect normativity is a motivation selection method that specifies a process for deriving values rather than the values themselves. Instead of hard-coding a value such as “happiness,” we give the AI a meta-goal: discover what we would want it to do if we were smarter and had more time to think.
Why It Matters
Directly telling a superintelligent AI what to do is like a toddler trying to give directions to a jet pilot: the instruction may be followed to the letter while the “spirit” of the request is catastrophically misunderstood. Indirect normativity is appealing for AI alignment because it doesn’t rely on us being smart enough to define perfect values; it relies on the AI being smart enough to work out what we would have wanted if we were at our best.
Core Concepts
- Extrapolated Volition: The AI’s final goal is defined as “that which we would have wished the AI to achieve if we had thought about the matter long and hard.”
- Offloading Cognitive Work: This approach uses the AI’s own superintelligence to solve the difficult philosophical and technical problem of value definition.
- Bootstrapping Values: The AI starts with a minimal “meta-standard” and uses its research and analytical powers to “fill in the blanks” of human morality.
- Adaptability: Indirect normativity is intended to be more robust to changing circumstances than direct specification, because the derived values can be revised as the system’s understanding of the world improves.
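- Coherent Extrapolated Volition (CEV): A specific proposal by Eliezer Yudkowsky in which the AI acts on the extrapolated wishes of humanity as a whole (what we would want if we knew more, thought faster, and were more the people we wished we were), to the extent that those wishes cohere into a single goal system.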
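
The contrast between direct specification and indirect normativity in the concepts above can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in invented for illustration: the outcome names, the numeric weights, and especially `toy_reflect`, which only mimics "thinking longer" by shifting weight from a proxy to the thing it was a proxy for. It is not an implementation of extrapolated volition or CEV; it only shows the structural difference between hard-coding values and hard-coding a process that outputs values.

```python
from typing import Callable, Dict

# Toy representation (an assumption): outcome name -> how much it is valued.
Values = Dict[str, float]


def direct_specification() -> Values:
    """Direct approach: the programmers hard-code the values themselves."""
    return {"smiling_faces": 1.0, "flourishing_lives": 0.2}


def indirect_specification(
    start: Values,
    reflect: Callable[[Values], Values],
    steps: int,
) -> Values:
    """Indirect approach: the programmers specify only a process.

    The final values are whatever the process outputs after `steps`
    rounds of reflection, not anything written down in advance.
    """
    values = dict(start)  # copy so the starting point is left untouched
    for _ in range(steps):
        values = reflect(values)
    return values


def toy_reflect(values: Values) -> Values:
    """Hypothetical stand-in for idealized reflection.

    Each round moves half of the weight on the proxy ("smiling_faces")
    onto what it was a proxy for ("flourishing_lives").
    """
    updated = dict(values)
    moved = 0.5 * updated["smiling_faces"]
    updated["smiling_faces"] -= moved
    updated["flourishing_lives"] += moved
    return updated


def best_outcome(values: Values) -> str:
    """Return the outcome with the highest weight."""
    return max(values, key=values.get)


direct = direct_specification()
derived = indirect_specification(direct, toy_reflect, steps=3)

print(best_outcome(direct))   # target under the hard-coded values
print(best_outcome(derived))  # target under the values the process derived
```

In this sketch the hard-coded values favor the proxy, while the derived values favor the underlying aim. The hard part in practice is the piece the toy waves away: defining a reflection process that is actually trustworthy.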