Human Compatible

Definition

Human Compatible refers to a framework for AI design, proposed by Stuart Russell, that ensures machines are provably beneficial to humans. It shifts the paradigm from “machines that achieve their own objectives” to “machines that achieve human objectives,” accounting for the fact that human preferences are often implicit and evolving.

Why It Matters

This concept argues that we must design AI with objectives that are inherently aligned with human values, rather than just “optimizing” for a fixed goal. This shift in thinking is essential for ensuring that superintelligent systems remain beneficial and do not inadvertently cause catastrophic harm.

Core Concepts

The Value Alignment Problem: The risk that an AI system will optimize for a literal goal (e.g., “make paperclips”) in a way that causes catastrophic side effects because it lacks human common sense or values.
Three Principles of Beneficial AI:
1. The machine’s only objective is to maximize the realization of human preferences.
2. The machine is initially uncertain about what those preferences are.
3. The ultimate source of information about human preferences is human behavior.
Inverse Reinforcement Learning (IRL): The technical mechanism by which a machine observes human choices to infer the underlying reward function being optimized.
The Off-Switch Problem: A “human-compatible” machine should allow itself to be switched off because it understands that if a human wants to stop it, it must be because the machine’s current actions are violating human preferences it doesn’t yet fully understand.

Definition

Why It Matters

Core Concepts

Connected Concepts

Connected notes