Beneficial AI

Definition

Beneficial AI (or Human-Compatible AI) is a framework proposed by Stuart Russell in which machines are designed so that their actions can be expected to achieve human objectives. Unlike the Standard Model of AI, in which a machine optimizes a fixed objective supplied in advance, the machine here is explicitly uncertain about what those objectives are.

Why It Matters

It offers a route to machines that are safer by design because they do not assume they know exactly what we want. This humility guards against the scenario in which an AI pursues a literally specified objective so single-mindedly that it tramples the human values the objective left out.

Core Concepts

The Three Principles of Beneficial AI:
1. Pure Altruism: The machine’s only objective is to maximize the realization of human preferences.
2. Humble Machines (Uncertainty): The machine is initially uncertain about what those preferences are.
3. Behavioral Inference: The ultimate source of information about human preferences is human behavior.
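
The second and third principles can be illustrated with a small Bayesian update. This is a minimal sketch, not Russell's formalization: it assumes a human who chooses noisily in proportion to exp(beta * utility) (a softmax choice model), and the two preference hypotheses, the option names, and the beta value are all hypothetical. The machine starts with a uniform belief and revises it from observed choices.

```python
import math

# Hypothetical candidate preferences: each maps an option to a utility.
HYPOTHESES = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea": {"coffee": 0.0, "tea": 1.0},
}


def choice_likelihood(utilities, choice, beta=2.0):
    """P(choice | utilities) under an assumed softmax (noisily rational) human."""
    weights = {option: math.exp(beta * u) for option, u in utilities.items()}
    return weights[choice] / sum(weights.values())


def update(belief, choice, beta=2.0):
    """One Bayes update of the belief over hypotheses after observing a choice."""
    unnormalized = {
        name: belief[name] * choice_likelihood(HYPOTHESES[name], choice, beta)
        for name in belief
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


def infer(observed_choices, beta=2.0):
    """Start from a uniform prior (initial uncertainty) and fold in each choice."""
    belief = {name: 1.0 / len(HYPOTHESES) for name in HYPOTHESES}
    for choice in observed_choices:
        belief = update(belief, choice, beta)
    return belief


posterior = infer(["tea", "tea", "coffee", "tea"])
print(posterior)
```

After three tea choices and one coffee choice, the belief shifts strongly toward "likes_tea" but never reaches certainty, which is the property the later deference argument relies on.
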
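Assistance Games (CIRL): A game-theoretic formalization, also known as Cooperative Inverse Reinforcement Learning, in which a human and an AI work together to achieve the human’s goal. Both are rewarded according to the human’s objective, but the AI does not know that objective at the start and must learn it from the interaction.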
Deference to Humans: Because the machine is uncertain, it has a positive incentive to listen to humans, ask for permission, and allow itself to be switched off (see Off-Switch Problem (AI)).
Provable Safety: By mathematically formalizing uncertainty about the objective, the framework aims to prove, under explicit assumptions about the human and the machine, that the machine remains beneficial even as its intelligence surpasses ours.
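
The deference argument can be checked numerically with a toy calculation in the spirit of the off-switch game. This is a sketch under stated assumptions, not a proof: the machine's belief about the utility U of a proposed action is represented by samples, the Gaussian belief is purely illustrative, and the human is assumed to be rational, permitting the action exactly when U > 0. The function name off_switch_values is hypothetical.

```python
import random


def off_switch_values(utility_samples):
    """Expected value of each option, given samples from the machine's belief over U."""
    n = len(utility_samples)
    # Act immediately, bypassing the human: the machine receives U.
    act = sum(utility_samples) / n
    # Switch itself off: nothing happens, value 0.
    switch_off = 0.0
    # Defer: an (assumed) rational human lets the action proceed only if U > 0.
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    return {"act": act, "switch_off": switch_off, "defer": defer}


rng = random.Random(0)
# Illustrative belief: the action is probably, but not certainly, good.
samples = [rng.gauss(0.2, 1.0) for _ in range(10_000)]
values = off_switch_values(samples)
print(values)

# With no uncertainty, deferring gains nothing over acting.
certain = off_switch_values([1.0, 1.0, 1.0])
print(certain)
```

Because max(U, 0) is at least as large as both U and 0 for every sample, deferring is never worse than acting or switching off under these assumptions, and it is strictly better whenever the belief puts weight on both positive and negative outcomes. If the human is not rational in this sense, the inequality need not hold.
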

Connected Concepts

Connected notes