Assistance Games

Definition

Assistance Games (also known as Cooperative Inverse Reinforcement Learning or CIRL) are a class of games that formalize the interaction between a human (who has preferences) and a robot (who wants to satisfy those preferences but is initially uncertain about them). They serve as the mathematical foundation for Stuart Russell’s approach to beneficial AI.
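
The definition can be stated more formally. The sketch below follows the tuple notation commonly given for CIRL games; none of these symbols are used elsewhere in this note, so treat it as an illustrative summary rather than an authoritative statement.

```latex
% A CIRL game: a two-player Markov game with identical payoffs
% between a human H and a robot R.
M = \langle S, \{A^H, A^R\}, T(\cdot \mid \cdot,\cdot,\cdot),
    \{\Theta, R(\cdot,\cdot,\cdot;\theta)\}, P_0(\cdot,\cdot), \gamma \rangle

% Both players maximize the same expected discounted reward,
% but only the human observes the reward parameter \theta.
\mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}
    R(s_t, a^H_t, a^R_t; \theta) \right]
```

Here S is the set of world states, A^H and A^R are the human's and robot's action sets, T is the transition distribution, Θ is the set of possible reward parameters, R is the shared reward function, P_0 is the initial distribution over states and reward parameters, and γ is the discount factor.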

Why It Matters

They offer a mathematical framework for approaching the AI alignment problem: the machine’s objective is tied to human preferences that it is uncertain about, rather than to a fixed goal. This humility-by-design is intended to keep even a superintelligent system from becoming a dangerous, single-minded optimizer.

Core Concepts

Structural Uncertainty: The robot does not have a fixed reward function of its own; its payoff is the human’s payoff, but it starts with only a prior distribution over what that payoff is and must refine this belief by observing the human.
Signaling and Teaching: In an assistance game, the human has an incentive to “signal” their preferences through their actions, and the robot has an incentive to interpret those signals so that it can be more helpful (e.g., the Paperclip Game).
Equilibrium Solutions: The behavior of the human and robot is characterized by a Nash equilibrium, a pair of strategies in which neither player can improve their outcome by unilaterally changing their own strategy. The robot’s “humility” and “deference” emerge from this solution rather than being hard-coded.
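The Off-Switch Game: A specific assistance game showing that a robot has an incentive to let a human switch it off when it is uncertain about the human’s objectives, provided it believes the human acts reasonably on those objectives. The human’s choice to switch it off is then evidence that the robot’s planned action was undesirable.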
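
The Off-Switch Game argument can be illustrated with a small calculation. The following is a minimal sketch, not the original authors' formulation: it assumes the robot's belief about the human's utility U for a proposed action is represented as a list of samples, that switching off is worth 0, and that the human is perfectly rational (allowing the action only when U > 0). The function name off_switch_values and the sample values are hypothetical.

```python
import statistics


def off_switch_values(utility_samples):
    """Expected payoff of each robot option under its belief about U.

    utility_samples: draws from the robot's belief over U, the human's
    utility for the proposed action (an assumed representation).
    """
    # Option 1: act immediately, bypassing the human. Payoff is E[U].
    act = statistics.fmean(utility_samples)

    # Option 2: switch itself off. Payoff is 0 by assumption.
    switch_off = 0.0

    # Option 3: defer to the human, who (assumed rational) permits the
    # action only when U > 0 and otherwise presses the off switch.
    # Payoff is E[max(U, 0)].
    defer = statistics.fmean([max(u, 0.0) for u in utility_samples])

    return {"act": act, "switch_off": switch_off, "defer": defer}


if __name__ == "__main__":
    # An uncertain robot: the action might help or harm the human.
    print(off_switch_values([-2.0, -1.0, 1.0, 3.0]))

    # A robot that is certain the action is good: deferring gains nothing.
    print(off_switch_values([1.0, 1.0]))
```

Because E[max(U, 0)] is never smaller than either E[U] or 0, deferring is at least as good as the other two options under these assumptions, and strictly better whenever the robot's belief puts weight on both positive and negative values of U. Once the robot is certain, the advantage of deferring disappears, which is why the uncertainty is essential to the result.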

Connected Concepts

Connected notes