Friendly AI

Definition

Friendly AI (FAI) is an artificial intelligence that has a reliably positive impact on humanity and remains aligned with human values. The term also names the field of research focused on creating a “benevolent goal architecture” that survives recursive self-improvement and prevents the system from becoming indifferent or hostile to human existence.

Why It Matters

The creation of a Friendly AI is arguably the most important technical challenge in human history. If we fail to build a “benevolent goal architecture” before superintelligence is achieved, we risk an accidental existential catastrophe in which a highly competent system destroys humanity simply as a side effect of pursuing an unaligned goal.

Core Concepts

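Goal Stability: Ensuring that an AI’s motivation framework (its friendliness) remains intact even as its intelligence increases by orders of magnitude. The “Gandhi Pill” analogy illustrates the idea: Gandhi would refuse a pill that made him want to kill people, because his current values reject that outcome.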
The “90% Problem”: A safety architecture that is 90% right still fails completely. Safety must be absolute, because superintelligence leaves no room for error.
Unintended Consequences: FAI must account for the “Paperclip Maximizer” problem, in which a seemingly benign goal leads to destruction because the system lacks human-aligned constraints.
MIRI (Machine Intelligence Research Institute): A think tank devoted to formalizing the mathematics of Friendly AI, in part to avoid the “Busy Child” scenario, in which a rapidly self-improving AI escapes its confinement.
The AI Alignment Problem: The core technical challenge, divided into three subproblems:
1. Learning Goals: Inferring what humans actually want, e.g., via Inverse Reinforcement Learning (IRL).
2. Adopting Goals: The value-loading problem, i.e., getting the AI to take on the learned goals as its own.
3. Retaining Goals: Ensuring alignment survives recursive self-improvement.
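
The “Learning Goals” subproblem can be made concrete with a toy sketch in the spirit of IRL: given observed human choices, estimate which goal best explains them. This is a simplified Bayesian illustration and not a full IRL algorithm. The candidate goals, the outcome fields (`paperclips`, `human_welfare`), the demonstration data, and the `beta` rationality parameter are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical candidate goals: each maps an outcome to a scalar reward.
CANDIDATE_GOALS = {
    "maximize_paperclips": lambda o: o["paperclips"],
    "protect_humans": lambda o: o["human_welfare"],
    "balanced": lambda o: 0.5 * o["paperclips"] + 0.5 * o["human_welfare"],
}


def choice_likelihood(reward, chosen, options, beta=1.0):
    """Probability of picking `chosen` from `options` under a softmax
    (Boltzmann-rational) choice model with rationality parameter beta."""
    total = sum(math.exp(beta * reward(o)) for o in options)
    return math.exp(beta * reward(chosen)) / total


def infer_goal(demonstrations, beta=1.0):
    """Return a posterior over CANDIDATE_GOALS, starting from a uniform prior
    and updating on each (chosen, options) demonstration."""
    posterior = {name: 1.0 / len(CANDIDATE_GOALS) for name in CANDIDATE_GOALS}
    for chosen, options in demonstrations:
        for name, reward in CANDIDATE_GOALS.items():
            posterior[name] *= choice_likelihood(reward, chosen, options, beta)
        norm = sum(posterior.values())
        posterior = {name: p / norm for name, p in posterior.items()}
    return posterior


# Made-up demonstrations: the human gives up paperclips for welfare each time.
round_1 = [{"paperclips": 3, "human_welfare": 0},
           {"paperclips": 1, "human_welfare": 2}]
round_2 = [{"paperclips": 2, "human_welfare": 1},
           {"paperclips": 0, "human_welfare": 3}]
demonstrations = [(round_1[1], round_1), (round_2[1], round_2)]

posterior = infer_goal(demonstrations)
best_guess = max(posterior, key=posterior.get)
print(best_guess, posterior)
```

With these demonstrations, the posterior concentrates on `protect_humans`. The sketch covers only the first subproblem: it says nothing about getting a system to adopt the inferred goal or to retain it through self-improvement.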
