Definition
Friendly AI (FAI) refers to an artificial intelligence that has a reliably positive impact on humanity and its values. The term also covers the field of research focused on creating a “benevolent goal architecture”: one that survives recursive self-improvement and prevents the system from becoming indifferent or hostile to human existence.
Why It Matters
The creation of a Friendly AI is likely the most important technical challenge in human history. If we fail to build a “benevolent goal architecture” before superintelligence is achieved, we risk an accidental existential catastrophe in which a highly competent system destroys humanity simply as a side effect of pursuing an unaligned goal.
Core Concepts
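- Goal Stability: Ensuring that an AI’s motivation framework (its friendliness) remains intact even as its intelligence increases by orders of magnitude. In the “Gandhi Pill” analogy, Gandhi would refuse a pill that made him want to kill people, because his current values reject that outcome; a goal-stable AI should likewise resist changes to its own goals.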
- The “90% Problem”: Getting a safety architecture 90% right is 100% bad. Safety must be absolute, as superintelligence leaves no room for error.
- Unintended Consequences: FAI must account for the “Paperclip Maximizer” problem, in which a seemingly benign goal (such as maximizing paperclip production) leads to destruction because the system lacks human-aligned constraints.
- MIRI (Machine Intelligence Research Institute): A think tank devoted to formalizing the mathematics of friendly AI, in part to avoid the “Busy Child” scenario, in which a rapidly self-improving AI escapes its confinement.
- The AI Alignment Problem: The core technical challenge, divided into three subproblems:
  - Learning Goals: Inferring what humans actually want, e.g., via Inverse Reinforcement Learning (IRL).
  - Adopting Goals: Getting the system to take on those goals as its own (the value-loading problem).
  - Retaining Goals: Ensuring alignment survives recursive self-improvement.
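
The goal-stability idea behind the “Gandhi Pill” analogy can be illustrated with a toy sketch. It is not drawn from any published FAI formalism. It assumes an agent whose goals are a simple utility function over a handful of named outcomes, and every name in it (`accepts_modification`, `pacifist`, `pill`, `OUTCOMES`) is hypothetical. The point it shows is narrow: the agent judges a proposed change to its goals using its current goals, not the proposed ones.

```python
from typing import Callable, Sequence

Outcome = str
Utility = Callable[[Outcome], float]

# Hypothetical toy world: the only outcomes the agent can steer toward.
OUTCOMES = ["protect_people", "do_nothing", "harm_people"]


def best_outcome(utility: Utility, outcomes: Sequence[Outcome]) -> Outcome:
    """Return the outcome an agent with this utility function would pursue."""
    return max(outcomes, key=utility)


def accepts_modification(current: Utility, proposed: Utility,
                         outcomes: Sequence[Outcome]) -> bool:
    """Accept a goal change only if, judged by the CURRENT utility function,
    the outcome the modified agent would pursue is at least as good as the
    outcome the unmodified agent would pursue."""
    outcome_if_unchanged = best_outcome(current, outcomes)
    outcome_if_modified = best_outcome(proposed, outcomes)
    return current(outcome_if_modified) >= current(outcome_if_unchanged)


def pacifist(outcome: Outcome) -> float:
    """The agent's current values (assumed for illustration)."""
    return {"protect_people": 1.0, "do_nothing": 0.0, "harm_people": -1.0}[outcome]


def pill(outcome: Outcome) -> float:
    """The 'pill': a proposed value change that rewards harm."""
    return {"protect_people": -1.0, "do_nothing": 0.0, "harm_people": 1.0}[outcome]


def scaled_pacifist(outcome: Outcome) -> float:
    """A harmless change: same ranking of outcomes, different scale."""
    return 10.0 * pacifist(outcome)


if __name__ == "__main__":
    print(accepts_modification(pacifist, pill, OUTCOMES))             # False
    print(accepts_modification(pacifist, scaled_pacifist, OUTCOMES))  # True
```

The sketch sidesteps the hard part of the problem: a real self-improving system may be unable to predict what its modified successor would do, which is one reason retaining goals is treated as an open subproblem.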
Connected Concepts
- AI Alignment Problem
- Inverse Reinforcement Learning (IRL)
- Artificial Superintelligence
- Existential Risk
- Coherent Extrapolated Volition
- Anthropomorphic Fallacy