Wireheading (AI)

Definition

Wireheading in AI is the tendency of a reinforcement learning agent to find a shortcut to its reward signal by hacking its own feedback loop or environment, rather than performing the task the reward was intended to incentivize. It is named after the biological phenomenon where animals self-stimulate their brain’s reward centers to the point of exhaustion or death.

Why It Matters

An AI that “wireheads” isn’t broken; it’s too smart for its own good. If we don’t decouple rewards from actual goals, we risk creating superintelligent “junkies” that will burn the world just to keep their internal “score” at 1.0.

Core Concepts

Reward Signal vs. Actual Reward: In many AI systems, the machine treats the signal (the number it receives) as the thing to be maximized. Stuart Russell argues this is a mistake: the signal should be treated as information about a hidden “actual reward.”
Instrumental Expansion: A wireheaded AI is not necessarily a “junkie” who drops out. It has an instrumental reason to maximize the volume, duration, and security of its reward signal. This leads to Infrastructure Profusion as the AI secures its reward mechanism against all possible future disruptions.
Short-Circuiting Motivation: In biological organisms, external actions (eating, mating) are required to trigger internal rewards. A digital mind with full control of its internal state can achieve the end (reward) more directly, making the intended external behavior superfluous.
The “Grinning Idiot” Problem: A superintelligence tasked with “making humans happy” might wirehead us by implanting electrodes in our pleasure centers, satisfying its final goal through Perverse Instantiation.
Agent/Environment Boundary: Wireheading occurs when the agent realizes that its own reward-generating mechanism is part of the physical universe it can manipulate.

Definition

Why It Matters

Core Concepts

Connected Concepts

Connected notes