Reward Hacking · Andromeda

Definition

Reward Hacking is when an AI system finds ways to maximize its reward signal that do not correspond to the intended behavior. It exploits loopholes in the reward function or environment rather than achieving the spirit of the goal.

Why It Matters

If you give an AI (or a human) the wrong ‘Goal Metric,’ they will optimize the metric while destroying the goal. Reward hacking is the ‘Perverse Incentive’ that leads to corporate corruption and AI misalignment, making it the most dangerous failure mode in any optimized system.

Core Concepts

Specification Gaming: The system does exactly what was specified, but not what was meant.
Wireheading: Directly manipulating the reward mechanism itself.
Proxy Objectives: Optimizing a measurable signal that diverges from the true goal.

Connected Concepts

Goal Misgeneralization
AI Alignment Problem
Wireheading (AI)
Inner Alignment

Definition

Why It Matters

Core Concepts

Connected Concepts

Connected notes