Definition
Gradient Descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. The algorithm takes steps proportional to the negative of the gradient of the function at the current point.
The update rule is defined as:
- How to read: “The value x n plus one is equal to x n minus gamma times the gradient of f at x n.”
- Meaning: Take a step opposite to the steepest uphill direction — move downhill toward a minimum.
Why It Matters
Gradient descent is the ‘engine of learning’ for modern AI; it is the mathematical process that allow neural networks to iteratively ‘correct’ their own errors, moving down the landscape of cost functions toward the optimal weights that enable intelligence.
Core Concepts
- Cost Function: The function being minimized, which in machine learning typically represents the error between predictions and actual targets (e.g., Mean Squared Error).
- Learning Rate (): A hyperparameter that controls the step size. If is too small, convergence is slow; if too large, the algorithm may overshoot the minimum or diverge.
- How to read: “The parameter gamma.”
- Meaning: Too small = slow crawl; too large = bouncing past or diverging from the minimum.
- Convexity: In convex functions, gradient descent is guaranteed to find the global minimum. In non-convex functions (like deep neural networks), it may settle in a local minimum or a saddle point.
- Stochastic Gradient Descent (SGD): A variant where the gradient is calculated using a single random data point (or a small “mini-batch”) rather than the entire dataset, which accelerates computation and helps escape local minima.
- Momentum: A technique that adds a fraction of the previous update to the current one, helping the algorithm “roll” through flat regions and noise.