Gradient Descent

Definition

Gradient Descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. The algorithm takes steps proportional to the negative of the gradient of the function at the current point.

The update rule is defined as: $x_{n+1} = x_n - \gamma \nabla f(x_n)$

How to read: “The value x n plus one is equal to x n minus gamma times the gradient of f at x n.”
Meaning: Take a step opposite to the steepest uphill direction — move downhill toward a minimum.

Why It Matters

Gradient descent is the ‘engine of learning’ for modern AI; it is the mathematical process that allow neural networks to iteratively ‘correct’ their own errors, moving down the landscape of cost functions toward the optimal weights that enable intelligence.

Core Concepts

Cost Function: The function $f(x)$ being minimized, which in machine learning typically represents the error between predictions and actual targets (e.g., Mean Squared Error).
Learning Rate ( $\gamma$ ): A hyperparameter that controls the step size. If $\gamma$ $γ$ is too small, convergence is slow; if too large, the algorithm may overshoot the minimum or diverge.
- How to read: “The parameter gamma.”
- Meaning: Too small $\gamma$ = slow crawl; too large = bouncing past or diverging from the minimum.
Convexity: In convex functions, gradient descent is guaranteed to find the global minimum. In non-convex functions (like deep neural networks), it may settle in a local minimum or a saddle point.
Stochastic Gradient Descent (SGD): A variant where the gradient is calculated using a single random data point (or a small “mini-batch”) rather than the entire dataset, which accelerates computation and helps escape local minima.
Momentum: A technique that adds a fraction of the previous update to the current one, helping the algorithm “roll” through flat regions and noise.

Definition

Why It Matters

Core Concepts

Connected Concepts

Connected notes