Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Gradient Descent

Open in Colab

Basic Terminology

You have already seen the first example of machine learning as decision trees. Now its time for ML with neural networks. Before going into the details of it, let us first understand the idea of “gradient descent”. Let us first understand three basic terminology of ML

Introduction

The method of Gradient Descent is a fundamental optimization algorithm used in machine learning. Students can modify code cells and visualize how learning rate, number of iterations, and function choice affect convergence.

Why?

For most applications of machine learning, the final goal boils down to optimising the *loss function* given a *dataset* and a *model*.

What?

At its core, gradient descent is a method of reaching to the extrema of a given function.



The Gradient or slope measures how a function changes with regards to small changes in its parameters. Example: For a function f(x)=x2f(x)=x^2 its slope is f(x)df(x)dx=2xf'(x)\equiv \frac{d f(x)}{dx}= 2x.

Imagine you are standing on a smooth mountain at night and you need to go to the bottom of the mountain for shelter quickly then

  • you investigate the slope (the gradient) at your feet in various directions.

  • you take a small step downhill in the direction of the steepest descent.

  • do the same until you reach the bottom.

How

More mathematically, for a differentiable loss function L(θ)\mathcal{L}(\theta), the gradient descent update is

θt+1=θtαθ(L(θ))\theta_{t+1}=\theta_t-\alpha \nabla_\theta(\mathcal{L}(\theta))

which should be repeated until the the θ\theta values converge or not change much as the steps (denoted by tt) evolve. In the formula above

  • α\alpha is called the learning rate, which define the step size.

  • θt\theta_{t} denotes the value of the internal parameters in the current step and θt+1\theta_{t+1} tells you what should be the value of the internal parameters on the next step if one follows the gradient descent method.

  • θ(L(θ))\nabla_\theta(\mathcal{L}(\theta)) symbolises the gradient of the loss function with respect to various internal parameters.

Gradient Descent Implementation

At first let us introduce some packages

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from ipywidgets import interact, FloatSlider, IntSlider

%matplotlib inline

1. Quadratic Function and Its Gradient

f(θ)=θ2f(\theta) = \theta^2
f(θ)=2θf'(\theta) = 2\theta
def func(theta):
    return theta**2

def grad_f(theta):
    return 2*theta

Gradient Descent Implementation

Update rule:

θt+1=θtαf(θt)\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)
def gradient_descent(start, lr, iterations):
    theta = start
    path = [theta]
    for i in range(iterations):
        theta = theta - lr * grad_f(theta)
        path.append(theta)
    return np.array(path)

Visualizing descent on the Quadratic

def plot_gd(lr=0.1, steps=20, start=5.0):
    steps = int(steps)
    path = gradient_descent(start=start, lr=lr, iterations=steps)
    xs = np.linspace(-6, 6, 400)
    ys = func(xs)

    # Compute convergence step
    convergence_eps = 1e-3
    converged_at = next((i for i in range(1, len(path)) if abs(path[i] - path[i-1]) < convergence_eps), None)

    plt.figure(figsize=(8, 4))
    plt.plot(xs, ys, label='$f(x)=x^2$', color='gray', alpha=0.5)

    plt.plot(path, func(path), 'o', color='red', label='Descent Steps')

    for i in range(len(path)-1):
        x0, x1 = path[i], path[i+1]
        y0, y1 = func(x0), func(x1)
        plt.annotate(
            '', xy=(x1, y1), xytext=(x0, y0),
            arrowprops=dict(arrowstyle='->', color='red', lw=1.5)
        )

    title = f'Gradient Descent Path)'
    if converged_at is not None:
        title += f"\nConverged at step {converged_at}"
    else:
        title += "\nDid not converge within given steps"

    plt.title(title)
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.legend()
    plt.grid(True)
    plt.show()




interact(plot_gd, lr=FloatSlider(value=0.1, min=0.01, max=1.0, step=0.01), steps=IntSlider(value=20, min=1, max=100, step=1), start=FloatSlider(value=8.0, min=0.1, max=9.0, step=0.1))
Loading...
<function __main__.plot_gd(lr=0.1, steps=20, start=5.0)>

❓ Exercise

Q7: Use sliders to adjust the learning rate and number of steps, and observe what happens to the convergence rate for very high or very low learning rate.

Click to show answer

Answer:

  • if \alpha is too small -> very slow convergence

  • if \alpha is too big -> might oscillate or diverge

Types of Gradient Descent

There are three popular formats of gradient descent

Batch Gradient Descent

Computes the gradient of the loss function using the entire dataset.

  • Pros: Stable and accurate updates; good for convex functions.

  • Cons: Very slow on large datasets; high memory usage.

Stochastic Gradient Descent

Updates parameters using only one random sample at each step and introduces some noise

  • Pros: Fast and can escape local minima; low memory footprint.

  • Cons: Noisy updates lead to fluctuations; may not converge smoothly.

Mini-batch Gradient Descent

Computes the gradient using a small random subset (mini-batch) of the data.

  • Pros: Combines speed of SGD with stability of batch GD; suitable for GPUs.

  • Cons: Still introduces some noise; batch size selection is critical.

All these variations are trade-offs between speed, stability, and resource usage, and the choice depends on the dataset size and hardware constraints.

SGD vs BGD: 2D Quadratic Function

f(θ1,θ2)=(θ12)2+(θ22)2f(\theta_1, \theta_2) = (\theta_1-2)^2 + (\theta_2-2)^2

# Define the quadratic loss function and its gradient
def loss(theta):
    """Quadratic loss: (θ1 - 2)^2 + (θ2 - 3)^2"""
    return (theta[0] - 2)**2 + (theta[1] - 3)**2

def grad_loss(theta):
    """Exact gradient of the loss"""
    return np.array([2*(theta[0] - 2), 2*(theta[1] - 3)])

# Batch Gradient Descent
def batch_gd(start, lr, steps):
    theta = np.array(start, dtype=float)
    path = [theta.copy()]
    for _ in range(steps):
        theta -= lr * grad_loss(theta)
        path.append(theta.copy())
    return np.array(path)

# Stochastic Gradient Descent (adds Gaussian noise to gradient)
def sgd(start, lr, steps, noise_scale):
    theta = np.array(start, dtype=float)
    path = [theta.copy()]
    for _ in range(steps):
        g = grad_loss(theta)
        g += np.random.randn(2) * noise_scale
        theta -= lr * g
        path.append(theta.copy())
    return np.array(path)


def plot_descent(lr=0.1, steps=30, noise=0.5, start1=5.0, start2=5.0, elev=45, azim=135):
    # Compute trajectories
    bd_path = batch_gd([start1, start2], lr, int(steps))
    sd_path = sgd([start1, start2], lr, int(steps), noise)
    
    # Dynamically choose plotting range around data
    # include both start and optimum (2,3)
    mins = np.min([[start1, start2], [2, 3]], axis=0) - 1
    maxs = np.max([[start1, start2], [2, 3]], axis=0) + 1

    A = np.linspace(mins[0], maxs[0], 100)
    B = np.linspace(mins[1], maxs[1], 100)
    AA, BB = np.meshgrid(A, B)
    ZZ = (AA - 2)**2 + (BB - 3)**2

    # Plot
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(AA, BB, ZZ, cmap='viridis', alpha=0.6, edgecolor='none')
    
    ax.plot(bd_path[:,0], bd_path[:,1], [loss(p) for p in bd_path],
            'ro--', label='Batch GD', lw=2)
    ax.plot(sd_path[:,0], sd_path[:,1], [loss(p) for p in sd_path],
            'bx--', label='SGD', lw=2)
    
    ax.set_title(f"GD on f=(θ₁-2)²+(θ₂-3)²\n"
                 f"lr={lr}, steps={steps}, noise={noise}")
    ax.set_xlabel('θ₁'); ax.set_ylabel('θ₂'); ax.set_zlabel('Loss')
    ax.legend()

    # apply interactive view angles
    ax.view_init(elev=elev, azim=azim)
    plt.show()

# Now include elev/azim sliders in the interact call
interact(
    plot_descent,
    lr=FloatSlider(value=0.1, min=0.001, max=1.0, step=0.001, description='Learning rate'),
    steps=IntSlider(value=30, min=1, max=100, step=1, description='Steps'),
    noise=FloatSlider(value=0.5, min=0.0, max=2.0, step=0.05, description='SGD noise'),
    start1=FloatSlider(value=5.0, min=-5.0, max=10.0, step=0.5, description='θ₁ start'),
    start2=FloatSlider(value=5.0, min=-5.0, max=10.0, step=0.5, description='θ₂ start'),
    elev=IntSlider(value=45, min=0, max=90, step=5, description='Elevation'),
    azim=IntSlider(value=135, min=-180, max=180, step=15, description='Azimuth'),
)
Loading...
<function __main__.plot_descent(lr=0.1, steps=30, noise=0.5, start1=5.0, start2=5.0, elev=45, azim=135)>

Further study

  • Study adaptive optimizers (AdaGrad, RMSProp, Adam)

  • Dive into neural networks