Gradient Descent - A minicourse in ML for Physicist

Basic Terminology¶

You have already seen the first example of machine learning as decision trees. Now its time for ML with neural networks. Before going into the details of it, let us first understand the idea of “gradient descent”. Let us first understand three basic terminology of ML

Loss Function

In principle, you are free to choose and try-out different functions $f$ with different internal parameters to fit a particular data. The question of goodness or badness of the function is determined by another function, called the Loss function.

It determines how well the the model

y=f(x;\theta)

(3)

predicts the data ( $x_i$ , $y_i$ ). One common example would be

\mathcal{L}(\theta)=\sum_i(y_i-f(x_i;\theta))^2

(4)

dubbed as the mean square error. There exists several different kinds of loss functions depending on the scope and aim of the model that we are working with. In all of them the basic target is to methodically change the internal parameters $\theta$ of the model $f(x;\theta)$ such that the the loss function $\mathcal{L}(\theta)$ is optimised.

Introduction¶

The method of Gradient Descent is a fundamental optimization algorithm used in machine learning. Students can modify code cells and visualize how learning rate, number of iterations, and function choice affect convergence.

Why?¶

For most applications of machine learning, the final goal boils down to optimising the *loss function* given a *dataset* and a *model*.

What?¶

At its core, gradient descent is a method of reaching to the extrema of a given function.

The Gradient or slope measures how a function changes with regards to small changes in its parameters. Example: For a function $f(x)=x^2$ its slope is $f'(x)\equiv \frac{d f(x)}{dx}= 2x$ .

Imagine you are standing on a smooth mountain at night and you need to go to the bottom of the mountain for shelter quickly then

you investigate the slope (the gradient) at your feet in various directions.
you take a small step downhill in the direction of the steepest descent.
do the same until you reach the bottom.

How¶

More mathematically, for a differentiable loss function $\mathcal{L}(\theta)$ , the gradient descent update is

\theta_{t+1}=\theta_t-\alpha \nabla_\theta(\mathcal{L}(\theta))

(5)

which should be repeated until the the $\theta$ values converge or not change much as the steps (denoted by $t$ ) evolve. In the formula above

$\alpha$ is called the learning rate, which define the step size.
$\theta_{t}$ denotes the value of the internal parameters in the current step and $\theta_{t+1}$ tells you what should be the value of the internal parameters on the next step if one follows the gradient descent method.
$\nabla_\theta(\mathcal{L}(\theta))$ symbolises the gradient of the loss function with respect to various internal parameters.

Gradient Descent Implementation¶

At first let us introduce some packages

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from ipywidgets import interact, FloatSlider, IntSlider

%matplotlib inline

1. Quadratic Function and Its Gradient¶

f(\theta) = \theta^2

(6)

f'(\theta) = 2\theta

(7)

def func(theta):
    return theta**2

def grad_f(theta):
    return 2*theta

Gradient Descent Implementation¶

Update rule:

\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)

(8)

def gradient_descent(start, lr, iterations):
    theta = start
    path = [theta]
    for i in range(iterations):
        theta = theta - lr * grad_f(theta)
        path.append(theta)
    return np.array(path)

Visualizing descent on the Quadratic¶

def plot_gd(lr=0.1, steps=20, start=5.0):
    steps = int(steps)
    path = gradient_descent(start=start, lr=lr, iterations=steps)
    xs = np.linspace(-6, 6, 400)
    ys = func(xs)

    # Compute convergence step
    convergence_eps = 1e-3
    converged_at = next((i for i in range(1, len(path)) if abs(path[i] - path[i-1]) < convergence_eps), None)

    plt.figure(figsize=(8, 4))
    plt.plot(xs, ys, label='$f(x)=x^2$', color='gray', alpha=0.5)

    plt.plot(path, func(path), 'o', color='red', label='Descent Steps')

    for i in range(len(path)-1):
        x0, x1 = path[i], path[i+1]
        y0, y1 = func(x0), func(x1)
        plt.annotate(
            '', xy=(x1, y1), xytext=(x0, y0),
            arrowprops=dict(arrowstyle='->', color='red', lw=1.5)
        )

    title = f'Gradient Descent Path)'
    if converged_at is not None:
        title += f"\nConverged at step {converged_at}"
    else:
        title += "\nDid not converge within given steps"

    plt.title(title)
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.legend()
    plt.grid(True)
    plt.show()




interact(plot_gd, lr=FloatSlider(value=0.1, min=0.01, max=1.0, step=0.01), steps=IntSlider(value=20, min=1, max=100, step=1), start=FloatSlider(value=8.0, min=0.1, max=9.0, step=0.1))

Loading...

<function __main__.plot_gd(lr=0.1, steps=20, start=5.0)>

❓ Exercise¶

Q7: Use sliders to adjust the learning rate and number of steps, and observe what happens to the convergence rate for very high or very low learning rate.

Types of Gradient Descent¶

There are three popular formats of gradient descent

Batch Gradient Descent¶

Computes the gradient of the loss function using the entire dataset.

Pros: Stable and accurate updates; good for convex functions.

Cons: Very slow on large datasets; high memory usage.

Stochastic Gradient Descent¶

Updates parameters using only one random sample at each step and introduces some noise

Pros: Fast and can escape local minima; low memory footprint.

Cons: Noisy updates lead to fluctuations; may not converge smoothly.

Mini-batch Gradient Descent¶

Computes the gradient using a small random subset (mini-batch) of the data.

Pros: Combines speed of SGD with stability of batch GD; suitable for GPUs.

Cons: Still introduces some noise; batch size selection is critical.

All these variations are trade-offs between speed, stability, and resource usage, and the choice depends on the dataset size and hardware constraints.

SGD vs BGD: 2D Quadratic Function¶

f(\theta_1, \theta_2) = (\theta_1-2)^2 + (\theta_2-2)^2

(9)


# Define the quadratic loss function and its gradient
def loss(theta):
    """Quadratic loss: (θ1 - 2)^2 + (θ2 - 3)^2"""
    return (theta[0] - 2)**2 + (theta[1] - 3)**2

def grad_loss(theta):
    """Exact gradient of the loss"""
    return np.array([2*(theta[0] - 2), 2*(theta[1] - 3)])

# Batch Gradient Descent
def batch_gd(start, lr, steps):
    theta = np.array(start, dtype=float)
    path = [theta.copy()]
    for _ in range(steps):
        theta -= lr * grad_loss(theta)
        path.append(theta.copy())
    return np.array(path)

# Stochastic Gradient Descent (adds Gaussian noise to gradient)
def sgd(start, lr, steps, noise_scale):
    theta = np.array(start, dtype=float)
    path = [theta.copy()]
    for _ in range(steps):
        g = grad_loss(theta)
        g += np.random.randn(2) * noise_scale
        theta -= lr * g
        path.append(theta.copy())
    return np.array(path)


def plot_descent(lr=0.1, steps=30, noise=0.5, start1=5.0, start2=5.0, elev=45, azim=135):
    # Compute trajectories
    bd_path = batch_gd([start1, start2], lr, int(steps))
    sd_path = sgd([start1, start2], lr, int(steps), noise)
    
    # Dynamically choose plotting range around data
    # include both start and optimum (2,3)
    mins = np.min([[start1, start2], [2, 3]], axis=0) - 1
    maxs = np.max([[start1, start2], [2, 3]], axis=0) + 1

    A = np.linspace(mins[0], maxs[0], 100)
    B = np.linspace(mins[1], maxs[1], 100)
    AA, BB = np.meshgrid(A, B)
    ZZ = (AA - 2)**2 + (BB - 3)**2

    # Plot
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(AA, BB, ZZ, cmap='viridis', alpha=0.6, edgecolor='none')
    
    ax.plot(bd_path[:,0], bd_path[:,1], [loss(p) for p in bd_path],
            'ro--', label='Batch GD', lw=2)
    ax.plot(sd_path[:,0], sd_path[:,1], [loss(p) for p in sd_path],
            'bx--', label='SGD', lw=2)
    
    ax.set_title(f"GD on f=(θ₁-2)²+(θ₂-3)²\n"
                 f"lr={lr}, steps={steps}, noise={noise}")
    ax.set_xlabel('θ₁'); ax.set_ylabel('θ₂'); ax.set_zlabel('Loss')
    ax.legend()

    # apply interactive view angles
    ax.view_init(elev=elev, azim=azim)
    plt.show()

# Now include elev/azim sliders in the interact call
interact(
    plot_descent,
    lr=FloatSlider(value=0.1, min=0.001, max=1.0, step=0.001, description='Learning rate'),
    steps=IntSlider(value=30, min=1, max=100, step=1, description='Steps'),
    noise=FloatSlider(value=0.5, min=0.0, max=2.0, step=0.05, description='SGD noise'),
    start1=FloatSlider(value=5.0, min=-5.0, max=10.0, step=0.5, description='θ₁ start'),
    start2=FloatSlider(value=5.0, min=-5.0, max=10.0, step=0.5, description='θ₂ start'),
    elev=IntSlider(value=45, min=0, max=90, step=5, description='Elevation'),
    azim=IntSlider(value=135, min=-180, max=180, step=15, description='Azimuth'),
)

Loading...

<function __main__.plot_descent(lr=0.1, steps=30, noise=0.5, start1=5.0, start2=5.0, elev=45, azim=135)>

Further study¶

Study adaptive optimizers (AdaGrad, RMSProp, Adam)
Dive into neural networks