Gradient descent

Now that we have the empirical risk \(\mathcal{L}_\mathcal{D}(\boldsymbol{\Theta})\) as our objective, we turn to the actual optimization algorithm. Given the current parameters \(\boldsymbol{\Theta} \in \mathbb R^M\) of the network, we can picture the network as sitting at a point \((\boldsymbol{\Theta}, \mathcal L_{\mathcal{D}}(\boldsymbol{\Theta}))\) on a surface in \(\mathbb R^M \times \mathbb R.\) This surface generally varies across different samples of the training data. Training amounts to finding a minimum of this surface. Gradients enter deep learning through the following first-order approximation of the change in loss under a small parameter update \(\Delta \boldsymbol{\Theta}\):

\[\Delta \mathcal L_{\mathcal{D}} \approx \sum_k \left(\frac{\partial \mathcal L_{\mathcal{D}}}{ \partial {\Theta}_k} \right) \Delta {\Theta}_k = \left( \nabla_{\boldsymbol{\Theta}}\, \mathcal L_{\mathcal{D}} \right) \cdot \Delta {\boldsymbol{\Theta}}.\]
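As a quick sanity check of this approximation, we can compare the actual change in a loss against the dot product of its gradient with a small parameter perturbation. The sketch below uses a toy quadratic loss \(f(\mathbf w) = \lVert \mathbf w \rVert^2\) with gradient \(2\mathbf w\) (an illustrative choice, not the loss used later in this section):

import numpy as np

# Toy loss f(w) = ||w||^2 and its gradient 2w (illustrative example only).
def f(w):
    return np.sum(w**2)

def grad_f(w):
    return 2 * w

w = np.array([1.0, -2.0, 0.5])       # current parameters
dw = 1e-3 * np.random.randn(3)       # small perturbation Delta Theta

exact = f(w + dw) - f(w)             # actual change in the loss
approx = grad_f(w) @ dw              # first-order estimate (grad f) . dw
print(exact, approx)                 # agree up to a second-order error ~ ||dw||^2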

Since this dot product is most negative when \(\Delta \boldsymbol{\Theta}\) points opposite to the gradient (Cauchy–Schwarz), it follows that \(-\nabla_{\boldsymbol{\Theta}}\, \mathcal L_{\mathcal{D}}\) is the direction of steepest descent at the current point \((\boldsymbol{\Theta}, \mathcal L_{\mathcal{D}}(\boldsymbol{\Theta}))\) on the surface. Gradient descent (GD) is defined by the update rule:

\[ \boldsymbol{\Theta}^{t+1} = \boldsymbol{\Theta}^t - \eta\; \nabla_{\boldsymbol{\Theta}}\, \mathcal L_{\mathcal{D}}(\boldsymbol{\Theta}^t) \]

where the learning rate \(\eta > 0\) controls the step size. Note that finding a good set of initial weights \(\boldsymbol{\Theta}^0 \in \mathbb{R}^M\) is crucial, since networks have many internal symmetries and training can diverge in its early stages. Later on we will see that the other optimization algorithms used in deep learning practice are essentially modifications of GD.
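To make the update rule concrete, consider the linear model and MSE loss used in the code below. For a data matrix \(X \in \mathbb R^{n \times 2}\), targets \(\mathbf y \in \mathbb R^n\), and weights \(\mathbf w \in \mathbb R^2\), the empirical risk and its gradient are

\[\mathcal L_{\mathcal D}(\mathbf w) = \frac{1}{n}\, \lVert X \mathbf w - \mathbf y \rVert^2, \qquad \nabla_{\mathbf w}\, \mathcal L_{\mathcal D}(\mathbf w) = \frac{2}{n}\, X^\top (X \mathbf w - \mathbf y),\]

which is what the `grad` function below computes, before rescaling the gradient to unit length.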

import numpy as np

def loss(w, X, y):
    """MSE loss of the linear model X @ w against the targets y."""
    return ((X @ w - y)**2).mean()

def grad(w, X, y):
    """Gradient of the MSE loss, rescaled to unit length."""
    dw = 2*((X @ w - y).reshape(-1, 1) * X).mean(axis=0)
    return dw / np.linalg.norm(dw)  # unit norm, so eta alone sets the step size

def grad_descent(w0, X, y, eta=0.1, steps=10):
    """Return the sequence of weights visited by GD."""
    w = np.zeros((steps, len(w0)))
    w[0, :] = w0
    for j in range(1, steps):
        u = w[j-1, :]
        w[j, :] = u - eta * grad(u, X, y)
    return w


# Generate data
n = 1000
X = np.zeros((n, 2))
X[:, 1] = np.random.uniform(low=-1, high=1, size=n)
X[:, 0] = 1
w_min = np.array([-1, 3])
y = (X @ w_min) + 0.05 * np.random.randn(n)  # data: y = -1 + 3x + noise

# Gradient descent
w_init = [-4, -4]
w_step = grad_descent(w_init, X, y, eta=1.0, steps=30)
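Because `grad` rescales the gradient to unit length, every update moves a fixed distance \(\eta\) in parameter space: the iterates head toward the minimizer and then hop around it rather than settling exactly. A quick check (exact numbers depend on the random noise) is to print the last iterate next to the true weights:

print("final GD iterate:", w_step[-1])
print("true weights:    ", w_min)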
import matplotlib.pyplot as plt
from matplotlib_inline import backend_inline
backend_inline.set_matplotlib_formats("svg")
from mpl_toolkits.mplot3d import Axes3D  # For 3D plotting

def plot_surface(ax, data, target, N=50):
    x = np.linspace(-5, 5, N)
    y = np.linspace(-5, 5, N)
    X, Y = np.meshgrid(x, y)
    Z = np.zeros_like(X)
    for i in range(N):
        for j in range(N):
            Z[i, j] = loss(np.array([X[i, j], Y[i, j]]), data, target)
    
    ax.plot_surface(X, Y, Z, cmap='Reds')
    ax.set_title("Loss surface")
    ax.set_xlabel('$w_0$')
    ax.set_ylabel('$w_1$')
    

def plot_contourf(ax, data, target, w_min, w_hist, N=50):
    x = np.linspace(-5, 5, N)
    y = np.linspace(-5, 5, N)
    X, Y = np.meshgrid(x, y)
    Z = np.zeros_like(X)
    for i in range(N):
        for j in range(N):
            Z[i, j] = loss(np.array([X[i, j], Y[i, j]]), data, target)

    for t in range(1, len(w_hist)):
        ax.plot([w_hist[t-1][0], w_hist[t][0]], [w_hist[t-1][1], w_hist[t][1]], color='k')

    c = ax.contourf(X, Y, Z, levels=20, cmap='Reds')
    ax.scatter(w_min[0], w_min[1], color="red", label='min', zorder=3)
    ax.scatter(w_hist[:, 0], w_hist[:, 1], marker='o', s=5, facecolors='k')
    ax.set_title("Gradient descent steps")
    ax.set_xlabel('$w_0$')
    ax.set_ylabel('$w_1$')
    plt.colorbar(c, ax=ax)

# Create a figure and two subplots
fig = plt.figure(figsize=(8, 4))
ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122)

# Call the functions with the respective axes
plot_surface(ax1, X, y)
plot_contourf(ax2, X, y, w_min, w_step)
plt.tight_layout()
plt.show()

Fig. Loss surface (left) and gradient descent trajectory (right) for a linear model with MSE loss.