Visualizing The Loss Landscape

What is a Loss Landscape? A visual exploration

In this post I will explain what a loss landscape is and how it relates to training a model to make predictions. Creating these visualizations after reading "Visualizing the loss landscape of neural networks" helped me understand that paper and appreciate the complexity of training neural networks. Hopefully this post will give you some useful intuitions about machine learning and allow you to thoroughly enjoy Javier Ideami's nice loss landscape images.

Figure 1.) Model 3's loss landscape for the XOR dataset from post 1.

In my first post we saw how models make predictions, but skipped over how they are trained. If you haven't read that post yet, I recommend that you do. This post will make use of the same models and datasets.

What is Loss?

When a model makes predictions, we need some way to quantify how good those predictions are.

What's wrong with accuracy? It's a common and familiar metric. If a model makes 10 predictions and gets 9 of them correct, that is an accuracy of 90%. But what if the desired predictions are either 0 or 1 (with a decision boundary of 0.5)? For a desired value of 1, one model might predict 0.6 and another might predict 0.99. Both models would be correct (above 0.5) for that example, but in order to train, we need some way to say that the second model made the better prediction.

We need a loss function that measures how close the desired output $y$ and the model output $\hat y$ are to each other. As the model output gets closer to the desired output, the loss should decrease until the values match. The process of changing a model to minimize the loss is called optimization. In machine learning this is typically done with some variant of stochastic gradient descent (SGD), which won't be the focus of this post.

$$ \begin{equation} \mathcal{L} = (y - \hat y)^2 \end{equation} $$

Loss Function.) Squared loss, or squared error, one of many common loss functions.

Figure 2.) Loss vs. difference in outputs ($y - \hat y$).

Figure 2 shows how the loss changes as the predictions get further away from the desired outputs. For example, if the desired output was 0 and the prediction was 2, that would result in a loss value of 4. For this post the loss values used to construct the landscapes will be an average of the loss for all examples in the dataset.
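To make the arithmetic concrete, here is a minimal sketch of the squared loss and its average over a dataset. The function names and the small example arrays are my own illustrations, not the code used to make the figures.

```python
import numpy as np

def squared_loss(y, y_hat):
    """Per-example squared error, (y - y_hat)^2."""
    return (np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2

def mean_loss(y, y_hat):
    """Average of the per-example losses over the whole dataset."""
    return float(np.mean(squared_loss(y, y_hat)))

# The example from the text: desired output 0, prediction 2.
single = float(squared_loss(0.0, 2.0))

# Two hypothetical models that are equally "accurate" at a 0.5 boundary
# but differ in how close their outputs are to the desired values.
y = np.array([1.0, 1.0, 0.0])
confident = mean_loss(y, [0.99, 0.99, 0.01])
hesitant = mean_loss(y, [0.6, 0.6, 0.4])

print(single, confident, hesitant)
```

Both hypothetical models classify every example correctly, yet the loss separates them, which is the property accuracy lacks.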

Models as points

In post 1 we looked at the inputs as points. We also saw notation for a model as a function that took in points and parameters (values for weights and biases) to produce predictions: $\textbf{P} = f(\textbf{D}, \theta)$. Let's conceptualize the parameters θ as a point in $n$ dimensional space, where $n$ is the number of parameters in the model.

Model 1, a matrix multiplication of the dataset and a $2 \times 1$ matrix, can be represented as the point $(q, s)$.

$$ \begin{equation} \textbf{D}_{n \times 2} \left[ \begin{array}{} q \\ s \\ \end{array} \right] \Rightarrow \left[ \begin{array}{} \vdots \\ \hat y \\ \vdots\\ \end{array} \right]\\ \end{equation} $$

Model 1.) 2D->1D projection
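As a rough sketch of what "a model is a point" means in code, Model 1 can be written as a single matrix product. The four input points below are a hypothetical stand-in for the dataset from post 1, and `model_1` is a name I made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for the dataset D: 4 points in 2D (one per row).
D = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

def model_1(D, theta):
    """2D -> 1D projection: multiply the (n x 2) data by the column (q, s)."""
    return D @ np.asarray(theta, dtype=float)

# One particular model, i.e. one point (q, s) in parameter space.
theta = np.array([0.5, -0.25])
y_hat = model_1(D, theta)

print(y_hat)
```

Every choice of `theta` is a different model, so moving the point around the plane changes the predictions and therefore the loss.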

Figure 3.) Model 1 parameters as points during training (black to purple) on the diagonal dataset.

Figure 4.) The loss of models near the start and end points of training.

From figures 3 through 6 we can see how the loss decreases from the start to the end of training. The initial model could have been any point in 2D space, the parameter space, and if trained properly should always end up at a lower point on the loss landscape than it started. Figures 4, 5, and 6 all show the same data in slightly different ways.

Figure 5.) Loss (z axis) vs. models as points in the X-Y plane.

Figure 6.) Contour plot of the loss landscape.

Why train? Can't we pick the point with the lowest loss in the landscape? Calculating the loss for each point in figure 4 isn't free. For this model and dataset, it takes slightly more time than optimizing with SGD. As models get larger, searching for an optimum in this way quickly becomes intractable.
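Here is a minimal sketch of that brute-force search for a two-parameter model. The data, the targets, and the grid range are assumptions chosen so the best point is easy to check by hand; they are not the diagonal dataset from post 1.

```python
import numpy as np

# Hypothetical data and targets (y = x1 + x2, so the best model is q = s = 1).
D = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 2.0])

def mean_squared_loss(theta):
    """Average squared error of Model 1 with parameters theta = (q, s)."""
    return float(np.mean((y - D @ theta) ** 2))

# Evaluate the loss at every point of a 51 x 51 grid in parameter space.
qs = np.linspace(-2.0, 3.0, 51)
ss = np.linspace(-2.0, 3.0, 51)
landscape = np.array([[mean_squared_loss(np.array([q, s])) for q in qs]
                      for s in ss])

# Rows index s and columns index q, so unpack the argmin accordingly.
i, j = np.unravel_index(np.argmin(landscape), landscape.shape)
best_q, best_s = qs[j], ss[i]

print(landscape.size, best_q, best_s)
```

With 51 grid values per parameter, two parameters already cost 2,601 loss evaluations. The same resolution for a 13-parameter model would need 51 to the 13th power evaluations, which is why the grid approach stops being practical so quickly.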

How do we visualize larger models? Model 2 from visualizing NN predictions had an additional parameter $b$, so if we wanted to plot all of the parameters and the loss, $(q, s, b, \mathcal{L})$, we would need 4 dimensions. But Model 3 has 13 parameters, and models used on real problems can have thousands to hundreds of billions of parameters. Since we can't visualize every dimension of those loss landscapes, we'll need a way to visualize sections of the landscape that might be interesting.

Slicing the loss landscape

In the remaining sections we'll see 1D, 2D, and 3D slices of loss landscapes. We've seen models as 2D points in figures 3 through 6, which can be considered 2D slices. To understand that, let's look at a 1D slice of that model's 2D landscape for the Diagonal dataset.

Figure 7.) Interpolating from model A to model B

Figure 8.) Loss vs. position in the loss landscape relative to A and B. Instead of using $q$ and $s$ as the axes, we use $\alpha$.

In order to calculate the parameter values θ for each point along a slice, we'll take a weighted average (linear interpolation) of the parameters for model A, $θ_A$, and Model B, $θ_B$.

$$ \theta = (1-\alpha)θ_A + (\alpha)θ_B $$

Equation 1.) Linear interpolation of point $θ_A$ and point $θ_B$.
When $\alpha = 0$, $\theta = θ_A$, and when $\alpha = 1$, $\theta = θ_B$.
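Equation 1 translates almost directly into code. The sketch below computes the loss along a 1D slice between two models; the data and the start and end points are hypothetical values picked for illustration, not the ones behind figures 7 and 8.

```python
import numpy as np

# Hypothetical data and targets (the exact minimum is at q = s = 1).
D = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 2.0])

def mean_squared_loss(theta):
    return float(np.mean((y - D @ theta) ** 2))

def interpolate(theta_a, theta_b, alpha):
    """Equation 1: a weighted average of two points in parameter space."""
    return (1.0 - alpha) * theta_a + alpha * theta_b

theta_a = np.array([-1.0, 2.5])  # hypothetical model at the start of training
theta_b = np.array([1.0, 1.0])   # hypothetical model at the end of training

# Going slightly past both ends (alpha < 0 and alpha > 1) shows the
# surroundings of A and B, not just the segment between them.
alphas = np.linspace(-0.5, 1.5, 41)
losses = np.array([mean_squared_loss(interpolate(theta_a, theta_b, a))
                   for a in alphas])

print(alphas[np.argmin(losses)])
```

Plotting `losses` against `alphas` gives a curve of the same kind as figure 8, with $\alpha$ on the horizontal axis instead of $q$ and $s$.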

In figure 7, the points along the curved arc show the parameters of the model as the optimizer updates them in order to minimize the loss. The pink dot traces the line shown in figure 8. We could have chosen any slice, but if we're interested in the model's behavior while training, a slice showing the start and end point seems useful.

Visualizing More Complex Landscapes

We've seen how to take a 1D slice of a 2D parameter space. Now we'll look at 2D and 3D slices of the 13-dimensional parameter space of Model 3, a 2-layer neural network.

To make a 2D slice, we'll need 2 directions in space to define the slice. If we chose a second direction parallel (or nearly parallel) to the first, the resulting landscape would be misleadingly symmetrical and unintuitive, because the plot draws the two directions on perpendicular axes even though they barely differ.

Figure 9.) Nearly parallel slices of model 3's Loss Landscape on the XOR dataset.

Figure 10.) Perpendicular slices for Model 1's parameter space chosen using PCA.

Having slices perpendicular to each other will allow us to visualize more interesting features of the loss landscape. For the figures below I've used principal component analysis (PCA) to pick good slices. The result of PCA over the set of parameters at each training step gives vectors (directions in space) that indicate the major movements of the model during training. Figure 10 shows slices chosen with the help of PCA.
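One way to sketch this step is PCA computed with a singular value decomposition of the recorded parameters. Everything here is an assumption made for illustration: the trajectory is synthetic rather than a real training run of Model 3, the helper names are hypothetical, and centering on the mean is the textbook PCA choice, which may differ from what was used for the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a training run: 50 steps of a 13-parameter model.
steps = np.linspace(0.0, 1.0, 50)[:, None]
d1 = rng.normal(size=13)
d2 = rng.normal(size=13)
trajectory = steps * d1 + 0.3 * steps**2 * d2 + 0.01 * rng.normal(size=(50, 13))

def pca_directions(trajectory, k=2):
    """Return the top-k principal directions and their explained variance ratios."""
    centered = trajectory - trajectory.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return vt[:k], explained[:k]

def slice_point(origin, directions, a1, a2):
    """Parameters at coordinates (a1, a2) of the 2D slice through origin."""
    return origin + a1 * directions[0] + a2 * directions[1]

directions, explained = pca_directions(trajectory)
theta = slice_point(trajectory[-1], directions, 0.5, -0.5)

print(directions.shape, explained)
```

The rows of `vt` are orthonormal, so the two directions are perpendicular by construction. Evaluating the loss at `slice_point(...)` over a grid of `(a1, a2)` values would then produce a 2D slice of the landscape.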

Figure 11.) 2D slice of Model 3 on the XOR dataset as a contour plot

Figure 12.) 2D slice of model 3 on the XOR dataset as a 3D surface. z axis is loss. colored by loss as well.

In figure 12 we see 2 low points. The lowest is where the model ended up after training. The other appears to be what is referred to as a local minimum, a point where any slight change in the model results in a higher loss. This point turns out not to be a local minimum at all. It just so happens that this slice misses nearby points that have a lower loss. I confirmed this by starting the model at that point and verifying that more training resulted in an even lower loss. The animations below show how some apparent local minima get lower as the 2D slice moves.

Figure 13.) 3D slice of model 3 on the XOR dataset. $\alpha_1$ and $\alpha_2$ are the x and y axes and each frame of the animation is a step along the $\alpha_3$ slice.

Figure 14.) Same data as figure 13, visualized as a 2D slice of a 3D cube (Like visualizing an MRI).

The takeaway from figures 13 and 14 is that 2D slices of loss landscapes may be misleading. By extension, we should assume that 3D slices for models with 13 parameters or more are probably not perfect representations of the model's whole landscape.
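The same effect shows up in a toy example with only two parameters. The function below is a made-up loss with a saddle at the origin, not Model 3's loss; it is only meant to show how a slice can display a minimum that disappears once we step off the slice.

```python
import numpy as np

def toy_loss(theta):
    """Made-up 2-parameter loss with a saddle point at (0, 0)."""
    x, y = theta
    return x**2 - y**2 + 0.5 * y**4

# 1D slice along the x axis (y fixed at 0): looks like a simple bowl.
xs = np.linspace(-1.0, 1.0, 21)
slice_losses = np.array([toy_loss((x, 0.0)) for x in xs])
apparent_min = xs[np.argmin(slice_losses)]

# Stepping off the slice in the y direction finds a lower loss,
# so the bottom of the bowl was never a true local minimum.
on_slice = toy_loss((apparent_min, 0.0))
off_slice = toy_loss((apparent_min, 0.5))

print(apparent_min, on_slice, off_slice)
```

Along the slice, the origin has the lowest loss, but a small move in the unseen direction lowers the loss further, which mirrors what happened with the second low point in figure 12.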

So why do we care? Most applications of machine learning use loss minimization as the primary objective of training. A large portion of ML research can be seen as finding algorithms that are better at navigating loss landscapes, or finding models whose loss landscapes are easier to navigate. "Visualizing the loss landscape of neural networks" showed that adding skip connections can make the loss landscape smoother, and therefore easier to navigate. "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" showed that training with smaller batches of data results in models ending up in low points where the surrounding bowl is much wider, which correlates with a model performing well on unseen data. These aren't proofs of why these tools and procedures result in better models, but they are useful relationships to understand.

In summary

We defined loss as the result of a loss function on a model's output and the desired output.
We looked at model parameters as points.
We created loss landscapes by including the loss value for each point in a subset of the parameter space.
Lastly, we created low dimensional slices to visualize landscapes for complex models.

Beyond Loss

Understanding and manipulating the loss landscape is not everything in ML. Another major area of research is understanding if the data and loss functions appropriately represent a useful objective. Meaning, does a model with 0 loss get 100% accuracy? More importantly, how does such a model handle new data, and does it generate outputs that are useful, fair, and safe?

Feel free to reach out with any questions or comments on the tweet below and follow me @tuckerkirven to see announcements about other posts like this.

Starting ML felt like I was in the woods at night with only a glow in the dark compass.
I could head south, but not knowing anything about the woods made me uneasy. This post was inspired by the paper that was my flashlight. https://t.co/T6pJFUc18v #ML #MachineLearning
— Tucker Kirven 🧢 (@tuckerkirven) December 30, 2019