### Who cares about loss functions anyway?

For a given ML model, supervised learning roughly follows this loop. First, use the model's current parameters to make some predictions. Then compare the predictions with the correct labels and update the parameters to learn from the errors. Lastly, repeat the process from step one. Deep learning follows the same paradigm: the network weights are used to make a prediction, the loss function turns the prediction and the label into a loss value, and backpropagation passes the gradient of that loss back through the network (via the chain rule) to update the weights. It would not be unfair to say that the loss function directs the 'learning' in supervised machine learning.

As an analogy, imagine you are learning to make a free throw in basketball. After each attempt, the coach gives you one of four types of feedback:

1. Whether the ball went through the hoop. (1 or 0)
2. Whether the ball went too far left or right. (-1 for left, 1 for right, or 0 for making it)
3. The distance of the ball from the hoop center along with its orientation. (2d vector of distance in x, y plane)
4. The distance of the ball from the hoop along with personalized coaching on the shooting form. (2d vector along with complex instructions)

It is easy to see that, assuming the coach is experienced, feedback 4 will likely result in the fastest learning, whereas feedback 1 may not be useful at all. This demonstrates the importance of choosing the right loss function when training an ML model.

### Metric Vs Loss Function

While a model's loss function provides feedback on how it should improve, we often do not judge the model's actual performance with that loss function. We instead use a metric such as model accuracy. In the basketball example above, the metric we are interested in is how often one makes a free throw, while the best loss function is personalized coaching advice. In fact, using the metric as a loss function can often lead to slower, inefficient learning. With all that said, let's dig into an overview of the loss functions available in PyTorch.
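To make the difference concrete, here is a small sketch built on the basketball analogy. All numbers are hypothetical: `hoop_radius` and the two lists of landing positions are made up for illustration, and positions are reduced to one dimension for brevity.

```python
# Landing positions relative to the hoop centre (hypothetical numbers).
hoop_radius = 0.23  # assumption: anything within this distance counts as "made"
shots_a = [1.5, -1.2, 1.8, -1.6]   # far from the hoop
shots_b = [0.4, -0.3, 0.5, -0.35]  # close, but still misses

def hit_rate(shots):
    """Metric: fraction of shots that land inside the hoop."""
    return sum(abs(s) <= hoop_radius for s in shots) / len(shots)

def mean_miss_distance(shots):
    """Loss-like feedback: average absolute distance from the hoop centre."""
    return sum(abs(s) for s in shots) / len(shots)

print(hit_rate(shots_a), hit_rate(shots_b))  # 0.0 0.0
print(mean_miss_distance(shots_a), mean_miss_distance(shots_b))
```

Both shooters score zero on the metric, so the metric alone cannot tell that the second shooter is much closer to succeeding. The distance-based feedback can, which is the kind of signal a loss function needs to provide.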

### Regression Loss Functions

L1 Loss

L1 loss, also called the mean absolute error (MAE) when averaged, is simply the absolute difference between the label and the prediction. In the case of batch learning, the per-example losses are reduced by taking either their mean or their sum. We take the absolute value so that negative errors do not cancel out positive ones.

$\ell(x, y) = \sum_{i=1}^{N} \lvert x_i - y_i \rvert$
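As a rough sketch, the formula and its two batch reductions can be written in plain Python as follows. The helper name `l1_loss` and the sample values are assumptions for illustration. PyTorch exposes the same idea as `torch.nn.L1Loss`, whose `reduction` argument also accepts `'mean'` and `'sum'`; this sketch does not attempt to reproduce its tensor handling.

```python
def l1_loss(preds, labels, reduction="mean"):
    """Absolute error between predictions and labels, reduced over the batch."""
    errors = [abs(p - y) for p, y in zip(preds, labels)]
    if reduction == "sum":
        return sum(errors)
    if reduction == "mean":
        return sum(errors) / len(errors)
    raise ValueError(f"unknown reduction: {reduction!r}")

preds = [2.5, 0.0, 2.0]
labels = [3.0, -0.5, 4.0]
print(l1_loss(preds, labels, reduction="sum"))   # 0.5 + 0.5 + 2.0 = 3.0
print(l1_loss(preds, labels, reduction="mean"))  # 3.0 / 3 = 1.0
```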

It can be helpful to visualize L1 loss as the Manhattan distance between two points in a space (think of traveling by street and avenue), since the summed form has the same shape. Furthermore, L1 is a loss function for regression problems. It doesn't make much sense for classification: a label of 2 is not twice as far away as a label of 1, and the linear difference between a probability and a label is not meaningful either.

Note also that with L1 the loss grows linearly with the difference between prediction and label, so a difference of 2 incurs exactly twice the loss of a difference of 1. This leads to a bias toward the median of a distribution. For example, take a dataset of 8 items with 5 labels of 0 and 3 labels of 1: predicting 0 (the median) for all 8 items gives a summed L1 loss of 3, while predicting the mean value 0.375 for every item gives a summed L1 loss of 3.75. This tendency can make L1 ill-suited for more complex problems.
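The arithmetic in this example is easy to check. The sketch below uses only the standard library; `l1_sum` is a hypothetical helper that scores a single constant prediction against all eight labels.

```python
from statistics import mean, median

labels = [0, 0, 0, 0, 0, 1, 1, 1]  # 5 zeros and 3 ones

def l1_sum(constant_prediction):
    """Summed L1 loss when the same value is predicted for every item."""
    return sum(abs(constant_prediction - y) for y in labels)

print(median(labels), l1_sum(median(labels)))  # 0.0 3.0
print(mean(labels), l1_sum(mean(labels)))      # 0.375 3.75
```

Predicting the median costs less than predicting the mean, which is the sense in which L1 favors the median.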

MSE

Mean Squared Error (MSE) is closely related to the L2 norm: it is the squared L2 norm of the difference between prediction and label, divided by the number of elements. Norms are functions with certain properties, one of which is that they take a vector and return a non-negative value. The Euclidean distance we are familiar with is the L2 norm of the difference between two points. In the batch case, taking the mean of the squared errors before the square root gives us the Root Mean Square Error (RMSE):

$\ell(x, y) = \sqrt { \frac{\sum_{i=1}^{N} (x_i - y_i)^2}{N} }$

Because the square root is a relatively expensive operation and removing it does not move the minimum of the loss, PyTorch, like many other libraries, uses MSE instead of RMSE:

$\ell(x, y) = \frac{\sum_{i=1}^{N} (x_i - y_i)^2}{N}$
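A quick way to see that dropping the square root leaves the minimum in place is to scan a grid of constant predictions and compare where each loss is smallest. The labels and the grid resolution below are arbitrary assumptions, and a grid search is used only because it is easy to read, not because it is how one would optimize in practice.

```python
import math

labels = [1.0, 2.0, 6.0]  # illustrative labels with mean 3.0

def mse(prediction):
    """Mean squared error of one constant prediction against all labels."""
    return sum((prediction - y) ** 2 for y in labels) / len(labels)

def rmse(prediction):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(prediction))

# Candidate predictions 0.00, 0.01, ..., 8.00.
candidates = [i / 100 for i in range(0, 801)]
best_mse = min(candidates, key=mse)
best_rmse = min(candidates, key=rmse)
print(best_mse, best_rmse)  # 3.0 3.0
```

The square root is monotonically increasing, so it changes the loss values and the size of the gradients but not the location of the minimum.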

This is also a loss function for regression problems, as the squared distance between two boolean values makes even less sense than their absolute distance.

An important characteristic of MSE is that it penalizes large errors more heavily, since the loss increases quadratically as the distance between the prediction and the ground truth increases linearly. This is a double-edged sword. MSE penalizes predictions that are wildly off target more, which leads to a preference for smaller errors. On the other hand, outliers are weighted much more heavily under MSE than they would be under L1 loss. Unlike L1, MSE is minimized at the mean rather than the median of a dataset. It is also important to note that since we are squaring the difference, the loss is no longer on the same scale as the predictions and labels. This can cause numerical issues with very large features or predictions, and can lead to exploding gradients when backpropagation is applied to outliers or when data drift occurs in a production setting.
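The pull that an outlier exerts under each loss can be illustrated with a small, made-up dataset. In the sketch below, the labels (including the outlier of 100.0) and the search grid are assumptions, and the best constant prediction is found by brute force to keep the code short.

```python
from statistics import mean, median

labels = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is a deliberate outlier

def l1(prediction):
    """Mean absolute error of one constant prediction against all labels."""
    return sum(abs(prediction - y) for y in labels) / len(labels)

def mse(prediction):
    """Mean squared error of one constant prediction against all labels."""
    return sum((prediction - y) ** 2 for y in labels) / len(labels)

# Candidate predictions 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_l1 = min(candidates, key=l1)
best_mse = min(candidates, key=mse)
print(best_l1, median(labels))  # 3.0 3.0
print(best_mse, mean(labels))   # 22.0 22.0
```

A single outlier drags the MSE-optimal prediction from the bulk of the data all the way to 22.0, while the L1-optimal prediction stays at the median.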

Smooth L1

Smooth L1, which is closely related to the Huber loss function (the two coincide when $\beta = 1$), takes the best characteristics of both L1 and MSE loss. It is a piecewise function based on a parameter $\beta$, which defaults to 1 in PyTorch. Smooth L1 behaves like MSE when the absolute difference is below $\beta$ and like L1 otherwise. Of course, since Smooth L1 is built from L1 and MSE, it is also a regression loss: it still doesn't make sense to take the difference between a label and a probabilistic confidence.

$\ell(x, y) = \begin{cases} 0.5(x_i - y_i)^2 / \beta & \text{if } \lvert x_i - y_i \rvert < \beta \\ \lvert x_i - y_i \rvert - 0.5 \beta & \text{otherwise} \end{cases}$
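The piecewise definition translates directly into code. The following is a minimal per-element sketch in plain Python, with `smooth_l1` as a hypothetical helper name. PyTorch's `torch.nn.SmoothL1Loss` applies the same rule elementwise and then reduces over the batch, which this sketch leaves out.

```python
def smooth_l1(pred, label, beta=1.0):
    """Elementwise Smooth L1: quadratic below beta, linear at or above it."""
    diff = abs(pred - label)
    if diff < beta:
        return 0.5 * diff ** 2 / beta
    return diff - 0.5 * beta

print(smooth_l1(0.5, 0.0))  # 0.125 (quadratic branch)
print(smooth_l1(3.0, 0.0))  # 2.5   (linear branch)
print(smooth_l1(1.0, 0.0))  # 0.5   (the two branches meet at beta)
```

The `- 0.5 * beta` offset on the linear branch is what makes the two pieces join without a jump at a difference of exactly `beta`.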

When the difference is less than $\beta$, we get the quadratic weighting of the loss, where a smaller error is much preferable to a larger one. However, once the error exceeds $\beta$, the loss increases only linearly with the error. This prevents our loss function from being influenced too heavily by outliers, which can otherwise lead to gradient explosion. The $\beta$ value should therefore be chosen as the cutoff for what is considered an outlier prediction in our data, or from experience with exploded gradients.
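The effect on gradients can be seen by differentiating each loss by hand with respect to the difference between prediction and label. The helper names below are hypothetical, and the derivatives are for a single element (squared error without the batch mean), so treat this as a sketch of the behavior and not a reproduction of PyTorch's autograd output.

```python
def mse_grad(diff):
    """Derivative of diff**2 with respect to diff."""
    return 2.0 * diff

def smooth_l1_grad(diff, beta=1.0):
    """Derivative of Smooth L1 with respect to diff = prediction - label."""
    if abs(diff) < beta:
        return diff / beta             # quadratic region: grows with the error
    return 1.0 if diff > 0 else -1.0   # linear region: constant magnitude

for diff in [0.5, 2.0, 50.0]:
    print(diff, mse_grad(diff), smooth_l1_grad(diff))
```

For an outlier with a difference of 50, the squared-error gradient is 100 while the Smooth L1 gradient stays at 1, which is how the linear branch keeps a few extreme examples from dominating an update.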