
If you have a few years of experience in Computer Science or research, and you’re interested in sharing that experience with the community, have a look at our Contribution Guidelines.

1. Introduction

In this tutorial, we take a closer look at the 0-1 loss function, an important metric for the quality of binary and multiclass classification algorithms.

Generally, the loss function plays a key role in deciding if a machine learning algorithm is actually useful or not for a given data set.

2. Measuring Model Quality

In the machine learning process, we have various steps including cleaning the data and creating a prediction function:


The step we want to take a closer look at is tagged in green in the diagram above: measuring the quality of our prediction function. But how can we know whether the model we created actually represents the underlying structure of our data?

To measure the quality of a model, we can use three different metrics: loss, accuracy, and precision. There are many more, such as the F1 score, recall, and AUC, but we'll focus on these three.

The loss function takes the actual values and compares them with the predicted ones; there are several ways to make this comparison. A loss of 0 signifies a perfect prediction, while the interpretation of a nonzero loss depends on the given dataset and model. A popular loss function used in machine learning is the squared loss:

    \[\mathcal{L}_{sq} =\frac{1}{n}\sum^n_{i=1}(\tilde{y_i}-y_i)^2\]

In this formula, \tilde{y_i} is the correct result and y_i the predicted outcome. Squaring the differences guarantees only positive terms and magnifies large errors. As we can see, we also average the sum by dividing by n to compensate for the size of the dataset.
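As a quick illustration, the squared loss can be sketched in a few lines of Python. The function name and the sample values below are our own choices for demonstration, not part of the original text:

```python
def squared_loss(y_true, y_pred):
    """Mean squared loss between correct values and predictions."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# The third prediction is close to its true value, so the two
# small errors and one small error average out to a low loss:
print(squared_loss([1, 1, 0], [0.9, 0.8, 0.1]))  # ~0.02
```

Note how the squaring makes a single large error dominate the average, which is exactly the "magnifies large errors" behavior described above.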

Another statistical metric is accuracy, which directly measures how many of our predictions are right or wrong. Here the size of the error does not matter, only whether we predicted correctly. Accuracy can be calculated with the formula:

    \[\text{Accuracy}=\frac{\text{correct classifications}}{\text{all classifications}}\]

With our last metric, precision, we calculate how many of the instances we predicted as positive are actually positive. The formula for precision is given by:

    \[\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}\]

Here, TP denotes true positives and FP false positives.
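Both formulas translate directly into short Python functions. This is a minimal sketch assuming binary labels encoded as 0 and 1; the function names are our own:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision(y_true, y_pred):
    """Fraction of positive predictions that are actually positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp)
```

Note that precision only looks at the instances we predicted as positive, so a model that predicts "negative" almost everywhere can still score a high precision.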

3. Example

Let’s have a look at an example to get a better understanding of this process. Suppose we are working with a dataset of three pictures and we want to detect whether our pictures show dogs or not. Our machine learning model gives us a probability of a dog being displayed for each picture. In this example, it gives us 80% for the first picture, 70% for the second, and 20% for the third picture. Our threshold for recognizing the picture as a dog picture is 70% and all of the pictures are actually dogs.

We thus have a loss of \mathcal{L}_{sq} = \frac{1}{3}((1-0.8)^{2}+(1-0.7)^{2}+(1-0.2)^{2}) \approx 0.257, an accuracy of \frac{2}{3}, and a precision of 1, since we have 2 true positives and no false positives: the third picture is a false negative, not a false positive.
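Under the same assumptions (a threshold of 0.7 and all three pictures actually showing dogs), these numbers can be reproduced with a short Python sketch:

```python
probabilities = [0.8, 0.7, 0.2]   # model output per picture
threshold = 0.7
y_true = [1, 1, 1]                # all three pictures show dogs
y_pred = [1 if p >= threshold else 0 for p in probabilities]

# Squared loss compares the raw probabilities with the true labels:
loss = sum((t - p) ** 2 for t, p in zip(y_true, probabilities)) / 3

# Accuracy and precision work on the thresholded predictions:
acc = sum(t == p for t, p in zip(y_true, y_pred)) / 3
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

print(round(loss, 3))  # 0.257
print(acc)             # two of three pictures classified correctly
```

Since there are two true positives and no false positives, the precision here is 1; the misclassified third picture counts as a false negative.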

4. The 0-1 Loss Function

The 0-1 loss function is closely related to the accuracy we presented in Section 2: it measures the fraction of incorrect classifications and thus equals one minus the accuracy, even though its formula is often presented quite differently:

    \[\mathcal{L}_{01}(\tilde{y},y) =\frac{1}{n}\sum^n_{i=1} \delta_{\tilde{y_i} \ne y_i} \text{ with } \delta_{\tilde{y_i} \ne y_i}=\begin{cases} 1, & \mbox{if } \tilde{y_i} \ne y_i \\ 0, & \mbox{otherwise} \end{cases}\]
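In code, the 0-1 loss is simply the average number of misclassifications. A minimal Python sketch (the function name is our own):

```python
def zero_one_loss(y_true, y_pred):
    """Average number of misclassifications: 1 minus the accuracy."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

# One wrong prediction out of three:
print(zero_one_loss([1, 1, 1], [1, 1, 0]))  # 1/3
```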

A generalization of the 0-1 loss allows a different weighting of the error types. For example, when working with disease recognition, we might want as few false negatives as possible, so we could weight them more heavily with the help of a loss matrix:

    \[ A = \left[ {\begin{array}{cc} 0 & 5 \\ 0.5 & 0 \\ \end{array} } \right] \]

This loss matrix penalizes false negatives five times as heavily as a plain misclassification, and false positives only half as heavily.

The general loss matrix has the form:

    \[ A = \left[ {\begin{array}{cc} TN & FN \\ FP & TP \\ \end{array} } \right] \]
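The weighted loss can be sketched in Python as well. We assume here that the matrix is indexed as A[predicted][actual], matching the layout above, where rows list the costs for a negative and a positive prediction; the function name and the four sample labels are our own:

```python
def weighted_loss(y_true, y_pred, A):
    """Average loss under a cost matrix A indexed as A[predicted][actual]."""
    return sum(A[p][t] for t, p in zip(y_true, y_pred)) / len(y_true)

# Example matrix from the text: false negatives cost 5, false positives 0.5.
A = [[0, 5],     # predicted negative: [true negative, false negative]
     [0.5, 0]]   # predicted positive: [false positive, true positive]

# One costly false negative among four predictions:
print(weighted_loss([1, 0, 0, 1], [0, 0, 0, 1], A))  # 5 / 4 = 1.25
```

With A = [[0, 1], [1, 0]], this function reduces to the plain 0-1 loss, which shows that the loss matrix is a true generalization.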

One problem with the 0-1 loss function remains: it is not differentiable, so methods such as gradient descent cannot be applied to it directly. Fortunately, we can still use a broad palette of other classification algorithms, such as k-nearest neighbors or Naive Bayes.

5. Conclusion

In this article, we covered the properties of the 0-1 loss function in the context of machine learning metrics.
