1. Introduction

In this tutorial, we’ll explain the difference between the cost, loss, and objective functions in machine learning. However, we should note that there’s no consensus on the exact definitions and that the three terms are often used as synonyms.

2. Loss Functions

The loss function quantifies how much a model \boldsymbol{f}’s prediction \boldsymbol{\hat{y} \equiv f(\mathbf{x})} deviates from the ground truth \boldsymbol{y \equiv y(\mathbf{x})} for one particular object \mathbf{x}. So, when we calculate the loss, we do it for a single object in the training or test set.

There are many different loss functions we can choose from, and each has its advantages and shortcomings. In general, any distance metric defined over the space of target values can act as a loss function.

2.1. Example: the Square and Absolute Losses in Regression

Very often, we use the square(d) error as the loss function in regression problems:

    \[L_{square} \left( \hat{y}, y \right) = \left( \hat{y} - y \right)^2\]

For instance, let’s say that our model predicts a flat’s price (in thousands of dollars) based on the number of rooms, area (m^2), floor, and the neighborhood in the city (A or B). Let’s suppose that its prediction for \mathbf{x} = \begin{bmatrix} 4, 70, 1, A \end{bmatrix} is USD 110k. If the actual selling price is USD 105k, then the square loss is:

    \[L_{square}(110, 105) = (110-105)^2=5^2=25\]

Another loss function we often use for regression is the absolute loss:

    \[L_{abs} \left( \hat{y}, y \right) = | \hat{y} - y  |\]

In our example with apartment prices, its value will be:

    \[L_{abs} \left( 110, 105 \right) = | 110 - 105 | = |5| = 5\]
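Both losses are easy to express in code. The following is a minimal Python sketch; the function names square_loss and absolute_loss are our own choice for illustration and don’t come from any particular library:

```python
def square_loss(y_pred, y_true):
    """Squared difference between a single prediction and its target."""
    return (y_pred - y_true) ** 2


def absolute_loss(y_pred, y_true):
    """Absolute difference between a single prediction and its target."""
    return abs(y_pred - y_true)


# The flat from the example: predicted 110, sold for 105 (thousands of USD).
print(square_loss(110, 105))    # 25
print(absolute_loss(110, 105))  # 5
```

Note that each function takes exactly one prediction and one target, which mirrors the idea that a loss is defined for a single object.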

Choosing the loss function isn’t an easy task. Since the cost is built from the individual losses, the choice of loss plays a critical role in fitting a model.

3. Cost Functions

The term cost is often used as a synonym for loss. However, some authors draw a clear distinction between the two. For them, the cost function measures the model’s error on a group of objects, whereas the loss function deals with a single data instance.

So, if L is our loss function, then we calculate the cost function by aggregating the loss L over the training, validation, or test data \mathcal{D}= \left\{ (\mathbf{x}_i, y_i) \right\}_{i=1}^{n}. For example, we can compute the cost as the mean loss:

    \[Cost(f, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} L \left( \hat{y}_i, y_i \right) \qquad \left( \hat{y}_i = f(\mathbf{x}_i) \right)\]

But nothing stops us from using the median, a summary statistic that is less sensitive to outliers:

    \[Cost(f, \mathcal{D}) = \mathrm{median} \left\{ L \left( \hat{y}_i, y_i \right) \right\}_{i=1}^{n}\]
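The idea that a cost is just an aggregate of per-instance losses can be sketched in Python as follows. The names cost, loss, and aggregate are hypothetical, and the numbers are toy values chosen only for illustration:

```python
from statistics import mean, median


def square_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2


def cost(predictions, targets, loss=square_loss, aggregate=mean):
    """Aggregate the per-instance losses into a single number."""
    losses = [loss(p, t) for p, t in zip(predictions, targets)]
    return aggregate(losses)


# Toy values for illustration only; the last prediction is far off.
predictions = [1.0, 2.0, 6.0]
targets = [1.0, 3.0, 2.0]

mean_cost = cost(predictions, targets)                      # (0 + 1 + 16) / 3
median_cost = cost(predictions, targets, aggregate=median)  # 1.0
print(mean_cost, median_cost)
```

In this toy case, the single large error dominates the mean, while the median ignores it.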

Cost functions serve two purposes. First, the value of a cost function on the test data estimates our model’s performance on unseen objects. That allows us to compare different models and choose the best. Second, we use the cost to train our models.

3.1. Example: Cost as the Average Square Loss

Let’s say that we have the data on four flats and that our model predicted the sale prices \hat{y} as follows:

    \[\begin{matrix} \mathbf{x}_i & rooms & area & floor & neighborhood & y & \hat{y} \\ \hline \mathbf{x}_1 & 4 & 70 & 1 & A & 105 & 104.5\\ \mathbf{x}_2 & 2 & 50 & 2 & A & 83 & 91 \\ \mathbf{x}_3 & 1 & 39 & 5 & B & 50 & 65.3\\ \mathbf{x}_4 & 5 & 90 & 2 & A & 200 & 114 \end{matrix}\]

We can calculate the cost, i.e., the total loss of f over the data, as the mean square loss for individual flats:

    \[\frac{(104.5-105)^2 + (91-83)^2 + (65.3-50)^2 + (114-200)^2 }{4}= \frac{0.5^2+8^2+15.3^2+86^2}{4} = \frac{0.25+64+234.09+7396}{4} = \frac{7694.34}{4} =1923.585\]
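The same calculation can be reproduced with a few lines of Python. This is only a sketch; the variable names are our own, and the prices are the ones from the table (in thousands of USD):

```python
# Actual and predicted prices of the four flats, in thousands of USD.
y_true = [105, 83, 50, 200]
y_pred = [104.5, 91, 65.3, 114]

# Square loss of each flat, then the mean over all four.
squared_losses = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
mean_square_cost = sum(squared_losses) / len(squared_losses)

print(f"{mean_square_cost:.3f}")  # 1923.585
```

The fourth flat contributes 7396 of the total 7694.34, so a single poorly predicted instance dominates this cost.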

3.2. Other Examples of Cost

However, the mean square cost isn’t in the same units as y and \hat{y}. Since the prices are in thousands of dollars, the numerical value of the cost denotes millions of squared dollars. That’s a problem for interpretation since the square of a currency doesn’t make sense in the real world. We can address it by taking the square root of the mean square loss:

    \[\sqrt{1923.585} \approx 43.86\]

This particular cost function is known as the Root-Mean-Square Error (RMSE). We usually interpret it as the typical deviation of predictions from the ground truth. So, in our example, we conclude that the predicted flat prices are typically off by about USD 43,860. Using the mean absolute loss instead, we’d get a cost of:

    \[\frac{|104.5-105| + |91-83| + |65.3-50| + |114-200|}{4}= \frac{0.5+8+15.3+86}{4} = \frac{109.8}{4} =27.45\]

That is USD 27,450 per flat on average. Similarly, the root of the median square loss yields the cost of 12.208, i.e., approximately twelve thousand dollars.
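To compare the three costs side by side, we can compute them with Python’s standard library. The sketch below uses our own variable names and the same four actual and predicted prices as before:

```python
import math
from statistics import mean, median

# Actual and predicted prices of the four flats, in thousands of USD.
y_true = [105, 83, 50, 200]
y_pred = [104.5, 91, 65.3, 114]

squared_losses = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
absolute_losses = [abs(p - t) for p, t in zip(y_pred, y_true)]

rmse = math.sqrt(mean(squared_losses))           # root of the mean square loss
mae = mean(absolute_losses)                      # mean absolute loss
root_median = math.sqrt(median(squared_losses))  # root of the median square loss

print(f"{rmse:.2f}")         # 43.86
print(f"{mae:.2f}")          # 27.45
print(f"{root_median:.3f}")  # 12.208
```

The three numbers differ mainly in how strongly they react to the badly predicted fourth flat: the RMSE the most, the median-based cost the least.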

As we see, just as there are many ways to define a loss for a single object, there are multiple ways to combine the losses over a set of instances.

3.3. How to Remember the Difference Between the Loss and Cost?

Many newcomers to the field (and the experts alike) complain that the difference between the loss and cost is artificial and that they are often confused about which one is for what. A mnemonic trick is to remember that loss starts the same as lonely. So, the loss is for a single, lonely data instance, while the cost is for the set of objects.

4. Objective Functions

While training a model, we minimize the cost (loss) over the training data. However, its low value isn’t the only thing we should care about. The generalization capability is even more important since the model that works well only for the training data is useless in practice.

So, to avoid overfitting, we add a regularization term that penalizes the model’s complexity. That way, we get a new function to minimize during training:

    \[J( f, \mathcal{D} ) = Cost( f , \mathcal{D} ) + Regularizer( f )\]

In general, the objective function is the one we optimize, i.e., whose value we want to either minimize or maximize. The cost function, that is, the loss over a whole set of data, is not necessarily the one we’ll minimize, although it can be. For instance, we can fit a model without regularization, in which case the objective function is the cost function.

4.1. Example: the Loss, Cost, and the Objective Function in Linear Regression

Let’s say we are training a linear regression model:

    \[f( \mathbf{x} ) = \sum_{j=0}^{d} \theta_j x_j \qquad \left( \mathbf{x} = \begin{bmatrix} x_0 & x_1 & \ldots & x_d \end{bmatrix}^T \right)\]

We’ll assume the data are d-dimensional, and we prepend a dummy value x_0=1 to all the instances so that the intercept \theta_0 fits into the same sum.

Averaging the square loss over the training data \mathcal{D}= \left\{ (\mathbf{x}_i, y_i) \right\}_{i=1}^{n}, we get:

    \[\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i  \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=0}^{d} \theta_j x_{i,j} \right)^2\]

That’s our cost function, or, as we can also call it, the loss over the data \mathcal{D}. However, if we want to prevent f from overfitting \mathcal{D}, we can add a regularization term (the value of its parameter \lambda can be determined empirically):

    \[\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=0}^{d} \theta_j x_{i,j} \right)^2  +  \lambda \sum_{j=1}^{d} \theta_j^2  \qquad \lambda > 0\]

Therefore, the objective function we’ll minimize during training is the sum of the cost and the regularization penalty. Usually, we divide it by 2 to make the calculation of derivatives easier:

    \[\frac{1}{ \mathbf{2} n} \sum_{i=1}^{n} \left( y_i - \sum_{j=0}^{d} \theta_j x_{i,j} \right)^2  + \frac{ \lambda}{ \mathbf{2} } \sum_{j=1}^{d} \theta_j^2  \qquad \lambda > 0\]
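To see why the factor of 2 helps, we can differentiate the halved objective J with respect to a single parameter \theta_k. As a sketch, writing x_{i,k} for the k-th feature of the i-th instance (a notation we assume here), the 2 produced by differentiating each square cancels the 1/2 in front:

    \[\frac{\partial J}{\partial \theta_k} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right) x_{i,k} + \lambda \theta_k \qquad \left( k = 1, \ldots, d \right)\]

For the intercept \theta_0, which the penalty sum doesn’t include, the \lambda \theta_k term drops out and only the first sum remains.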

Let’s see how this works in our flat price prediction example.

4.2. Calculation Step 1: Get Predictions

Let’s suppose that we coded neighborhood A as 1 and B as 0. There are five parameters \theta_0, \theta_1, \ldots, \theta_4 in our regression model:

    \[f(\mathbf{x}) = \theta_0 + \theta_1 \cdot rooms + \theta_2 \cdot area + \theta_3 \cdot floor + \theta_4 \cdot neighborhood\]

where \mathbf{x} = \begin{bmatrix} 1 & rooms & area & floor & neighborhood \end{bmatrix}^T. Let’s say that the training data is the same as earlier:

    \[\begin{matrix} \mathbf{x}_i & rooms & area & floor & neighborhood & y \\ \hline \mathbf{x}_1 & 4 & 70 & 1 & 1 & 105 \\ \mathbf{x}_2 & 2 & 50 & 2 & 1 & 83  \\ \mathbf{x}_3 & 1 & 39 & 5 & 0 & 50 \\ \mathbf{x}_4 & 5 & 90 & 2 & 1 & 200 \end{matrix}\]

If \theta_0=50, \theta_1=5, \theta_2=0.2, \theta_3=0.5, \theta_4=20, a training algorithm would first get the model’s predictions:

    \[\begin{aligned} \hat{y}_1 &= 50 + 5 \cdot 4 + 0.2 \cdot 70 + 0.5 \cdot 1 + 20 \cdot 1 = 104.5 \\ \hat{y}_2 &= 50 + 5 \cdot 2 + 0.2 \cdot 50 + 0.5 \cdot 2 + 20 \cdot 1 = 91\\ \hat{y}_3 &= 50 + 5 \cdot 1 + 0.2 \cdot 39 + 0.5 \cdot 5 + 20 \cdot 0 = 65.3 \\ \hat{y}_4 &= 50 + 5 \cdot 5 + 0.2 \cdot 90 + 0.5 \cdot 2 + 20 \cdot 1 = 114 \\ \end{aligned}\]
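The prediction step can be sketched in Python as a dot product between the parameters and each instance. The names theta, flats, and predict are our own. Each row starts with the dummy value 1, and we take the third flat’s area as 39 m^2, the value used in the calculation above:

```python
# Parameters theta_0 ... theta_4 from the example.
theta = [50, 5, 0.2, 0.5, 20]

# Each row: [1 (dummy), rooms, area, floor, neighborhood (A = 1, B = 0)].
flats = [
    [1, 4, 70, 1, 1],
    [1, 2, 50, 2, 1],
    [1, 1, 39, 5, 0],  # area assumed to be 39, as in the hand calculation
    [1, 5, 90, 2, 1],
]


def predict(theta, x):
    """Linear model: the sum of theta_j * x_j over all features."""
    return sum(t * xj for t, xj in zip(theta, x))


predictions = [predict(theta, x) for x in flats]
print([round(p, 1) for p in predictions])  # [104.5, 91.0, 65.3, 114.0]
```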

4.3. Calculation Step 2: Compute the Objective Function

Then, it would calculate the cost. In Section 3.1, we computed the mean square loss of 1923.585 for the same predictions. So, the only thing remaining is the regularization term. Since the sum in the penalty starts at j=1, the intercept \theta_0 isn’t penalized. If we use \lambda=0.5 as the regularization parameter, the term is:

    \[0.5 \cdot (5^2 + 0.2^2 + 0.5^2 + 20^2) = 0.5 \cdot 425.29 = 212.645\]

So, the value of the objective function is:

    \[1923.585 + 212.645 = 2136.23\]

Since the objective function combines the cost and the regularization penalty, its value isn’t easy to interpret.
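The whole step can also be sketched in Python. The variable names are our own. The code follows the penalty formula from Section 4.1, whose sum starts at j=1, so it skips \theta_0, and it uses the undivided form of the objective (without the factor 1/2):

```python
theta = [50, 5, 0.2, 0.5, 20]  # theta_0 ... theta_4
lam = 0.5                      # regularization parameter lambda

# Actual and predicted prices of the four flats, in thousands of USD.
y_true = [105, 83, 50, 200]
y_pred = [104.5, 91, 65.3, 114]

# Cost: the mean square loss over the training data.
cost = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Penalty: lambda times the sum of squared parameters, excluding theta_0.
penalty = lam * sum(t ** 2 for t in theta[1:])

objective = cost + penalty
print(f"{cost:.3f}")       # 1923.585
print(f"{penalty:.3f}")    # 212.645
print(f"{objective:.2f}")  # 2136.23
```

Including \theta_0 in the penalty would add 0.5 \cdot 50^2 = 1250 to both the penalty and the objective, which shows how strongly an unscaled intercept can dominate the regularization term.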

5. Conclusion

In this article, we explained the meanings of the loss, cost, and objective functions. While some researchers and practitioners use the terms interchangeably, others differentiate between them.
