1. Overview

Softmax is an activation function commonly used in the output layer of multi-layer neural networks for classification tasks. It maps its input to a probability distribution.

In this tutorial, we’ll discuss why softmax is used instead of simple normalization in a neural network’s output layer.

2. Definition of Softmax

The softmax function is defined as:

    \[softmax(z_{i})=\frac{e^{z_{i}}}{\sum_{k}e^{z_{k}}}\]

It maps the elements of a vector z = [z_1, z_2, …, z_k] with k elements to a probability distribution: each element of the output vector lies in the range [0, 1], and the elements sum to 1.
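
As a quick illustration, here's a minimal NumPy sketch of the function; the example vector is arbitrary, and subtracting the maximum is a common numerical-stability trick that doesn't change the result:

    import numpy as np

    def softmax(z):
        # Subtract the maximum logit before exponentiating; this doesn't change
        # the result but prevents overflow when the logits are large.
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    print(softmax(np.array([1.0, 2.0, 3.0])))  # ≈ [0.0900 0.2447 0.6652], sums to 1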

3. Simple Normalization as an Alternative to Softmax?

In a typical multi-layer neural network, the output layer takes the output of the last hidden layer, a vector of raw scores, and computes the final output. These raw scores are also known as logits, and they are the unnormalized predictions of the model.

The output layer maps the logits to a probability distribution in a typical classification task. Softmax is a generalization of the sigmoid function for multiclass classification with more than two classes. Image classification and sentiment classification are typical examples of multiclass classification.

In principle, any other function that maps the logits to a probability distribution could be used instead of softmax. Standard normalization isn't a good choice since it rescales the data to a mean of 0 and a standard deviation of 1, so the rescaled output isn't a probability distribution.

Similarly, min-max normalization isn't suitable either: it rescales the data to the range [0, 1], but the rescaled values don't sum to 1, so it doesn't yield a probability distribution. Normalizations like standard and min-max normalization are better suited to feature scaling, where they help neural networks converge faster.
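
To make this concrete, here's a small sketch (the example vector is an arbitrary assumption) showing that neither standardization nor min-max scaling produces values that sum to 1:

    import numpy as np

    z = np.array([1.0, 2.0, 3.0])  # an arbitrary example vector

    # Standard normalization: mean 0, standard deviation 1.
    standardized = (z - z.mean()) / z.std()
    print(standardized, standardized.sum())  # [-1.2247  0.  1.2247], sum ≈ 0

    # Min-max normalization: values rescaled to [0, 1].
    min_max = (z - z.min()) / (z.max() - z.min())
    print(min_max, min_max.sum())            # [0.  0.5  1.], sum = 1.5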

However, a simpler normalization can be applied to the logits: dividing each logit by the sum of all logits. This is equivalent to computing each logit's share of the total:

    \[simple\_normalization(z_{i})=\frac{z_{i}}{\sum_{k}z_{k}}\]
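
In code, this is a one-liner; here's a minimal sketch analogous to the softmax helper above, again with an arbitrary example vector:

    import numpy as np

    def simple_normalization(z):
        # Divide each logit by the sum of all logits.
        return z / z.sum()

    print(simple_normalization(np.array([1.0, 2.0, 3.0])))  # ≈ [0.1667 0.3333 0.5], sums to 1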

We’ll compare this normalization with the softmax function in the subsequent section, and see whether it can be an alternative to softmax.

4. Comparison

As an example, let's take the following vector as the output of the last hidden layer of a multi-layer neural network:

    \[z = [0.6, 0.7, 3.1]\]

For example, the logits in this vector may correspond to the raw predictions of three kinds of images, namely a rabbit, a cat, and a dog. Applying simple normalization yields the following probability distribution:

    \[simple\_normalization (z) = [0.1364, 0.1591, 0.7045]\]

On the other hand, using softmax gives the following probability distribution:

    \[softmax(z) = [0.0700, 0.0774, 0.8527]\]
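
We can reproduce these numbers with the two helpers sketched earlier, repeated here so the snippet stays self-contained:

    import numpy as np

    def softmax(z):
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    def simple_normalization(z):
        return z / z.sum()

    z = np.array([0.6, 0.7, 3.1])
    print(np.round(simple_normalization(z), 4))  # [0.1364 0.1591 0.7045]
    print(np.round(softmax(z), 4))               # [0.07   0.0774 0.8527]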

The table below compares softmax and simple normalization:

| | Softmax | Simple Normalization |
|---|---|---|
| Amplifying Differences | Yes | No |
| Support of Negative Logits | Yes | No |
| Compatibility With the Cross-Entropy Loss Function | Yes | No |

Now, let’s go into the details.

4.1. Amplifying Differences

Softmax amplifies the differences between the logits because of exponentiation: it boosts the larger values in the input while squashing the smaller ones. Simple normalization, in contrast, keeps the proportions between the logits the same.

For example, the probability of the image being a dog is 0.8527 when we use the softmax function, but it’s 0.7045 when we use simple normalization. Similarly, softmax maps the unnormalized logit 0.6 to 0.0700, which is the probability of the image being a rabbit, while simple normalization maps it to 0.1364. Therefore, softmax provides a greater measure of confidence.

When we scale the input logits, the probability distribution that softmax produces changes. For example, when we multiply the input vector by 2, i.e., z = [1.2, 1.4, 6.2], we obtain the following probability distribution:

    \[softmax(z) = [0.0066, 0.0081, 0.9853]\]

Notably, the probability of the image being a dog increases from 0.8527 to 0.9853, while the probability of it being a rabbit decreases from 0.0700 to 0.0066. However, the probabilities don't change if we use simple normalization.
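
The following sketch, reusing the same helpers, verifies this behavior:

    import numpy as np

    def softmax(z):
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    def simple_normalization(z):
        return z / z.sum()

    z = np.array([0.6, 0.7, 3.1])

    # Doubling the logits sharpens the softmax output ...
    print(np.round(softmax(z), 4))      # [0.07   0.0774 0.8527]
    print(np.round(softmax(2 * z), 4))  # [0.0066 0.0081 0.9853]

    # ... but leaves the simply normalized output unchanged.
    print(np.round(simple_normalization(z), 4))      # [0.1364 0.1591 0.7045]
    print(np.round(simple_normalization(2 * z), 4))  # [0.1364 0.1591 0.7045]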

The differences in the scales of the logits may stem from the image's quality: logits with large values may indicate a crisp image with sharp contrast, while logits with small values may correspond to a blurry image with gradual transitions in contrast. Since softmax amplifies the differences between the logits, it's better suited to classifying such blurry images.

4.2. Negative Logits

Although the logits in our example are all positive numbers, there is no problem with a logit being negative. In this case, it isn’t possible to use simple normalization to obtain a probability distribution. For example, let’s reverse the signs of the first and last logits we’ve used before:

    \[z = [-0.6, 0.7, -3.1]\]

Applying simple normalization yields:

    \[simple\_normalization (z) = [0.2000, -0.2333, 1.0333]\]

Although the sum of the elements in the transformed vector is 1, the second element, -0.2333, is negative, and the last element, 1.0333, is greater than 1. Therefore, we can't interpret them as probabilities.

Softmax works fine with negative logits:

    \[softmax (z) = [0.2105, 0.7723, 0.0173]\]

Since the exponential function is monotonically increasing and maps every input to a positive number, softmax always maps the input vector to a valid probability distribution.
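
A short sketch with the same helpers illustrates the difference:

    import numpy as np

    def softmax(z):
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    def simple_normalization(z):
        return z / z.sum()

    z = np.array([-0.6, 0.7, -3.1])

    # Simple normalization yields values outside [0, 1], so the result isn't
    # a probability distribution even though it sums to 1.
    print(np.round(simple_normalization(z), 4))  # [ 0.2    -0.2333  1.0333]

    # Softmax still produces a valid probability distribution.
    print(np.round(softmax(z), 4))               # [0.2105 0.7723 0.0173]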

Furthermore, we can't use simple normalization when the sum of the logits is 0, since division by 0 is undefined.

4.3. Compatibility With the Cross-Entropy Loss Function

The cross-entropy loss function is commonly used in neural networks to calculate the difference between the true probability distribution and the estimated one:

    \[L=-\sum_{k}y_{k}\log p_{k}\]

Here, y_k is the true probability: it's 1 for the correct class and 0 otherwise. p_k is the estimated probability, i.e., the output of softmax.

In a typical neural network trained with backpropagation, for example a recurrent neural network, the gradient of the cross-entropy loss is propagated back through the hidden layers to adjust the weights and minimize the prediction error. This gradient with respect to the logits takes a particularly simple form because softmax and the cross-entropy loss function are mathematically related:

    \[\frac{\partial L}{\partial z_k} = p_k - y_k\]

Therefore, this simple equation leads to an efficient gradient computation during backpropagation.
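
As a sanity check, here's a minimal sketch that compares the analytic gradient p_k - y_k with a finite-difference approximation of the loss; the logits and the one-hot target are illustrative assumptions:

    import numpy as np

    def softmax(z):
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    def cross_entropy(z, y):
        # Cross-entropy between the one-hot target y and softmax(z).
        return -np.sum(y * np.log(softmax(z)))

    z = np.array([0.6, 0.7, 3.1])  # illustrative logits
    y = np.array([0.0, 0.0, 1.0])  # one-hot target: the third class is correct

    # Analytic gradient: p - y
    analytic = softmax(z) - y

    # Finite-difference approximation of dL/dz_k
    eps = 1e-6
    numeric = np.zeros_like(z)
    for k in range(len(z)):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[k] += eps
        z_minus[k] -= eps
        numeric[k] = (cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps)

    print(np.round(analytic, 4))  # [ 0.07    0.0774 -0.1473]
    print(np.round(numeric, 4))   # matches the analytic gradient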

5. Conclusion

In this article, we discussed why softmax is used instead of simple normalization in a neural network’s output layer.

Firstly, we learned the definition of softmax. We saw that it maps the input vector to a probability distribution. It amplifies the differences between the elements of the input vector, hence mapping large values to higher probabilities.

Then, we saw that the goal of the output layer in a neural network for classification tasks is to map the logits to a probability distribution. Therefore, softmax is an ideal activation function for the output layer. We saw that standard and min-max normalizations aren’t suitable activation functions since they don’t map their input to a probability distribution.

Finally, we compared softmax with simple normalization of the logits. Simple normalization requires all the logits to be positive, a condition that logits don't generally satisfy. Even when all the logits are positive, softmax is preferable since it amplifies the differences between them. Additionally, softmax leads to an efficient calculation of the gradients during backpropagation when it's used with the cross-entropy loss function.