1. Introduction

In this tutorial, we’ll talk about ADAM, an optimization algorithm we frequently use to train machine-learning models.

2. Optimization in Learning

When training models such as neural networks or support vector machines, we search for the parameter values that minimize the cost function, which quantifies how much the model's predictions deviate from the correct labels.

Over the years, many optimization algorithms have been proposed. Stochastic gradient descent (SGD) with mini-batches updates the parameters by evaluating the gradient only on a sample of the training data.

Momentum algorithms keep track of the gradient history and use the sequence of gradients at previous iterations to update the parameter vector at each new iteration.

Some of these algorithms are adaptive. That means they change the learning rate in each iteration according to the current gradient or the history of gradients recorded so far.

ADAM, whose name is short for adaptive moment estimation, combines all those ideas and is currently one of the most widely used training algorithms.

3. ADAM

The distinctive features of ADAM are:

  1. mini-batch gradient updates
  2. adaptive momentum
  3. adaptive learning rate
  4. bias correction

The first one on the list means we use a sample of training data to update the parameters in each iteration.

We’ll denote with \{(x_i^{(t)}, y_i^{(t)})\}_{i=1}^{m} the sample we use in the t-th iteration.

3.1. Adaptive Momentum

Let:

    \[\theta^{(t)}=\begin{bmatrix} \theta_1^{(t)} \\ \theta_2^{(t)} \\ \vdots \\ \theta_n^{(t)} \end{bmatrix}\]

be the parameter vector in the t-th iteration. The gradient of the cost function with respect to \theta^{(t)} is:

    \[g^{(t)}=\begin{bmatrix} g_1^{(t)} \\ g_2^{(t)} \\ \vdots \\ g_n^{(t)} \end{bmatrix}\]

In the mini-batch SGD, we update \theta^{(t)} by going in the direction opposite to g^{(t)}:

    \[\theta^{(t+1)} = \theta^{(t)} - \alpha^{(t)} g^{(t)}\]

where \alpha^{(t)} is the learning rate at iteration t. Usually, we let \alpha^{(t)} decay as t grows, for example exponentially.
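For concreteness, here's a minimal NumPy sketch of one such SGD step; the parameter and gradient values, the initial rate, and the exponential-decay factor below are made-up placeholders rather than settings from any particular model:

    import numpy as np

    def sgd_step(theta, grad, alpha0, decay, t):
        """One mini-batch SGD step with an exponentially decaying learning rate."""
        alpha_t = alpha0 * decay ** t    # learning rate at iteration t
        return theta - alpha_t * grad    # move against the mini-batch gradient

    # Toy example with placeholder values for a single iteration:
    theta = np.array([0.5, -1.2, 3.0])
    grad = np.array([0.1, -0.4, 0.2])    # gradient computed on the current mini-batch
    theta = sgd_step(theta, grad, alpha0=0.1, decay=0.99, t=3)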

ADAM follows a different strategy. Instead of only \boldsymbol{g^{(t)}}, it uses all the gradients computed before it: g^{(t-1)}, g^{(t-2)}, \ldots, g^{(1)}. The idea is to treat \theta^{(t)} as a particle moving through the parameter space: its previous gradients indicate its velocity and direction. Even if it gets stuck in a region with small gradients, the momentum from the previous steps is expected to push it in the right direction. In contrast, ordinary SGD stops as soon as the gradient reaches zero.

So, this is how ADAM defines the momentum vector:

    \[s^{(t)} = \beta_1 s^{(t-1)} + (1-\beta_1)  g^{(t)}\]

It's a linear combination of all the gradients computed up to iteration t, with the coefficients of the older gradients decaying exponentially.
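In code, the momentum update is just an exponential moving average of the mini-batch gradients. The sketch below assumes \beta_1 = 0.9, a commonly used value, and feeds in two made-up gradients:

    import numpy as np

    def update_momentum(s_prev, grad, beta1=0.9):
        """Exponential moving average of the gradients (ADAM's momentum vector s)."""
        return beta1 * s_prev + (1.0 - beta1) * grad

    s = np.zeros(3)                               # s^(0) = 0
    for grad in [np.array([1.0, 0.0, -2.0]),
                 np.array([0.8, 0.1, -1.5])]:     # placeholder mini-batch gradients
        s = update_momentum(s, grad)              # blends the old direction with the new gradient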

3.2. Bias Correction

The vector s^{(t)} estimates the expected value of the gradient. We initialize it to the zero vector, so its computation history is as follows:

    \[\begin{aligned} s^{(0)} &= 0 \\ s^{(1)} &= (1-\beta_1) g^{(1)} \\ s^{(2)} &= \beta_1(1-\beta_1) g^{(1)} + (1-\beta_1) g^{(2)} \\ s^{(3)} &= \beta_1^2(1-\beta_1) g^{(1)} + \beta_1(1-\beta_1) g^{(2)} + (1-\beta_1) g^{(3)} \\ \ldots & \\ s^{(t)} &= (1-\beta_1)\sum_{k=1}^{t}\beta_1^{t-k}g^{(k)} \end{aligned}\]

Taking the expectation of both sides of the last equation, we get:

    \[E\left[s^{(t)}\right] = (1-\beta_1)\sum_{k=1}^{t}\beta_1^{t-k}E \left[g^{(k)} \right]\]

Each expectation E\left[ g^{(k)} \right] can differ from E\left[ g^{(t)} \right]. We can account for all the (weighted) deviations with a constant D:

    \[E\left[s^{(t)}\right] = E \left[g^{(t)} \right] (1-\beta_1)\sum_{k=1}^{t}\beta_1^{t-k} + D\]

which we can keep small if \beta_1 results in small coefficients for less recent gradients.

Since \sum_{k=1}^{t}\beta_1^{t-k} = \frac{1-\beta_1^t}{1-\beta_1}, the equation comes down to:

    \[E\left[s^{(t)}\right] \approx E \left[g^{(t)} \right] (1-\beta_1^t)\]

So, to correct \boldsymbol{s^{(t)}} for the bias, we divide it by \boldsymbol{1-\beta_1^t}. The update step is then:

    \[\begin{cases} s^{(t)} = \beta_1 s^{(t-1)} + (1-\beta_1)  g^{(t)} \\ \widehat{s}^{(t)} = \frac{1}{1-\beta_1^t}s^{(t)} \\ \theta^{(t+1)} = \theta^{(t)} - \alpha^{(t)} \widehat{s}^{(t)} \end{cases}\]
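A quick numerical check shows why the correction matters. With \beta_1 = 0.9 and a gradient that we keep constant purely for illustration, the uncorrected s^{(1)} is only a tenth of the true gradient, while dividing by 1-\beta_1^t restores the correct scale at every iteration:

    import numpy as np

    beta1 = 0.9
    g = np.array([1.0, -2.0])            # constant gradient, for illustration only

    s = np.zeros_like(g)
    for t in range(1, 4):
        s = beta1 * s + (1 - beta1) * g  # biased estimate: starts close to zero
        s_hat = s / (1 - beta1 ** t)     # bias-corrected estimate: equals g here
        print(t, s, s_hat)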

3.3. Adaptive Learning Rate

The final ingredient is the adaptive learning rate \alpha^{(t)}. In ADAM, each parameter has its own learning rate, so \alpha^{(t)} is a vector of rates that multiplies \widehat{s}^{(t)} element-wise.

What do we want to achieve with adaptive rates? The idea is that if a dimension is rarely updated and has many near-zero gradients in its history, we should allow it a higher learning rate. In ADAM, each parameter's learning rate is inversely proportional to the square root of its accumulated squared gradients, which we track with an exponentially weighted average:

    \[r^{(t)} = \beta_2 r^{(t-1)} + (1-\beta_2) (g^{(t)} \odot g^{(t)})\]

where \odot is element-wise multiplication.

More specifically, r^{(t)} estimates the (uncentered) second moment of each gradient element, so the corresponding rate is inversely proportional to the square root of that estimate.
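We can verify this interpretation with a small experiment: feeding randomly drawn gradients into the r^{(t)} update and comparing the bias-corrected result with the empirical mean of the squared gradients. The normal distribution below is an arbitrary choice made only for the demonstration:

    import numpy as np

    rng = np.random.default_rng(0)
    beta2 = 0.999
    grads = rng.normal(loc=0.5, scale=0.2, size=5000)  # synthetic gradient history

    r = 0.0
    for g in grads:
        r = beta2 * r + (1 - beta2) * g * g            # running average of squared gradients

    r_hat = r / (1 - beta2 ** len(grads))              # bias-corrected second moment
    print(r_hat, np.mean(grads ** 2))                  # both approximate E[g^2]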

Just as we did with s^{(t)}, we need to correct r^{(t)}:

    \[\widehat{r}^{(t)} = \frac{1}{1-\beta_2^t} r^{(t)}\]

The derivation is the same as for s^{(t)}.

Finally, the update step is:

    \[\begin{cases} s^{(t)} = \beta_1 s^{(t-1)} + (1-\beta_1)  g^{(t)} \\ \widehat{s}^{(t)} = \frac{1}{1-\beta_1^t} s^{(t)} \\ r^{(t)} = \beta_2 r^{(t-1)} + (1-\beta_2) (g^{(t)} \odot g^{(t)}) \\ \widehat{r}^{(t)} = \frac{1}{1-\beta_2^t} r^{(t)} \\ \theta^{(t+1)} = \theta^{(t)} - \epsilon \frac{\widehat{s}^{(t)}}{\sqrt{\widehat{r}^{(t)}} + \delta} \end{cases}\]

where \epsilon is the step size, and \delta is a small constant we use to avoid division by zero. In practice, \beta_1 = 0.9, \beta_2 = 0.999, and \delta = 10^{-8} are common default values.
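To see the adaptive rates at work, we can run the update on two parameters, one with consistently large gradients and one with mostly near-zero gradients. The gradients, \epsilon = 0.001, and \delta = 10^{-8} below are placeholder choices; the point is only that the rarely updated dimension ends up with the larger effective step:

    import numpy as np

    beta1, beta2 = 0.9, 0.999
    eps, delta = 1e-3, 1e-8

    # Placeholder gradients: element 0 is updated often, element 1 hardly at all.
    grads = [np.array([1.0, 0.0]), np.array([0.8, 0.0]), np.array([1.2, 0.05])]

    theta = np.zeros(2)
    s, r = np.zeros(2), np.zeros(2)
    for t, g in enumerate(grads, start=1):
        s = beta1 * s + (1 - beta1) * g
        r = beta2 * r + (1 - beta2) * g * g
        s_hat = s / (1 - beta1 ** t)
        r_hat = r / (1 - beta2 ** t)
        theta -= eps * s_hat / (np.sqrt(r_hat) + delta)

    # The effective per-parameter rate eps / (sqrt(r_hat) + delta) is much
    # larger for element 1, whose squared-gradient history is smaller.
    print(eps / (np.sqrt(r_hat) + delta))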

3.4. Pseudocode

ADAM's pseudocode boils down to a single loop: in each iteration, we sample a mini-batch, compute its gradient, update the momentum vector s^{(t)} and the squared-gradient vector r^{(t)}, correct both for bias, and take the parameter step, repeating until a termination criterion is met.

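Here's a minimal Python sketch of that loop. The function adam and its arguments are illustrative rather than canonical: grad_fn stands in for whatever routine samples a mini-batch and returns its gradient, and the default constants follow the commonly used values mentioned above:

    import numpy as np

    def adam(grad_fn, theta0, eps=1e-3, beta1=0.9, beta2=0.999,
             delta=1e-8, max_iter=1000, tol=1e-6):
        """Sketch of ADAM; grad_fn(theta, t) returns the mini-batch gradient g^(t)."""
        theta = theta0.astype(float)
        s = np.zeros_like(theta)              # first moment (momentum)
        r = np.zeros_like(theta)              # second moment (squared gradients)
        for t in range(1, max_iter + 1):
            g = grad_fn(theta, t)             # gradient on the t-th mini-batch
            s = beta1 * s + (1 - beta1) * g
            r = beta2 * r + (1 - beta2) * g * g
            s_hat = s / (1 - beta1 ** t)      # bias corrections
            r_hat = r / (1 - beta2 ** t)
            step = eps * s_hat / (np.sqrt(r_hat) + delta)
            theta = theta - step
            if np.linalg.norm(step) < tol:    # one possible termination criterion
                break
        return theta

    # Toy usage: the cost 0.5 * ||theta||^2 has gradient theta (no mini-batch noise here).
    theta_opt = adam(lambda theta, t: theta, np.array([5.0, -3.0]))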

There are various termination criteria to use. For instance, we can set the maximum number of iterations or stop the algorithm if the difference between two consecutive \theta vectors is smaller than a predefined constant.

4. Conclusion

In this article, we explained how ADAM works.

ADAM is an adaptive optimization algorithm we use for training machine-learning models. It uses the history of mini-batch gradients to find the direction of its update steps and tune its learning rates.
