Recall that the update rule for RMSProp is:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$$

$v_t$ is essentially the weighted average of the squares of all the gradients to date. If we remove the weights, i.e. $v_t = \sum_{i=1}^{t} g_i^2$, it is the same as calculating the (squared) $L_2$ norm of the gradient history. Instead of taking the $L_2$ norm, we can take the $L_p$ norm, where each current gradient would be taken to the power $p$. The $L_\infty$ norm of any vector works out to its largest element in absolute value. In the calculation of $v_t$, we sum up the squares of the current gradient and the past gradients. If we apply the $L_\infty$ norm (also known as the max norm) here instead, the accumulator, call it $u_t$, would be the maximum (in magnitude) of the current gradient and all the past gradients. The update rule for MaxProp therefore is:

$$u_t = \max(u_{t-1}, |g_t|)$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t}\, g_t$$
- Since this is a modification of RMSProp, we use a constant initial learning rate $\eta$.
- Unlike Adam, we don’t use bias-correction in MaxProp. Suppose $u_0 = 0$ and the first gradient is $g_1$; then $u_1 = \max(u_0, |g_1|) = |g_1|$, which means that $u_t$ is not biased towards zero even though it’s initialised to zero.
Moving on, one benefit of using the max norm instead of the $L_2$ norm is that it keeps the learning rate constant when the gradient doesn’t change. This is especially relevant for sparse features, where the gradient is zero on most steps. If we had used the $L_2$ norm (RMSProp) instead, the learning rate would gradually increase (even with bias-correction), even though the gradient hasn’t changed; the sketch below illustrates this on a sparse gradient.
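To make this concrete, here is a minimal NumPy sketch of one MaxProp step as derived above. The function name `maxprop_update`, the default learning rate, and the small `eps` added to the denominator are illustrative assumptions rather than part of the derivation.

```python
import numpy as np

def maxprop_update(theta, grad, u, lr=0.001, eps=1e-8):
    """One MaxProp step: RMSProp's weighted average of squared gradients
    is replaced by a running maximum of gradient magnitudes."""
    u = np.maximum(u, np.abs(grad))        # u_t = max(u_{t-1}, |g_t|)
    theta = theta - lr * grad / (u + eps)  # effective step size stays eta / u_t
    return theta, u

# Toy run with a sparse gradient: u keeps its maximum of 10 on the first
# coordinate, so the effective learning rate lr / u does not drift upward
# on the later steps where that gradient is zero.
theta, u = np.zeros(2), np.zeros(2)
for grad in [np.array([10.0, 0.0]), np.zeros(2), np.zeros(2)]:
    theta, u = maxprop_update(theta, grad, u)
```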
While the MaxProp algorithm modified RMSProp by taking the max norm instead of the $L_2$ norm, we can modify the Adam algorithm in the same manner. This algorithm is called AdaMax, and its update rule (where the max is now exponentially weighted by $\beta_2$, following the Adam paper) is:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

$$u_t = \max(\beta_2 u_{t-1}, |g_t|)$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{1 - \beta_1^t} \cdot \frac{m_t}{u_t}$$
- Notice that there is no bias correction for $u_t$, as it’s not susceptible to an initial bias towards zero; only the first moment $m_t$ is bias-corrected, via the $1 - \beta_1^t$ factor.
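For completeness, here is a matching NumPy sketch of one AdaMax step. The hyperparameter defaults ($\eta = 0.002$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) follow the values suggested in the Adam paper, while the `eps` guard in the denominator is my own addition for numerical safety.

```python
import numpy as np

def adamax_update(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax step: Adam with its L2-based second moment replaced by an
    exponentially weighted infinity norm."""
    m = beta1 * m + (1 - beta1) * grad       # first moment, exactly as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))  # u_t = max(beta2 * u_{t-1}, |g_t|), no bias correction
    theta = theta - (lr / (1 - beta1 ** t)) * m / (u + eps)  # bias-correct only m_t
    return theta, m, u

# Toy usage: t starts at 1 so the 1 - beta1**t correction is well defined.
theta, m, u = np.zeros(2), np.zeros(2), np.zeros(2)
for t, grad in enumerate([np.array([10.0, 0.0]), np.array([1.0, 1.0])], start=1):
    theta, m, u = adamax_update(theta, grad, m, u, t)
```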