Both ADAGRAD and RMSProp are sensitive to the initial learning rate and may never converge if the value chosen at the start is unsuitable. ADADelta avoids setting an initial learning rate altogether: in its place it uses a decaying average of the past squared parameter updates. The update rule for this algorithm is:

$$v_t = \beta v_{t-1} + (1 - \beta)\, g_t^2$$

$$\Delta\theta_t = -\frac{\sqrt{u_{t-1} + \epsilon}}{\sqrt{v_t + \epsilon}}\, g_t$$

$$u_t = \beta u_{t-1} + (1 - \beta)\, \Delta\theta_t^2$$

$$\theta_{t+1} = \theta_t + \Delta\theta_t$$

where $g_t$ is the gradient, $v_t$ accumulates the squared gradients, $u_t$ accumulates the squared updates, and $\beta$ is the decay (momentum) factor.
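To make the rule concrete, here is a minimal NumPy sketch of a single ADADelta step. The function name and the values $\beta = 0.9$, $\epsilon = 10^{-6}$ are illustrative choices, not prescribed by the rule above.

```python
import numpy as np

def adadelta_step(theta, grad, v, u, beta=0.9, eps=1e-6):
    """One ADADelta update; v and u start as zeros shaped like theta."""
    # Decaying average of squared gradients (the denominator statistic).
    v = beta * v + (1 - beta) * grad**2
    # Scale the gradient by RMS of past updates over RMS of gradients --
    # note that no global learning rate appears in this expression.
    delta = -np.sqrt(u + eps) / np.sqrt(v + eps) * grad
    # Decaying average of squared updates (the numerator statistic), refreshed
    # after the step, so the next iteration uses the value from this one.
    u = beta * u + (1 - beta) * delta**2
    return theta + delta, v, u
```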
- The effective learning rate here is $\frac{\sqrt{u_{t-1} + \epsilon}}{\sqrt{v_t + \epsilon}}$ instead of $\frac{\eta}{\sqrt{v_t + \epsilon}}$, which means that $\eta$ gets replaced by $\sqrt{u_{t-1} + \epsilon}$.
We can see that $\sqrt{u_{t-1} + \epsilon}$ is a function of $u_{t-1}$, which in turn is a function of the past updates to the weights (and therefore of past gradients), as opposed to $\eta$, which is a constant. This means that the numerator of the effective learning rate isn't constant and can change depending on the gradient/slope.
Also notice that at a particular iteration the numerator of the learning rate takes the accumulated history of updates (and hence of gradients) only up to the previous time step, which is why we take $u_{t-1}$. This means that $u_{t-1}$ is one iteration behind $v_t$. So how can this help in adapting the learning rate to the slope/gradient of the region we are currently in? Both $u_t$ and $v_t$ increase after each iteration, but the magnitude of $u_t$ is less than that of $v_t$, as it takes only a fraction of each squared gradient, moderated by the (squared) effective learning rate.
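Squaring the update rule makes the "fraction of the squared gradient" point explicit:

$$\Delta\theta_t^2 = \frac{u_{t-1} + \epsilon}{v_t + \epsilon}\, g_t^2$$

As long as $u_{t-1} < v_t$, the quantity accumulated into $u_t$ is a scaled-down copy of the squared gradient accumulated into $v_t$, which is why $u_t$ trails $v_t$ in magnitude.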
In steep regions, where the gradient is high, $v_t$ would zoom ahead, but $u_{t-1}$ isn't too far back since it is only one iteration behind, ensuring that the effective learning rate wouldn't decay as aggressively, even though $v_t$ increases fast.
Now, if we move to a flatter region, $v_t$ will reduce due to the decay factor $\beta$, which keeps forgetting the old, large gradients. However, since $u_{t-1}$ is behind $v_t$, it would not have decreased so much, and the ratio of the numerator to the denominator would start to increase. If the gradient remains low for long enough, the learning rate would increase.
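A quick way to see this behaviour is to feed ADADelta a synthetic gradient stream that is steep for a while and then flat, and watch the effective learning rate $\sqrt{u_{t-1} + \epsilon}\,/\,\sqrt{v_t + \epsilon}$. The stream and hyperparameters below are made up purely for illustration.

```python
import numpy as np

beta, eps = 0.9, 1e-6
v = u = 0.0

# Made-up 1-D gradient stream: 50 steps in a steep region, then 50 in a flat one.
grads = [5.0] * 50 + [0.05] * 50

for t, g in enumerate(grads, start=1):
    v = beta * v + (1 - beta) * g**2
    lr_eff = np.sqrt(u + eps) / np.sqrt(v + eps)  # numerator uses u from step t-1
    delta = -lr_eff * g
    u = beta * u + (1 - beta) * delta**2
    if t % 25 == 0:
        print(f"step {t:3d}  |g|={abs(g):.2f}  effective lr={lr_eff:.6f}")
```

The printed effective learning rate stays small during the steep stretch and rises once the gradients become small, as described above.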
Thus, ADADelta allows the numerator of the effective learning rate, which is kept constant in ADAGRAD and RMSProp, to vary depending on the previous gradients. The learning rate therefore decays more slowly.
This is because $u_t$ changes in proportion to $v_t$, keeping the denominator's influence on the step size in check.
Comparing the two, with a fixed learning rate $\eta$ for RMSProp and only the decay factor $\beta$ and $\epsilon$ for ADADelta, ADADelta converges faster as its learning rate doesn't decay as quickly.
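For readers who want to reproduce a comparison like this, below is a rough sketch that tracks the effective learning rates of RMSProp and ADADelta side by side on the same made-up gradient stream; the stream and all hyperparameter values ($\eta$, $\beta$, $\epsilon$) are illustrative choices, not the settings used for the comparison above.

```python
import numpy as np

eta, beta, eps = 0.01, 0.9, 1e-6     # illustrative hyperparameters
v_rms = v_ada = u_ada = 0.0

# Made-up gradient stream: a long steep stretch followed by a flat one.
grads = [4.0] * 100 + [0.1] * 100

for t, g in enumerate(grads, start=1):
    # RMSProp: the numerator of the effective learning rate is the fixed eta.
    v_rms = beta * v_rms + (1 - beta) * g**2
    lr_rms = eta / np.sqrt(v_rms + eps)

    # ADADelta: the numerator is the RMS of the past updates.
    v_ada = beta * v_ada + (1 - beta) * g**2
    lr_ada = np.sqrt(u_ada + eps) / np.sqrt(v_ada + eps)
    delta = -lr_ada * g
    u_ada = beta * u_ada + (1 - beta) * delta**2

    if t % 50 == 0:
        print(f"step {t:3d}  RMSProp lr={lr_rms:.5f}  ADADelta lr={lr_ada:.5f}")
```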