The Gradient Descent Algorithm initialises the parameters of a neural network (the weights $w$ and the bias $b$) randomly and then updates them in the direction opposite to the gradient of the loss function taken with respect to those parameters.
The derivative of a function is essentially its sensitivity to small changes in the variable it is taken with respect to: the steeper a function's curve, the larger its derivative. A line with slope 4, for example, is steeper than one with slope 2, and its derivative (4 versus 2), unsurprisingly, is larger at every $x$.
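To make this concrete, here is a small Python sketch (not part of the original text) that approximates the derivative numerically with a central difference; the functions `shallow` and `steep` and the step size `h` are illustrative choices.

```python
# A minimal sketch: approximate f'(x) with a central difference and compare
# a shallow function with a steeper one. All names and values here are
# illustrative, not from the original text.

def numerical_derivative(f, x, h=1e-6):
    """Central-difference approximation of the derivative f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

shallow = lambda x: 2 * x  # slope 2 everywhere
steep = lambda x: 4 * x    # slope 4 everywhere -- a steeper curve

for x in [0.5, 1.0, 3.0]:
    print(x, numerical_derivative(shallow, x), numerical_derivative(steep, x))
# The steeper function's derivative (~4) exceeds the shallower one's (~2)
# at every point, matching the intuition above.
```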
Recall the weight-update rule for the Gradient Descent Algorithm:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$

where $\alpha$ is the learning rate and $L$ is the loss function.
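As a sketch, this rule is a single line of code per parameter. The snippet below assumes a single weight and bias, a learning rate `alpha`, and gradients `dL_dw` and `dL_db` already computed elsewhere (e.g. by backpropagation); all names and values are illustrative.

```python
# A minimal sketch of one gradient-descent update, assuming the gradients of
# the loss with respect to w and b have already been computed elsewhere.

def gradient_descent_step(w, b, dL_dw, dL_db, alpha=0.01):
    """Move each parameter in the direction opposite to its gradient."""
    w = w - alpha * dL_dw
    b = b - alpha * dL_db
    return w, b

# Example with made-up values: a large gradient produces a large step,
# a small (flat-region) gradient produces a tiny one.
w, b = gradient_descent_step(w=0.5, b=-0.2, dL_dw=1.8, dL_db=-0.4)
print(w, b)  # 0.482, -0.196
```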
We can see that the magnitude of the change to the parameters depends on the gradient: a large gradient moves the weights and bias towards convergence faster. Since we initialise the parameters randomly, it is possible that they land at a point where the loss curve is rather flat. The gradient there is small, so the updates are small too, and the algorithm needs many more iterations to reach the convergence point. So, how do we speed up the updates when the parameters are initialised at an unfavourable place on the curve?
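To see the effect of a flat start, here is a toy Python sketch: the loss $L(w) = 1 - e^{-w^2}$ has a flat tail far from its minimum at $w = 0$, so a parameter initialised there receives tiny gradients and barely moves, while one initialised on the steep part of the curve converges quickly. The loss, learning rate, and step count are all illustrative choices, not from the original text.

```python
import math

# A toy sketch: L(w) = 1 - exp(-w**2) is flat far from its minimum at w = 0,
# so gradient descent started on the flat tail moves very slowly. The loss,
# learning rate and step count are illustrative.

def grad(w):
    """dL/dw for L(w) = 1 - exp(-w**2)."""
    return 2 * w * math.exp(-w ** 2)

def run(w0, alpha=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= alpha * grad(w)  # the plain weight-update rule from above
    return w

print(run(0.5))  # steep start: ends close to the minimum at 0
print(run(3.0))  # flat start: barely moves after the same 100 steps
```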