Going back to the original question, we’d like to find $a$ and $b$ such that the loss is minimal. If we vectorize these 2 quantities and initialize them as $\theta = [a, b]$, how can we change $\theta$ (by some $\Delta\theta$) such that the loss function reduces with the new $\theta$? The answer lies in the Taylor Series. Further, since the Taylor Series works only within small intervals, we can moderate the change in $\theta$ using a learning rate $\eta$ such that $\theta_{new} = \theta + \eta \cdot \Delta\theta$.
For ease of notation, if we take $\Delta\theta = u$, we have (from the Taylor Series):

$$\mathcal{L}(\theta + \eta u) = \mathcal{L}(\theta) + \eta \, u^T \nabla_\theta \mathcal{L}(\theta) + \frac{\eta^2}{2!} \, u^T \nabla^2_\theta \mathcal{L}(\theta) \, u + \dots$$
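To make this concrete, here is a small numerical sketch (the quadratic loss, the starting point and the direction $u$ are made up for illustration, not taken from the discussion above) that checks how close the first-order Taylor approximation gets to the true loss after a small step $\eta u$:

```python
import numpy as np

# A simple made-up loss over theta = [a, b], used only for illustration.
def loss(theta):
    a, b = theta
    return (a - 3.0) ** 2 + (b + 1.0) ** 2

# Its gradient with respect to theta, derived by hand.
def grad(theta):
    a, b = theta
    return np.array([2.0 * (a - 3.0), 2.0 * (b + 1.0)])

theta = np.array([0.0, 0.0])     # initial parameters
u = np.array([1.0, -1.0])        # an arbitrary direction of change
eta = 0.01                       # small learning rate

true_value = loss(theta + eta * u)
taylor_value = loss(theta) + eta * u @ grad(theta)

print(true_value, taylor_value)  # the two values agree closely for small eta
```

With $\eta = 0.01$ the exact loss is $9.9202$ and the first-order estimate is $9.92$; the smaller the step, the closer the two get, which is exactly why the higher-order terms can be dropped.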
$\nabla_\theta \mathcal{L}(\theta)$ is the gradient (derivative) of the loss function $\mathcal{L}(\theta)$ with respect to $\theta$. The gradient with respect to a vector is the vector of partial derivatives with respect to each element in it. For example, if we have a function $\mathcal{L}$ based on $\theta = [a, b]$, the gradient of this would be:

$$\nabla_\theta \mathcal{L} = \begin{bmatrix} \dfrac{\partial \mathcal{L}}{\partial a} \\[6pt] \dfrac{\partial \mathcal{L}}{\partial b} \end{bmatrix}$$
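A quick sketch of this idea, using a hypothetical two-variable function (the function and its hand-derived partials are assumptions for illustration), compares the analytic gradient with a finite-difference estimate:

```python
import numpy as np

# Example function of theta = [a, b]; the gradient stacks one
# partial derivative per element of theta.
def f(theta):
    a, b = theta
    return a ** 2 * b + np.sin(b)

def analytic_grad(theta):
    a, b = theta
    return np.array([2 * a * b,             # df/da
                     a ** 2 + np.cos(b)])   # df/db

def numerical_grad(theta, eps=1e-6):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

theta = np.array([1.5, -0.5])
print(analytic_grad(theta))    # both print (approximately) the same vector
print(numerical_grad(theta))
```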
The second-order gradient $\nabla^2_\theta \mathcal{L}(\theta)$ is the gradient of the gradient. Speaking generally, the derivative of a vector with respect to another vector considers every possible combination of elements between the two vectors and computes the partial derivative for each pair. So, if we have a vector with $m$ elements and we take its derivative with respect to another vector with $n$ elements, the result would be a matrix of shape $m \times n$.
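A small sketch of the shape claim, assuming a toy vector-valued function (the function itself is made up) and finite differences:

```python
import numpy as np

# A vector-valued function: maps a 2-element vector x to a 3-element vector y.
def y(x):
    return np.array([x[0] * x[1],
                     x[0] ** 2,
                     np.sin(x[1])])

# Derivative of y (m = 3 elements) with respect to x (n = 2 elements):
# one partial derivative per (element of y, element of x) pair.
def jacobian(x, eps=1e-6):
    m, n = len(y(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (y(x + step) - y(x - step)) / (2 * eps)
    return J

x = np.array([2.0, 0.5])
print(jacobian(x).shape)   # (3, 2), i.e. m x n
```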
Going back to the approximation of $\mathcal{L}(\theta + \eta u)$ based on a small change to $\theta$: since $\eta$ is small, the second and higher-order terms can be ignored, leaving $\mathcal{L}(\theta + \eta u) \approx \mathcal{L}(\theta) + \eta \, u^T \nabla_\theta \mathcal{L}(\theta)$. The change made to $\theta$ would be favourable only if the new loss were less than the original loss:

$$\mathcal{L}(\theta + \eta u) - \mathcal{L}(\theta) = \eta \, u^T \nabla_\theta \mathcal{L}(\theta) < 0$$
For the loss to reduce as much as possible, we want the above quantity to be as negative as possible. If $\beta$ is the angle between $u$ and $\nabla_\theta \mathcal{L}(\theta)$:

$$\cos\beta = \frac{u^T \nabla_\theta \mathcal{L}(\theta)}{\|u\| \, \|\nabla_\theta \mathcal{L}(\theta)\|}, \qquad -1 \le \cos\beta \le 1$$
Multiplying all sides with $k = \|u\| \, \|\nabla_\theta \mathcal{L}(\theta)\|$:

$$-k \;\le\; k \cos\beta = u^T \nabla_\theta \mathcal{L}(\theta) \;\le\; k$$
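The inequality is easy to sanity-check numerically; in this sketch both the gradient vector and the random directions $u$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

g = np.array([-6.0, 2.0])                 # a made-up gradient vector
for _ in range(5):
    u = rng.normal(size=2)                # a random direction of change
    k = np.linalg.norm(u) * np.linalg.norm(g)
    dot = u @ g                           # equals k * cos(beta)
    assert -k <= dot <= k                 # the bound always holds
    print(f"{-k:8.3f} <= {dot:8.3f} <= {k:8.3f}")
```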
Thus, the difference between the new loss and the old one is the most negative when $\cos\beta = -1$, i.e. when $\beta = 180°$. The angle between the gradient vector and the delta vector $u$, which represents the change in $\theta$, should be $180°$, which means that $\theta$ should move in the direction opposite to the gradient.
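Putting it all together, here is a sketch of the resulting update rule on the same made-up quadratic loss used in the earlier snippet: at every step $\theta$ moves opposite to the gradient, and the loss keeps falling.

```python
import numpy as np

def loss(theta):
    a, b = theta
    return (a - 3.0) ** 2 + (b + 1.0) ** 2

def grad(theta):
    a, b = theta
    return np.array([2.0 * (a - 3.0), 2.0 * (b + 1.0)])

theta = np.array([0.0, 0.0])
eta = 0.1

for step in range(20):
    theta = theta - eta * grad(theta)   # move opposite to the gradient
    print(step, loss(theta))            # the loss decreases at every step

print(theta)                            # approaches the minimiser [3, -1]
```

For this particular loss the minimiser is at $[3, -1]$, and the iterates converge straight to it; the same update rule is what we will apply to the actual loss over $a$ and $b$.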