Going back to the original question, we’d like to find $a$ and $b$ such that the loss is minimal. If we vectorize these 2 quantities and initialize them as $\theta = [a, b]$, how can we change $\theta$ (by some $\Delta\theta$) such that the loss function reduces with the new $\theta$? The answer lies in the Taylor Series. Further, since the Taylor Series works only within small intervals, we can moderate the change in $\theta$ using a learning rate $\eta$ such that $\theta_{new} = \theta + \eta \cdot \Delta\theta$.
For ease of notation, if we take $\Delta\theta = u$, we have (from the Taylor Series):

$$\mathcal{L}(\theta + \eta u) = \mathcal{L}(\theta) + \eta \, u^T \nabla_\theta \mathcal{L}(\theta) + \frac{\eta^2}{2!} \, u^T \nabla^2_\theta \mathcal{L}(\theta) \, u + \dots$$
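To make this concrete, here is a small numerical sketch (the quadratic loss, the starting point and the direction $u$ are made up for illustration, not taken from the discussion above) that checks how close the first-order Taylor approximation gets to the true loss after a small step $\eta u$:

```python
import numpy as np

# A simple made-up loss over theta = [a, b], used only for illustration.
def loss(theta):
    a, b = theta
    return (a - 3.0) ** 2 + (b + 1.0) ** 2

# Its gradient with respect to theta, derived by hand.
def grad(theta):
    a, b = theta
    return np.array([2.0 * (a - 3.0), 2.0 * (b + 1.0)])

theta = np.array([0.0, 0.0])     # initial parameters
u = np.array([1.0, -1.0])        # an arbitrary direction of change
eta = 0.01                       # small learning rate

true_value = loss(theta + eta * u)
taylor_value = loss(theta) + eta * u @ grad(theta)

print(true_value, taylor_value)  # the two values agree closely for small eta
```

With $\eta = 0.01$ the exact loss is $9.9202$ and the first-order estimate is $9.92$; the smaller the step, the closer the two get, which is exactly why the higher-order terms can be dropped.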
$\nabla_\theta \mathcal{L}(\theta)$ is the gradient (derivative) of the loss function $\mathcal{L}(\theta)$ with respect to $\theta$. The gradient with respect to a vector is the vector of partial derivatives with respect to each element in it. For example, if we have a function $\mathcal{L}$ based on $\theta = [a, b]$, the gradient of this would be:

$$\nabla_\theta \mathcal{L} = \begin{bmatrix} \dfrac{\partial \mathcal{L}}{\partial a} \\[6pt] \dfrac{\partial \mathcal{L}}{\partial b} \end{bmatrix}$$
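A quick sketch of this idea, using a hypothetical two-variable function (the function and its hand-derived partials are assumptions for illustration), compares the analytic gradient with a finite-difference estimate:

```python
import numpy as np

# Example function of theta = [a, b]; the gradient stacks one
# partial derivative per element of theta.
def f(theta):
    a, b = theta
    return a ** 2 * b + np.sin(b)

def analytic_grad(theta):
    a, b = theta
    return np.array([2 * a * b,             # df/da
                     a ** 2 + np.cos(b)])   # df/db

def numerical_grad(theta, eps=1e-6):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

theta = np.array([1.5, -0.5])
print(analytic_grad(theta))    # both print (approximately) the same vector
print(numerical_grad(theta))
```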
The second-order gradient $\nabla^2_\theta \mathcal{L}(\theta)$ is the gradient of the gradient. Speaking generally, the derivative of a vector with respect to another vector considers every possible combination of elements between the two vectors and computes the partial derivative for each pair. So, if we have a vector with $m$ elements and we take its derivative with respect to another vector with $n$ elements, the result would be a matrix of shape $m \times n$.
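A small sketch of the shape claim, assuming a toy vector-valued function (the function itself is made up) and finite differences:

```python
import numpy as np

# A vector-valued function: maps a 2-element vector x to a 3-element vector y.
def y(x):
    return np.array([x[0] * x[1],
                     x[0] ** 2,
                     np.sin(x[1])])

# Derivative of y (m = 3 elements) with respect to x (n = 2 elements):
# one partial derivative per (element of y, element of x) pair.
def jacobian(x, eps=1e-6):
    m, n = len(y(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (y(x + step) - y(x - step)) / (2 * eps)
    return J

x = np.array([2.0, 0.5])
print(jacobian(x).shape)   # (3, 2), i.e. m x n
```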
Going back to the approximation of $\mathcal{L}(\theta + \eta u)$ based on a small change to $\theta$: since $\eta$ is small, the second and higher-order terms can be ignored, leaving $\mathcal{L}(\theta + \eta u) \approx \mathcal{L}(\theta) + \eta \, u^T \nabla_\theta \mathcal{L}(\theta)$. The change made to $\theta$ would be favourable only if the new loss were less than the original loss:

$$\mathcal{L}(\theta + \eta u) - \mathcal{L}(\theta) = \eta \, u^T \nabla_\theta \mathcal{L}(\theta) < 0$$
For the loss to reduce as much as possible, we want the above quantity to be as negative as possible. If $\beta$ is the angle between $u$ and $\nabla_\theta \mathcal{L}(\theta)$:

$$\cos\beta = \frac{u^T \nabla_\theta \mathcal{L}(\theta)}{\|u\| \, \|\nabla_\theta \mathcal{L}(\theta)\|}, \qquad -1 \le \cos\beta \le 1$$
Multiplying all sides with $k = \|u\| \, \|\nabla_\theta \mathcal{L}(\theta)\|$:

$$-k \;\le\; k \cos\beta = u^T \nabla_\theta \mathcal{L}(\theta) \;\le\; k$$
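The inequality is easy to sanity-check numerically; in this sketch both the gradient vector and the random directions $u$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

g = np.array([-6.0, 2.0])                 # a made-up gradient vector
for _ in range(5):
    u = rng.normal(size=2)                # a random direction of change
    k = np.linalg.norm(u) * np.linalg.norm(g)
    dot = u @ g                           # equals k * cos(beta)
    assert -k <= dot <= k                 # the bound always holds
    print(f"{-k:8.3f} <= {dot:8.3f} <= {k:8.3f}")
```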
Thus, the difference between the new loss and the old one is the most negative when $\cos\beta = -1$, i.e. when $\beta = 180°$. The angle between the gradient vector and the delta vector $u$, which represents the change in $\theta$, should be $180°$, which means that $\theta$ should move in the direction opposite to the gradient.
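Putting it all together, here is a sketch of the resulting update rule on the same made-up quadratic loss used in the earlier snippet: at every step $\theta$ moves opposite to the gradient, and the loss keeps falling.

```python
import numpy as np

def loss(theta):
    a, b = theta
    return (a - 3.0) ** 2 + (b + 1.0) ** 2

def grad(theta):
    a, b = theta
    return np.array([2.0 * (a - 3.0), 2.0 * (b + 1.0)])

theta = np.array([0.0, 0.0])
eta = 0.1

for step in range(20):
    theta = theta - eta * grad(theta)   # move opposite to the gradient
    print(step, loss(theta))            # the loss decreases at every step

print(theta)                            # approaches the minimiser [3, -1]
```

For this particular loss the minimiser is at $[3, -1]$, and the iterates converge straight to it; the same update rule is what we will apply to the actual loss over $a$ and $b$.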