After examining the roles that the output layer and hidden layers play in the performance of a model, we now turn to the parameters (i.e., weights and biases) and look at their culpability.
Let’s look at the weights first:

$$\frac{\partial L}{\partial w^l_{jk}} = \frac{\partial L}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial w^l_{jk}}$$

On the RHS, we need to compute two partial derivatives. We know the result of the first one, $\frac{\partial L}{\partial z^l_j}$, from the previous section. As for the second part, remember that $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$. If we differentiate this with respect to the weight $w^l_{jk}$, we’re left with just the activation output from the previous layer:

$$\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k \quad\Rightarrow\quad \frac{\partial L}{\partial w^l_{jk}} = \frac{\partial L}{\partial z^l_j} \, a^{l-1}_k$$
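As a quick sanity check, we can confirm $\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k$ numerically with a finite difference. This is a minimal sketch; the names `W`, `b`, and `a_prev` are just placeholders for one layer’s parameters and inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# One layer: 3 neurons fed by 2 activations from the previous layer.
W = rng.normal(size=(3, 2))        # weight matrix of layer l
b = rng.normal(size=3)             # bias vector of layer l
a_prev = rng.normal(size=2)        # activation output of layer l-1

def z(W, b):
    """Pre-activation of layer l: z = W a_prev + b."""
    return W @ a_prev + b

# Nudge a single weight w_jk (here j=1, k=0) and measure how z_j moves.
eps = 1e-6
W_plus = W.copy()
W_plus[1, 0] += eps
numeric = (z(W_plus, b)[1] - z(W, b)[1]) / eps

print(numeric)      # ~ the same value as...
print(a_prev[0])    # ...dz_j/dw_jk = a_prev_k
```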
Note that each neuron in layer $l$ has a weight connecting it to every neuron in layer $l-1$; i.e., the weights for a particular layer form a matrix, not a vector. This matrix would look like:

$$W^l = \begin{bmatrix} w^l_{11} & w^l_{12} & \cdots & w^l_{1n} \\ w^l_{21} & w^l_{22} & \cdots & w^l_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w^l_{m1} & w^l_{m2} & \cdots & w^l_{mn} \end{bmatrix}$$

where $m$ is the number of neurons in layer $l$ and $n$ is the number of neurons in layer $l-1$.
Let’s assume, as a concrete example, that layer $l$ has 3 neurons and layer $l-1$ has 2. In that case, the weight matrix would be:

$$W^l = \begin{bmatrix} w^l_{11} & w^l_{12} \\ w^l_{21} & w^l_{22} \\ w^l_{31} & w^l_{32} \end{bmatrix}$$

Applying the formula derived above to this matrix, we get:

$$\frac{\partial L}{\partial W^l} = \begin{bmatrix} \frac{\partial L}{\partial z^l_1} a^{l-1}_1 & \frac{\partial L}{\partial z^l_1} a^{l-1}_2 \\ \frac{\partial L}{\partial z^l_2} a^{l-1}_1 & \frac{\partial L}{\partial z^l_2} a^{l-1}_2 \\ \frac{\partial L}{\partial z^l_3} a^{l-1}_1 & \frac{\partial L}{\partial z^l_3} a^{l-1}_2 \end{bmatrix}$$
This matrix is just the outer product of the gradient of the loss function with respect to $z^l$ and the activation output of the previous layer:

$$\frac{\partial L}{\partial W^l} = \frac{\partial L}{\partial z^l} \left(a^{l-1}\right)^{T}$$
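For instance, plugging made-up numbers into the same 3-by-2 example, the whole gradient matrix is a single call to NumPy’s `np.outer` (a minimal sketch; the variable names are placeholders):

```python
import numpy as np

dL_dz = np.array([0.1, -0.3, 0.5])   # dL/dz^l for 3 neurons (made-up values)
a_prev = np.array([2.0, -1.0])       # activation output of layer l-1 (2 neurons)

# Entry (j, k) of the outer product is dL/dz_j * a_prev_k, i.e. dL/dw_jk.
dL_dW = np.outer(dL_dz, a_prev)
print(dL_dW.shape)   # (3, 2) -- same shape as the weight matrix
print(dL_dW)
```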
Looking at the bias, remember once again that $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, and differentiating this with respect to the bias gives 1. This means that if we differentiate the loss function with respect to the bias, we get:

$$\frac{\partial L}{\partial b^l_j} = \frac{\partial L}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial b^l_j} = \frac{\partial L}{\partial z^l_j} \cdot 1 = \frac{\partial L}{\partial z^l_j}$$
Hence, the gradient vector would be:

$$\frac{\partial L}{\partial b^l} = \frac{\partial L}{\partial z^l}$$
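Putting the two results together, one layer’s parameter gradients reduce to an outer product and a copy. Here is a minimal sketch of such a helper (the function name `layer_param_grads` is purely illustrative):

```python
import numpy as np

def layer_param_grads(dL_dz, a_prev):
    """Gradients of the loss w.r.t. one layer's weights and biases.

    dL_dz  : gradient of the loss w.r.t. the layer's pre-activations z^l
    a_prev : activation output of the previous layer, a^(l-1)
    """
    dL_dW = np.outer(dL_dz, a_prev)   # dL/dW = (dL/dz) (a_prev)^T
    dL_db = dL_dz.copy()              # dL/db = dL/dz
    return dL_dW, dL_db

dW, db = layer_param_grads(np.array([0.1, -0.3, 0.5]), np.array([2.0, -1.0]))
print(dW.shape, db.shape)             # (3, 2) and (3,)
```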