After examining the roles that the output layer and hidden layers play in the performance of a model, we now turn to the parameters (i.e., weights and biases) and look at their culpability.
Let’s look at the weights first:

$$\frac{\partial L}{\partial w^l_{jk}} = \frac{\partial L}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial w^l_{jk}}$$

On the RHS, we need to compute two partial derivatives. We know the result of the first one, $\frac{\partial L}{\partial z^l_j}$, from the previous section. As for the second part, remember that $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$. If we differentiate this with respect to the weight $w^l_{jk}$, we’re left with just the activation output from the previous layer:

$$\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k \quad\Rightarrow\quad \frac{\partial L}{\partial w^l_{jk}} = \frac{\partial L}{\partial z^l_j} \, a^{l-1}_k$$
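As a quick sanity check, we can confirm $\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k$ numerically with a finite difference. This is a minimal sketch; the names `W`, `b`, and `a_prev` are just placeholders for one layer’s parameters and inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# One layer: 3 neurons fed by 2 activations from the previous layer.
W = rng.normal(size=(3, 2))        # weight matrix of layer l
b = rng.normal(size=3)             # bias vector of layer l
a_prev = rng.normal(size=2)        # activation output of layer l-1

def z(W, b):
    """Pre-activation of layer l: z = W a_prev + b."""
    return W @ a_prev + b

# Nudge a single weight w_jk (here j=1, k=0) and measure how z_j moves.
eps = 1e-6
W_plus = W.copy()
W_plus[1, 0] += eps
numeric = (z(W_plus, b)[1] - z(W, b)[1]) / eps

print(numeric)      # ~ the same value as...
print(a_prev[0])    # ...dz_j/dw_jk = a_prev_k
```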
Note that each neuron in layer $l$ has a weight connecting it to every neuron in layer $l-1$; i.e., the weights for a particular layer form a matrix, not a vector. This matrix would look like:

$$W^l = \begin{bmatrix} w^l_{11} & w^l_{12} & \cdots & w^l_{1n} \\ w^l_{21} & w^l_{22} & \cdots & w^l_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w^l_{m1} & w^l_{m2} & \cdots & w^l_{mn} \end{bmatrix}$$

where $m$ is the number of neurons in layer $l$ and $n$ is the number of neurons in layer $l-1$.
Let’s assume, as a concrete example, that layer $l$ has 3 neurons and layer $l-1$ has 2. In that case, the weight matrix would be:

$$W^l = \begin{bmatrix} w^l_{11} & w^l_{12} \\ w^l_{21} & w^l_{22} \\ w^l_{31} & w^l_{32} \end{bmatrix}$$

Applying the formula derived above to this matrix, we get:

$$\frac{\partial L}{\partial W^l} = \begin{bmatrix} \frac{\partial L}{\partial z^l_1} a^{l-1}_1 & \frac{\partial L}{\partial z^l_1} a^{l-1}_2 \\ \frac{\partial L}{\partial z^l_2} a^{l-1}_1 & \frac{\partial L}{\partial z^l_2} a^{l-1}_2 \\ \frac{\partial L}{\partial z^l_3} a^{l-1}_1 & \frac{\partial L}{\partial z^l_3} a^{l-1}_2 \end{bmatrix}$$
This matrix is just the outer product of the gradient of the loss function with respect to $z^l$ and the activation output of the previous layer:

$$\frac{\partial L}{\partial W^l} = \frac{\partial L}{\partial z^l} \left(a^{l-1}\right)^{T}$$
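For instance, plugging made-up numbers into the same 3-by-2 example, the whole gradient matrix is a single call to NumPy’s `np.outer` (a minimal sketch; the variable names are placeholders):

```python
import numpy as np

dL_dz = np.array([0.1, -0.3, 0.5])   # dL/dz^l for 3 neurons (made-up values)
a_prev = np.array([2.0, -1.0])       # activation output of layer l-1 (2 neurons)

# Entry (j, k) of the outer product is dL/dz_j * a_prev_k, i.e. dL/dw_jk.
dL_dW = np.outer(dL_dz, a_prev)
print(dL_dW.shape)   # (3, 2) -- same shape as the weight matrix
print(dL_dW)
```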
Looking at the bias, remember once again that $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, and differentiating this with respect to the bias gives 1. This means that if we differentiate the loss function with respect to the bias, we get:

$$\frac{\partial L}{\partial b^l_j} = \frac{\partial L}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial b^l_j} = \frac{\partial L}{\partial z^l_j} \cdot 1 = \frac{\partial L}{\partial z^l_j}$$
Hence, the gradient vector would be:

$$\frac{\partial L}{\partial b^l} = \frac{\partial L}{\partial z^l}$$
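Putting the two results together, one layer’s parameter gradients reduce to an outer product and a copy. Here is a minimal sketch of such a helper (the function name `layer_param_grads` is purely illustrative):

```python
import numpy as np

def layer_param_grads(dL_dz, a_prev):
    """Gradients of the loss w.r.t. one layer's weights and biases.

    dL_dz  : gradient of the loss w.r.t. the layer's pre-activations z^l
    a_prev : activation output of the previous layer, a^(l-1)
    """
    dL_dW = np.outer(dL_dz, a_prev)   # dL/dW = (dL/dz) (a_prev)^T
    dL_db = dL_dz.copy()              # dL/db = dL/dz
    return dL_dW, dL_db

dW, db = layer_param_grads(np.array([0.1, -0.3, 0.5]), np.array([2.0, -1.0]))
print(dW.shape, db.shape)             # (3, 2) and (3,)
```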