Now that we’ve computed the gradient of the loss function with respect to the output layer, we can move on to the hidden layers. Since there may be several of them, we’d like to derive a single formula that works for every hidden layer.

Let’s say we’d like to compute the gradient of the loss function with respect to one neuron in the $l^{\text{th}}$ layer of the network, which is neither the input layer nor the output layer.

What we’d like to compute is:

$$\frac{\partial C}{\partial a^{(l)}_j}$$

where $C$ is the loss function and $a^{(l)}_j$ is the activation of the $j^{\text{th}}$ neuron in layer $l$.

Generally, the derivative of a function $f$, which can be written as a function of some intermediate function $g$, is:

$$\frac{df}{dx} = \sum_i \frac{\partial f}{\partial g_i}\,\frac{dg_i}{dx}$$

  • Where $g_i$ represents each computation of the intermediate function.
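To see this rule in action, here is a minimal NumPy sketch. The choices $g_1(x) = x^2$, $g_2(x) = \sin x$ and $f(g_1, g_2) = g_1 g_2$ are arbitrary toy functions (not anything from the network); the sketch just compares the chain-rule sum against a finite-difference estimate.

```python
import numpy as np

# Toy check of the multivariable chain rule: f depends on x only
# through the intermediates g_1(x) = x**2 and g_2(x) = sin(x).
def g(x):
    return np.array([x**2, np.sin(x)])

def f(g_vals):
    return g_vals[0] * g_vals[1]          # f(g_1, g_2) = g_1 * g_2

x = 1.3
g_vals = g(x)

# Chain rule: df/dx = sum_i (df/dg_i) * (dg_i/dx)
df_dg = np.array([g_vals[1], g_vals[0]])  # [df/dg_1, df/dg_2]
dg_dx = np.array([2 * x, np.cos(x)])      # [dg_1/dx, dg_2/dx]
chain_rule = np.dot(df_dg, dg_dx)

# Finite-difference approximation of df/dx for comparison
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(chain_rule, numeric)                # the two values agree closely
```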

In our case, the hidden layers determine the output of the layers above them, which in turn determine the loss function. So the hidden layers play the role of $x$, the layers above them are generalised as $g$, and the loss function is $f$. Applying the above logic to this context:

$$\frac{\partial C}{\partial a^{(l)}_j} = \sum_k \frac{\partial C}{\partial z^{(l+1)}_k}\,\frac{\partial z^{(l+1)}_k}{\partial a^{(l)}_j} = \sum_k \frac{\partial C}{\partial z^{(l+1)}_k}\,w^{(l+1)}_{kj}$$

The last step follows because $z^{(l+1)}_k = \sum_j w^{(l+1)}_{kj} a^{(l)}_j + b^{(l+1)}_k$, so $\frac{\partial z^{(l+1)}_k}{\partial a^{(l)}_j} = w^{(l+1)}_{kj}$.

  • Here $l$ refers to the layer being looked at, $k$ refers to each neuron in the $(l+1)^{\text{th}}$ layer and $j$ is the neuron in the $l^{\text{th}}$ layer whose gradient we’re trying to compute.
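As a concrete (if toy) illustration, the sketch below uses made-up layer sizes and random placeholder values for $W^{(l+1)}$ and $\frac{\partial C}{\partial z^{(l+1)}}$, and evaluates the sum above for a single neuron $j$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 neurons in layer l, 3 neurons in layer l+1.
W_next = rng.standard_normal((3, 4))   # W^(l+1); row k feeds neuron k of layer l+1
dC_dz_next = rng.standard_normal(3)    # dC/dz^(l+1), taken as already known here

j = 1                                  # the neuron in layer l we care about
dC_da_j = sum(W_next[k, j] * dC_dz_next[k] for k in range(W_next.shape[0]))
print(dC_da_j)
```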

Now consider two vectors:

$$W^{(l+1)}_{*,j} = \begin{bmatrix} w^{(l+1)}_{1j} \\ w^{(l+1)}_{2j} \\ \vdots \\ w^{(l+1)}_{nj} \end{bmatrix} \qquad \text{and} \qquad \frac{\partial C}{\partial z^{(l+1)}} = \begin{bmatrix} \frac{\partial C}{\partial z^{(l+1)}_1} \\ \frac{\partial C}{\partial z^{(l+1)}_2} \\ \vdots \\ \frac{\partial C}{\partial z^{(l+1)}_n} \end{bmatrix}$$

where $n$ is the number of neurons in the $(l+1)^{\text{th}}$ layer.

  • The $*$ means that we take all rows in the $j^{\text{th}}$ column of $W^{(l+1)}$.

Taking the dot product of the two vectors, we get:

$$W^{(l+1)}_{*,j} \cdot \frac{\partial C}{\partial z^{(l+1)}} = \sum_k w^{(l+1)}_{kj}\,\frac{\partial C}{\partial z^{(l+1)}_k}$$

This is equal to the derivative of the loss function with respect to one neuron in the hidden layer. Hence:

$$\frac{\partial C}{\partial a^{(l)}_j} = W^{(l+1)}_{*,j} \cdot \frac{\partial C}{\partial z^{(l+1)}}$$
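Continuing the same toy setup, the column slice `W_next[:, j]` plays the role of $W^{(l+1)}_{*,j}$, and the dot product reproduces the explicit sum:

```python
import numpy as np

rng = np.random.default_rng(0)
W_next = rng.standard_normal((3, 4))   # W^(l+1)
dC_dz_next = rng.standard_normal(3)    # dC/dz^(l+1)
j = 1

# Explicit sum over the neurons k in layer l+1 ...
as_sum = sum(W_next[k, j] * dC_dz_next[k] for k in range(W_next.shape[0]))

# ... equals the dot product with the j-th column of W^(l+1).
as_dot = np.dot(W_next[:, j], dC_dz_next)

print(np.isclose(as_sum, as_dot))      # True
```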

We can apply this to the entire layer the neuron is in:

$$\frac{\partial C}{\partial a^{(l)}} = \left(W^{(l+1)}\right)^{T}\,\frac{\partial C}{\partial z^{(l+1)}}, \qquad 1 \le l < L$$

  • Where $L$ is the number of layers in the network.
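In code, the whole-layer formula is a single matrix–vector product. The sketch below keeps the same placeholder values and assumes the usual convention that row $k$ of $W^{(l+1)}$ holds the weights feeding neuron $k$ of layer $l+1$:

```python
import numpy as np

rng = np.random.default_rng(0)
W_next = rng.standard_normal((3, 4))   # W^(l+1): 4 neurons in layer l, 3 in layer l+1
dC_dz_next = rng.standard_normal(3)    # dC/dz^(l+1)

# Gradient w.r.t. every neuron in layer l at once: (W^(l+1))^T @ dC/dz^(l+1)
dC_da = W_next.T @ dC_dz_next
print(dC_da.shape)                     # (4,) -- one entry per neuron in layer l

# Each entry matches the per-neuron dot product derived earlier.
per_neuron = [np.dot(W_next[:, j], dC_dz_next) for j in range(W_next.shape[1])]
print(np.allclose(dC_da, per_neuron))  # True
```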

The problem here is that, except when layer $l+1$ is the output layer $L$, we don’t know how to calculate $\frac{\partial C}{\partial z^{(l+1)}}$. We need to be able to compute the gradient with respect to the next hidden layer (i.e., $\frac{\partial C}{\partial a^{(l+1)}}$) in order to compute the same for this one. As always, let’s start off by computing the gradient for a single neuron:

$$\frac{\partial C}{\partial z^{(l+1)}_j} = \frac{\partial C}{\partial a^{(l+1)}_j}\,\frac{\partial a^{(l+1)}_j}{\partial z^{(l+1)}_j} = \frac{\partial C}{\partial a^{(l+1)}_j}\,\sigma'\!\left(z^{(l+1)}_j\right)$$

since $a^{(l+1)}_j = \sigma\!\left(z^{(l+1)}_j\right)$, where $\sigma$ is the activation function.

Applying this to the entire vector, we have:

$$\frac{\partial C}{\partial z^{(l+1)}} = \frac{\partial C}{\partial a^{(l+1)}} \odot \sigma'\!\left(z^{(l+1)}\right)$$

This is an element-wise multiplication of the two vectors. Therefore:

$$\frac{\partial C}{\partial a^{(l)}} = \left(W^{(l+1)}\right)^{T}\left(\frac{\partial C}{\partial a^{(l+1)}} \odot \sigma'\!\left(z^{(l+1)}\right)\right)$$
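Putting the two pieces together in code: the sketch below assumes a sigmoid activation (any differentiable $\sigma$ would do) and keeps using random placeholder values for the quantities coming from the layer above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W_next = rng.standard_normal((3, 4))   # W^(l+1)
z_next = rng.standard_normal(3)        # z^(l+1)
dC_da_next = rng.standard_normal(3)    # dC/da^(l+1), taken as already known here

# dC/dz^(l+1) = dC/da^(l+1) * sigma'(z^(l+1))   (element-wise product)
dC_dz_next = dC_da_next * sigmoid_prime(z_next)

# ... which plugs straight into the whole-layer formula.
dC_da = W_next.T @ dC_dz_next
print(dC_da)
```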

Now that we can calculate the gradient with respect to the layer above a particular hidden layer, we can calculate it for the hidden layer as well. Note that this doesn’t create a circular dependency: to compute the gradient w.r.t. $a^{(l)}$ we need $a^{(l+1)}$’s gradient, and for $a^{(l+1)}$ we need $a^{(l+2)}$’s, and so on until we reach the output layer, whose gradient we have already computed.
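To make the non-circularity concrete, here is a minimal backward pass over a small made-up network. The layer sizes, the sigmoid activation and the squared-error output gradient ($\frac{\partial C}{\partial a^{(L)}} = a^{(L)} - y$) are assumptions for the sake of the sketch; the loop starts from the output layer, whose gradient is known, and walks down one layer at a time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 2]                        # made-up layer widths, input -> output
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])          # dummy target for the assumed loss

# Forward pass: store z and a for every layer.
a_vals, z_vals = [x], []
for W, b in zip(Ws, bs):
    z = W @ a_vals[-1] + b
    z_vals.append(z)
    a_vals.append(sigmoid(z))

# Backward pass: start from the output layer, then walk down layer by layer.
dC_da = a_vals[-1] - y                      # assumed squared-error loss gradient
for l in reversed(range(len(Ws))):
    dC_dz = dC_da * sigmoid_prime(z_vals[l])    # element-wise step
    dC_da = Ws[l].T @ dC_dz                     # gradient w.r.t. the layer below
    print(f"gradient w.r.t. layer {l} activations has shape {dC_da.shape}")
```

Each pass through the loop only uses gradients that were produced in the previous pass (or, on the first pass, the known output-layer gradient), which is exactly why the recursion terminates.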