In the process of investigating who is responsible for a model’s poor performance, the first step is to inspect the output layer and compute the derivative of the loss function with respect to it.
Before that, let’s consider the partial derivative of the loss function with respect to the output $\hat{y}_\ell$ of one of the output layer neurons:

$$\frac{\partial \mathcal{L}(\theta)}{\partial \hat{y}_\ell} = -\frac{1}{\hat{y}_\ell}$$

- Where $\ell$ refers to the one among the $k$ output neurons that corresponds to the true class label.
This can be rewritten using indicator notation, so that it holds for any output neuron $i$:

$$\frac{\partial \mathcal{L}(\theta)}{\partial \hat{y}_i} = -\frac{\mathbb{1}_{(i=\ell)}}{\hat{y}_\ell}$$
If we simply apply this to all the neurons in the output layer, we get the gradient vector with respect to $\hat{y}$:

$$\nabla_{\hat{y}} \mathcal{L}(\theta) = -\frac{1}{\hat{y}_\ell} e_\ell$$

- Where $e_\ell$ is a $k$-dimensional vector whose $\ell$-th element is 1 and the rest are 0.
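To sanity-check this expression, here is a minimal numeric sketch (assuming the loss is the cross-entropy $-\log \hat{y}_\ell$; the values of $k$, $\ell$, and $\hat{y}$ are made up for illustration) that compares the analytic gradient against a finite-difference estimate:

```python
import numpy as np

# Made-up example: k = 4 classes; the true class index ell = 2 (0-indexed)
k, ell = 4, 2
y_hat = np.array([0.1, 0.2, 0.6, 0.1])   # a softmax output (sums to 1)

# Cross-entropy loss: L(theta) = -log(y_hat[ell])
loss = lambda y: -np.log(y[ell])

# Analytic gradient w.r.t. y_hat: -(1 / y_hat[ell]) * e_ell
e_ell = np.zeros(k)
e_ell[ell] = 1.0
grad_analytic = -(1.0 / y_hat[ell]) * e_ell

# Central-difference estimate as a sanity check
eps = 1e-6
grad_numeric = np.array([
    (loss(y_hat + eps * np.eye(k)[i]) - loss(y_hat - eps * np.eye(k)[i])) / (2 * eps)
    for i in range(k)
])

print(grad_analytic)                             # -1/0.6 in the ell-th slot, zeros elsewhere
print(np.allclose(grad_analytic, grad_numeric))  # True
```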
What we’re actually interested in, though, is the gradient of the loss function with respect to the pre-activation of the output layer, $a_L$, since the final output $\hat{y}$ is obtained simply by applying the softmax function to $a_L$.
This ultimately works out to:

$$\frac{\partial \mathcal{L}(\theta)}{\partial a_{L,i}} = -\left(\mathbb{1}_{(i=\ell)} - \hat{y}_i\right)$$
- The derivation for this is explained in lecture 3.5 of week 3.
This gives us the partial derivative of the loss function with respect to the $i$-th element of $a_L$. Stacking these components, we can write the gradient vector of the loss function with respect to the pre-activation of the output layer:

$$\nabla_{a_L} \mathcal{L}(\theta) = -(e_\ell - \hat{y}) = \hat{y} - e_\ell$$
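As a quick check of this result, the sketch below (again with made-up values, and a standard softmax implementation rather than anything specific from the lectures) compares $\hat{y} - e_\ell$ against a finite-difference gradient of the composed softmax-plus-cross-entropy loss with respect to $a_L$:

```python
import numpy as np

# Made-up pre-activations of the output layer and a true class index (0-indexed)
a_L = np.array([1.0, -0.5, 2.0, 0.3])
ell = 2
k = a_L.size

def softmax(a):
    z = np.exp(a - a.max())           # shift by the max for numerical stability
    return z / z.sum()

def loss(a):
    return -np.log(softmax(a)[ell])   # cross-entropy against the true class ell

# Analytic gradient w.r.t. a_L: y_hat - e_ell
e_ell = np.zeros(k)
e_ell[ell] = 1.0
grad_analytic = softmax(a_L) - e_ell

# Central-difference estimate as a sanity check
eps = 1e-6
grad_numeric = np.array([
    (loss(a_L + eps * np.eye(k)[i]) - loss(a_L - eps * np.eye(k)[i])) / (2 * eps)
    for i in range(k)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))   # True
```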