Consider the last layer of a typical classifier, with logits \({\bf x} = (x_1, x_2, \ldots, x_n)\), a softmax activation yielding \(p_i = \exp(x_i)/\sum_j \exp(x_j)\), and a cross-entropy loss function
\(L = - \sum_i y_i \log p_i,\)
where \(y_1, y_2, \ldots, y_n\) are labels that form a probability distribution, that is, \(y_i \geq 0\) and \(\sum_i y_i = 1\).
Compute the partial derivatives \(\partial L/\partial x_i\) needed for backpropagation.
Answer 1 - Direct Approach:
Expand and simplify the formula for \(L\):
$$\begin{eqnarray*} L &=& - \sum_i y_i \log p_i\\ &=& - \sum_i y_i (\log \exp(x_i) - \log\sum_k \exp(x_k))\\ &=& - \sum_i y_i (x_i - \log\sum_k \exp(x_k))\\ &=& - \sum_i y_i x_i + \sum_i y_i \log\sum_k \exp(x_k)\\ &=& - \sum_i y_i x_i + \log\sum_k \exp(x_k), \end{eqnarray*}$$ where the last step uses \(\sum_i y_i = 1\).
Now compute the derivative:
$$\begin{eqnarray*} \frac{\partial L}{\partial x_j} &=& - y_j + \frac{\partial}{\partial x_j}\log \sum_k \exp(x_k)\\ &=& - y_j + \frac{\exp(x_j)}{\sum_k \exp(x_k)}\\ &=& - y_j + p_j. \end{eqnarray*}$$
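The result \(\partial L/\partial x_j = p_j - y_j\) can be sanity-checked numerically against a central finite difference. A minimal sketch in NumPy (the logits and labels below are arbitrary illustrative values):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(x, y):
    return -np.sum(y * np.log(softmax(x)))

x = np.array([1.0, -0.5, 2.0])   # arbitrary logits
y = np.array([0.1, 0.2, 0.7])    # labels forming a probability distribution

analytic = softmax(x) - y        # the derived gradient: p - y

# Central finite differences for comparison.
eps = 1e-6
numeric = np.zeros_like(x)
for j in range(len(x)):
    d = np.zeros_like(x)
    d[j] = eps
    numeric[j] = (cross_entropy(x + d, y) - cross_entropy(x - d, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```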
Answer 2 - Chain Rule:
The chain rule gives us:
$$\begin{eqnarray*} \frac{\partial L}{\partial x_j} &=& \sum_i \frac{\partial L}{\partial p_i}\frac{\partial p_i}{\partial x_j} \end{eqnarray*}$$
Computing each factor in the product:
$$\begin{eqnarray*} \frac{\partial L}{\partial p_i} &=& - \frac{y_i}{p_i}.\\ \frac{\partial p_i}{\partial x_j} &=& \frac{\partial}{\partial x_j}\left( \frac{\exp(x_i)}{\sum_k \exp(x_k)}\right)\\ &=& \frac{\delta_{ij}\exp(x_i)\sum_k\exp(x_k) - \exp(x_i)\exp(x_j)} {(\sum_k\exp(x_k))^2}\\ &=& \delta_{ij}p_i - p_i p_j, \end{eqnarray*}$$ where \(\delta_{ij}\) is 1 if \(i = j\) and 0 otherwise.
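The softmax Jacobian \(\partial p_i/\partial x_j = \delta_{ij}p_i - p_i p_j\) has the compact matrix form \(\mathrm{diag}(p) - p p^\top\), which is easy to verify numerically. A quick check (arbitrary logits; NumPy assumed):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([0.5, 1.5, -1.0])   # arbitrary logits
p = softmax(x)

# Analytic Jacobian: diag(p) - p p^T, i.e. delta_ij * p_i - p_i * p_j.
analytic = np.diag(p) - np.outer(p, p)

# Finite-difference Jacobian, built one column at a time.
eps = 1e-6
numeric = np.zeros((len(x), len(x)))
for j in range(len(x)):
    d = np.zeros_like(x)
    d[j] = eps
    numeric[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```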
Combining previous computations:
$$\begin{eqnarray*} \frac{\partial L}{\partial x_j} &=& \sum_i \left(-\frac{y_i}{p_i}\right)(\delta_{ij}p_i - p_i p_j)\\ &=& - \sum_i y_i\delta_{ij} + p_j\sum_i y_i\\ &=& - y_j + p_j, \end{eqnarray*}$$ where the last equality again uses \(\sum_i y_i = 1\).
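In vector form, the chain-rule assembly above is the matrix-vector product \(\nabla_{\bf x} L = J^\top \nabla_p L\) with \(J = \mathrm{diag}(p) - p p^\top\), and it reproduces the direct result \(p - y\). A sketch with arbitrary values (NumPy assumed):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1.0, -0.5, 2.0])   # arbitrary logits
y = np.array([0.1, 0.2, 0.7])    # labels forming a probability distribution
p = softmax(x)

dL_dp = -y / p                    # from dL/dp_i = -y_i / p_i
J = np.diag(p) - np.outer(p, p)   # from dp_i/dx_j = delta_ij p_i - p_i p_j

# Chain rule: dL/dx_j = sum_i (dL/dp_i)(dp_i/dx_j).
grad = J.T @ dL_dp

print(np.max(np.abs(grad - (p - y))))  # matches the direct answer p - y
```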