1. Consider the last layer of a typical classifier, with logits \({\bf x} = (x_1, x_2, \ldots, x_n)\), a softmax activation yielding \(p_i = \exp(x_i)/\sum_j \exp(x_j)\), and a cross-entropy loss function

    \(L = - \sum_i y_i \log p_i,\)

    where \(y_1, y_2, \ldots, y_n\) are labels that correspond to a probability distribution, that is, \(y_i \geq 0\) and \(\sum_i y_i = 1\).

    Compute the partial derivatives \(\partial L/\partial x_i\) needed for backpropagation.
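    The setup can be reproduced numerically. Below is a minimal sketch (assuming NumPy; the helper names `softmax` and `cross_entropy` and the example values are illustrative, not part of the problem statement):

    ```python
    import numpy as np

    def softmax(x):
        # Subtract the max logit for numerical stability; the p_i are unchanged.
        z = np.exp(x - np.max(x))
        return z / z.sum()

    def cross_entropy(y, p):
        # L = -sum_i y_i log p_i
        return -np.sum(y * np.log(p))

    x = np.array([2.0, 1.0, 0.1])   # logits (example values)
    y = np.array([1.0, 0.0, 0.0])   # a one-hot label distribution
    p = softmax(x)
    L = cross_entropy(y, p)
    ```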

  2. Answer 1 - Direct Approach:

    Expand and simplify the formula for \(L\):

    $$\begin{eqnarray*} L &=& - \sum_i y_i \log p_i\\ &=& - \sum_i y_i (\log \exp(x_i) - \log\sum_k \exp(x_k))\\ &=& - \sum_i y_i (x_i - \log\sum_k \exp(x_k))\\ &=& - \sum_i y_i x_i + \sum_i y_i \log\sum_k \exp(x_k)\\ &=& - \sum_i y_i x_i + \log\sum_k \exp(x_k), \end{eqnarray*}$$ where the last step uses \(\sum_i y_i = 1\).

    Now compute the derivative:

    $$\begin{eqnarray*} \frac{\partial L}{\partial x_j} &=& - y_j + \frac{\partial}{\partial x_j}\log \sum_k \exp(x_k)\\ &=& - y_j + \frac{\exp(x_j)}{\sum_k \exp(x_k)}\\ &=& - y_j + p_j. \end{eqnarray*}$$
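    The result \(\partial L/\partial x_j = p_j - y_j\) can be checked against central finite differences. Here is a short sketch (assuming NumPy; the example logits and labels are arbitrary choices, and \(y\) need not be one-hot):

    ```python
    import numpy as np

    def softmax(x):
        z = np.exp(x - np.max(x))
        return z / z.sum()

    def loss(x, y):
        return -np.sum(y * np.log(softmax(x)))

    x = np.array([2.0, 1.0, 0.1])
    y = np.array([0.7, 0.2, 0.1])        # any distribution summing to 1

    analytic = softmax(x) - y            # p_j - y_j, as derived above

    # Independent check via central finite differences.
    eps = 1e-6
    numeric = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        numeric[j] = (loss(x + e, y) - loss(x - e, y)) / (2 * eps)

    assert np.allclose(analytic, numeric, atol=1e-6)
    ```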

    Answer 2 - Chain Rule:

    The chain rule gives us:

    $$\begin{eqnarray*} \frac{\partial L}{\partial x_j} &=& \sum_i \frac{\partial L}{\partial p_i}\frac{\partial p_i}{\partial x_j} \end{eqnarray*}$$

    Computing each factor of the product:

    $$\begin{eqnarray*} \frac{\partial L}{\partial p_i} &=& - \frac{y_i}{p_i}.\\ \frac{\partial p_i}{\partial x_j} &=& \frac{\partial}{\partial x_j}\left( \frac{\exp(x_i)}{\sum_k \exp(x_k)}\right)\\ &=& \frac{\delta_{ij}\exp(x_i)\sum_k\exp(x_k) - \exp(x_i)\exp(x_j)} {(\sum_k\exp(x_k))^2}\\ &=& \delta_{ij}p_i - p_i p_j, \end{eqnarray*}$$ where \(\delta_{ij}\) is 1 if \(i = j\) and 0 otherwise.

    Combining the previous computations:

    $$\begin{eqnarray*} \frac{\partial L}{\partial x_j} &=& \sum_i \left(-\frac{y_i}{p_i}\right)(\delta_{ij}p_i - p_i p_j)\\ &=& - \sum_i y_i\delta_{ij} + \sum_i y_i p_j\\ &=& - y_j + p_j, \end{eqnarray*}$$ where the last step again uses \(\sum_i y_i = 1\).
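    The same combination can be traced numerically by assembling the Jacobian \(\partial p_i/\partial x_j = \delta_{ij}p_i - p_i p_j\) and the vector \(\partial L/\partial p_i = -y_i/p_i\) explicitly. A sketch (assuming NumPy; the example values are arbitrary):

    ```python
    import numpy as np

    def softmax(x):
        z = np.exp(x - np.max(x))
        return z / z.sum()

    x = np.array([2.0, 1.0, 0.1])
    y = np.array([0.7, 0.2, 0.1])
    p = softmax(x)

    dL_dp = -y / p                       # dL/dp_i = -y_i / p_i
    J = np.diag(p) - np.outer(p, p)      # J[i, j] = dp_i/dx_j = delta_ij p_i - p_i p_j

    grad = dL_dp @ J                     # dL/dx_j = sum_i (dL/dp_i)(dp_i/dx_j)
    assert np.allclose(grad, p - y)
    ```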