Consider the last layer of a typical classifier, with logits $x = (x_1, x_2, \ldots, x_n)$, a softmax activation yielding $p_i = \exp(x_i) / \sum_j \exp(x_j)$, and a cross-entropy loss function
$$L = -\sum_i y_i \log p_i,$$
where $y_1, y_2, \ldots, y_n$ are labels that correspond to a probability distribution, that is, $y_i \ge 0$ and $\sum_i y_i = 1$.
Compute the partial derivatives $\partial L / \partial x_i$ needed for backpropagation.
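To make the setup concrete, here is a minimal sketch in NumPy (the logits, labels, and function names are illustrative, not part of the original problem) that computes the softmax probabilities $p_i$ and the loss $L$:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the ratio, so the probabilities are unchanged.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def cross_entropy(y, p):
    # y is a probability distribution over classes (e.g. a one-hot label).
    return -np.sum(y * np.log(p))

x = np.array([2.0, -1.0, 0.5])   # example logits
y = np.array([1.0, 0.0, 0.0])    # example one-hot label
p = softmax(x)
L = cross_entropy(y, p)
```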
Answer 1 - Direct Approach:
Expand and simplify the formula for $L$:
$$
\begin{aligned}
L = -\sum_i y_i \log p_i
&= -\sum_i y_i \Bigl(\log \exp(x_i) - \log \sum_k \exp(x_k)\Bigr)
 = -\sum_i y_i \Bigl(x_i - \log \sum_k \exp(x_k)\Bigr) \\
&= -\sum_i y_i x_i + \sum_i y_i \log \sum_k \exp(x_k)
 = -\sum_i y_i x_i + \log \sum_k \exp(x_k),
\end{aligned}
$$
where the last step uses $\sum_i y_i = 1$.
Now compute the derivative:
$$\frac{\partial L}{\partial x_j}
= -y_j + \frac{\partial}{\partial x_j} \log \sum_k \exp(x_k)
= -y_j + \frac{\exp(x_j)}{\sum_k \exp(x_k)}
= -y_j + p_j.$$
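This is the familiar "probabilities minus labels" gradient. As a sanity check, here is a sketch (assuming NumPy; the test vectors are arbitrary) that compares $p_j - y_j$ against a central finite-difference approximation of $\partial L / \partial x_j$:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def loss(x, y):
    return -np.sum(y * np.log(softmax(x)))

x = np.array([2.0, -1.0, 0.5])
y = np.array([0.7, 0.2, 0.1])    # any distribution, not necessarily one-hot
analytic = softmax(x) - y

eps = 1e-6
numeric = np.array([
    (loss(x + eps * np.eye(3)[j], y) - loss(x - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```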
Answer 2 - Chain rule:
The chain rule gives us:
$$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial p_i}\,\frac{\partial p_i}{\partial x_j}.$$
Computing each of the two factors:
$$\frac{\partial L}{\partial p_i} = -\frac{y_i}{p_i},
\qquad
\frac{\partial p_i}{\partial x_j}
= \frac{\partial}{\partial x_j}\!\left(\frac{\exp(x_i)}{\sum_k \exp(x_k)}\right)
= \frac{\delta_{ij}\exp(x_i)}{\sum_k \exp(x_k)} - \frac{\exp(x_i)\exp(x_j)}{\bigl(\sum_k \exp(x_k)\bigr)^2}
= \delta_{ij} p_i - p_i p_j,$$
where $\delta_{ij}$ is 1 if $i = j$ and 0 otherwise.
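In matrix form the softmax Jacobian $\partial p_i / \partial x_j = \delta_{ij} p_i - p_i p_j$ is $\operatorname{diag}(p) - p p^\top$. The sketch below (assuming NumPy; example values are arbitrary) checks this against a finite-difference Jacobian:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

x = np.array([2.0, -1.0, 0.5])
p = softmax(x)
analytic = np.diag(p) - np.outer(p, p)   # entry (i, j) is delta_ij p_i - p_i p_j

eps = 1e-6
numeric = np.column_stack([
    (softmax(x + eps * np.eye(3)[j]) - softmax(x - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```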
Combining previous computations:
$$\frac{\partial L}{\partial x_j}
= \sum_i \left(-\frac{y_i}{p_i}\right)\bigl(\delta_{ij} p_i - p_i p_j\bigr)
= -\sum_i y_i \delta_{ij} + \sum_i y_i p_j
= -y_j + p_j,$$
using $\sum_i y_i = 1$ again in the last step.
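Both approaches agree. The same combination can also be carried out numerically; the sketch below (assuming NumPy, with arbitrary example values) multiplies $\partial L/\partial p$ by the softmax Jacobian and confirms that the result equals $p - y$ from Answer 1:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

x = np.array([2.0, -1.0, 0.5])
y = np.array([0.7, 0.2, 0.1])
p = softmax(x)

dL_dp = -y / p                          # dL/dp_i = -y_i / p_i
dp_dx = np.diag(p) - np.outer(p, p)     # dp_i/dx_j = delta_ij p_i - p_i p_j
dL_dx = dL_dp @ dp_dx                   # chain rule: sum over i

print(np.allclose(dL_dx, p - y))        # True
```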