  1. Consider the last layer of a typical classifier, with logits $x=(x_1,x_2,\dots,x_n)$, a softmax activation yielding $p_i=\exp(x_i)/\sum_j\exp(x_j)$, and a cross-entropy loss function

    $$L=-\sum_i y_i\log p_i,$$

    where $y_1,y_2,\dots,y_n$ are labels that correspond to a probability distribution, that is, $y_i\ge 0$ and $\sum_i y_i=1$.

    Compute the partial derivatives $\partial L/\partial x_i$ needed for backpropagation.

  2. Answer 1 - Direct Approach:

    Expand and simplify the formula for $L$:

    $$\begin{aligned}L=-\sum_i y_i\log p_i&=-\sum_i y_i\left(\log\exp(x_i)-\log\sum_k\exp(x_k)\right)=-\sum_i y_i\left(x_i-\log\sum_k\exp(x_k)\right)\\&=-\sum_i y_i x_i+\sum_i y_i\log\sum_k\exp(x_k)=-\sum_i y_i x_i+\log\sum_k\exp(x_k),\end{aligned}$$

    where the last equality uses $\sum_i y_i=1$.

    Now compute the derivative:

    $$\frac{\partial L}{\partial x_j}=-y_j+\frac{\partial}{\partial x_j}\log\sum_k\exp(x_k)=-y_j+\frac{\exp(x_j)}{\sum_k\exp(x_k)}=-y_j+p_j.$$

    Answer 2 - Chain Rule:

    The chain rule gives us:

    $$\frac{\partial L}{\partial x_j}=\sum_i\frac{\partial L}{\partial p_i}\,\frac{\partial p_i}{\partial x_j}$$

    Computing each factor of the product:

    $$\frac{\partial L}{\partial p_i}=-\frac{y_i}{p_i},\qquad\frac{\partial p_i}{\partial x_j}=\frac{\partial}{\partial x_j}\left(\frac{\exp(x_i)}{\sum_k\exp(x_k)}\right)=\frac{\delta_{ij}\exp(x_i)}{\sum_k\exp(x_k)}-\frac{\exp(x_i)\exp(x_j)}{\left(\sum_k\exp(x_k)\right)^2}=\delta_{ij}p_i-p_ip_j,$$

    where $\delta_{ij}$ is 1 if $i=j$ and 0 otherwise.

    Combining previous computations:

    $$\frac{\partial L}{\partial x_j}=\sum_i\left(-\frac{y_i}{p_i}\right)\left(\delta_{ij}p_i-p_ip_j\right)=-\sum_i y_i\delta_{ij}+\sum_i y_i p_j=-y_j+p_j,$$

    again using $\sum_i y_i=1$ in the last step, in agreement with the direct approach.
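
Both derivations give the same gradient. As a quick numerical sanity check, the sketch below (assuming NumPy; the helper names `softmax` and `cross_entropy`, the vector length, and the random seed are illustrative and not part of the problem) compares the analytic gradient $p-y$ against central finite differences of $L$.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; this does not change p.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def cross_entropy(x, y):
    # L = -sum_i y_i * log p_i, with p = softmax(x)
    return -np.sum(y * np.log(softmax(x)))

rng = np.random.default_rng(0)
x = rng.normal(size=5)            # arbitrary logits
y = rng.random(5)
y /= y.sum()                      # labels forming a probability distribution

analytic = softmax(x) - y         # the derived gradient: dL/dx_j = p_j - y_j

# Central finite differences: (L(x + h e_j) - L(x - h e_j)) / (2h)
h = 1e-6
numeric = np.zeros_like(x)
for j in range(len(x)):
    e = np.zeros_like(x)
    e[j] = h
    numeric[j] = (cross_entropy(x + e, y) - cross_entropy(x - e, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))   # ~1e-10: the two gradients agree
```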
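
The intermediate steps of the chain-rule approach can be checked the same way: the softmax Jacobian $\partial p_i/\partial x_j=\delta_{ij}p_i-p_ip_j$ contracted with $\partial L/\partial p_i=-y_i/p_i$ should reproduce $p-y$. The sketch below (again assuming NumPy and illustrative names) builds both factors explicitly.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

rng = np.random.default_rng(1)
x = rng.normal(size=5)
y = rng.random(5)
y /= y.sum()

p = softmax(x)
jacobian = np.diag(p) - np.outer(p, p)   # dp_i/dx_j = delta_ij p_i - p_i p_j
dL_dp = -y / p                           # dL/dp_i = -y_i / p_i

# Chain rule: dL/dx_j = sum_i (dL/dp_i) (dp_i/dx_j)
chain_rule = dL_dp @ jacobian
print(np.max(np.abs(chain_rule - (p - y))))   # ~1e-16: matches -y_j + p_j
```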