In general, best results are achieved during training if we:
A.:
When using SGD, it is necessary to gradually decrease the learning rate over time, because the random sampling of a mini-batch introduces noise into the gradient estimate. The true gradient becomes small and then 0 as we approach and reach a minimum, but the estimated (stochastic) gradient does not: its sampling noise persists even at the minimum, so with a fixed learning rate the parameters keep fluctuating around the minimum instead of settling on it.
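To make the effect concrete, here is a minimal sketch rather than a definitive recipe. It assumes a toy one-dimensional least-squares problem on synthetic data, and every name and constant in it (run_sgd, constant_lr, decaying_lr, the learning rates, the batch size) is hypothetical and chosen only for illustration. It runs mini-batch SGD once with a fixed learning rate and once with a rate that decays linearly to a small floor, then compares how far each run stays from the exact minimiser.

```python
import numpy as np


def run_sgd(lr_schedule, steps=2000, batch_size=8, seed=0):
    """Minimise the mean of 0.5 * (w - x_i)**2 with mini-batch SGD.

    Returns the average distance to the exact minimiser over the last 200 steps.
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=3.0, scale=1.0, size=1000)  # synthetic dataset (assumption)
    w_star = data.mean()  # exact minimiser of the full objective
    w = 0.0
    errors = []
    for k in range(steps):
        # Sample a mini-batch (with replacement) and form the stochastic gradient.
        batch = data[rng.integers(0, data.size, size=batch_size)]
        grad = w - batch.mean()  # noisy estimate; nonzero even when w == w_star
        w -= lr_schedule(k) * grad
        errors.append(abs(w - w_star))
    return float(np.mean(errors[-200:]))


def constant_lr(k):
    """Fixed learning rate for every step."""
    return 0.5


def decaying_lr(k, lr_start=0.5, lr_end=0.005, tau=1000):
    """Linear decay from lr_start to lr_end over tau steps, constant afterwards."""
    alpha = min(k / tau, 1.0)
    return (1.0 - alpha) * lr_start + alpha * lr_end


fixed_err = run_sgd(constant_lr)
decayed_err = run_sgd(decaying_lr)
print(f"fixed learning rate:    mean |w - w*| = {fixed_err:.4f}")
print(f"decaying learning rate: mean |w - w*| = {decayed_err:.4f}")
```

Both runs share the same seed, so they see the same data and the same mini-batches; only the schedule differs. In this toy setting the fixed-rate run keeps fluctuating around the minimiser at a scale set by the mini-batch noise, while the decaying-rate run ends much closer to it. The linear schedule is just one common choice; other decay rules would illustrate the same point.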