An optimizer (in particular, a variant of SGD) is said to use momentum when it takes previous weight updates into account when computing the next update, rather than looking only at the current gradient; this helps prevent the optimization from getting stuck in a local minimum. A naive implementation and explanation:

    past_velocity = 0.
    momentum = 0.1  # constant momentum factor, kind of like the weight in a moving average

    # optimization loop
    while loss > 0.01:
        w, loss, gradient = model.params()
        # the velocity accumulates a fraction of the previous velocity
        # plus the current gradient step
        velocity = past_velocity * momentum - learning_rate * gradient
        # move the weight by the current gradient step plus an extra push from
        # the accumulated velocity; that extra push is what carries the
        # optimizer past small local minima instead of settling into them
        w = w + momentum * velocity - learning_rate * gradient
        past_velocity = velocity
        model.update(w)
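
For a concrete picture of the same update rule, here is a minimal runnable sketch that applies it to a simple one-dimensional loss. The quadratic loss, the starting point, and the hyperparameter values are illustrative assumptions chosen for the example, not part of the original pseudocode.

    # Illustrative loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
    def loss_fn(w):
        return (w - 3.0) ** 2

    def gradient_of_loss(w):
        return 2.0 * (w - 3.0)

    learning_rate = 0.1
    momentum = 0.1
    w = -4.0            # arbitrary starting weight (assumption)
    past_velocity = 0.0

    loss = loss_fn(w)
    step = 0
    while loss > 0.01 and step < 1000:
        gradient = gradient_of_loss(w)
        # same two update lines as the pseudocode above
        velocity = past_velocity * momentum - learning_rate * gradient
        w = w + momentum * velocity - learning_rate * gradient
        past_velocity = velocity
        loss = loss_fn(w)
        step += 1

    print(f"converged to w={w:.3f} (loss={loss:.4f}) in {step} steps")

On this smooth quadratic the momentum term simply adds a small push in the direction the weight was already moving; on a bumpier loss surface that same push is what keeps the weight from coming to rest in a shallow local minimum.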