An optimizer (in particular, a variant of SGD) is said to use momentum when it takes previous weight updates into account when computing the next weight update, rather than looking only at the current gradient. This helps the optimization avoid getting stuck in a local minimum.

A naive implementation, with explanatory comments:

# previous update direction, starts at zero
past_velocity = 0.
# constant momentum factor: how much of the previous velocity is kept,
# similar to the weight in a moving average
momentum = 0.1
# step size (assumed constant here)
learning_rate = 0.01
# optimization loop
while loss > 0.01:
    # fetch the current weights, loss, and gradient from the model
    w, loss, gradient = model.params()
    # blend the previous velocity with the current gradient step
    velocity = past_velocity * momentum + learning_rate * gradient
    # move the weights using both the accumulated velocity and the current
    # gradient, so the update carries some of the previous direction and
    # can roll past a small local minimum instead of stalling there
    w = w + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    model.update(w)
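
For comparison, here is a self-contained, runnable sketch of the more standard momentum formulation (velocity = momentum * velocity - learning_rate * gradient, then w = w + velocity), applied to a toy one-dimensional loss. The loss function, starting point, and hyperparameter values below are illustrative assumptions, not part of the pseudocode above:

# toy loss: a simple parabola with its minimum at w = 3
def loss_fn(w):
    return (w - 3.0) ** 2

# analytic gradient of the toy loss
def grad_fn(w):
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
velocity = 0.0       # accumulated update direction
learning_rate = 0.1  # step size (illustrative value)
momentum = 0.8       # fraction of the previous velocity that is kept

for step in range(100):
    gradient = grad_fn(w)
    # standard momentum update: keep part of the previous velocity
    # and add the current gradient step
    velocity = momentum * velocity - learning_rate * gradient
    w = w + velocity

print(w, loss_fn(w))  # w ends up close to 3.0, the loss close to 0

Unlike the naive version above, the standard form folds the gradient step directly into the velocity, so the weight update is just the velocity itself.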