But what is an "Optimizer" in the first place?
Optimizers are algorithms or methods used to minimize an error (loss) function or to maximize some measure of performance. They are mathematical procedures that act on a model's learnable parameters, such as its weights and biases, and they determine how the weights and the learning rate of a neural network should change in order to reduce the loss.
Gradient Descent
Gradient Descent is one of the most commonly used optimization algorithms for training machine learning models; it works by iteratively reducing the error between actual and predicted results. It is also the standard workhorse for training neural networks.
Optimization is the task of minimizing a cost function parameterized by the model's parameters. The main objective of gradient descent is to minimize this cost function through repeated parameter updates. Once optimized, these models can be used as powerful tools for artificial intelligence and various computer science applications. Gradient descent finds a nearby, or "local", minimum; its mirror image, gradient ascent, finds a nearby "local" maximum.
But what are a local minimum and a local maximum?
The best way to define the local minimum or local maximum of a function using gradient descent is as follows:
If we move towards a negative gradient or away from the gradient of the function at the current point, it will give the local minimum of that function.
Whenever we move towards a positive gradient or towards the gradient of the function at the current point, we will get the local maximum of that function.
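As a minimal sketch in Python (the function names and the toy quadratic cost are illustrative assumptions, not from any particular library), the basic update rule theta = theta - lr * gradient looks like this:

    import numpy as np

    def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
        # Vanilla (batch) gradient descent: repeatedly step against the gradient.
        for _ in range(steps):
            theta = theta - lr * grad_fn(theta)
        return theta

    # Toy cost J(theta) = theta**2, gradient 2*theta; the minimum is at 0.
    print(gradient_descent(lambda t: 2 * t, theta=np.array(5.0)))  # -> close to 0.0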
Points to remember about Gradient Descent:
Easy to understand
Easy to implement
It calculates the gradient for the entire data set in one update, so the calculation is very slow.
It requires large memory and it is computationally expensive.
What is Learning Rate and why is it important?
As the name suggests, it is the size of the step with which the GD algorithm descends the slope; it scales the gradient in each update. It should be just right in order to get the optimum result: too small and convergence is very slow, too large and the updates overshoot the minimum or even diverge, as the hypothetical toy run below illustrates.
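Here the same toy cost J(theta) = theta**2 (gradient 2*theta) is minimized with a small and an overly large learning rate; the numbers are illustrative only:

    # A small lr converges; a too-large lr makes the iterate oscillate and blow up.
    for lr in (0.1, 1.1):
        theta = 5.0
        for _ in range(10):
            theta -= lr * 2 * theta
        print(lr, theta)  # lr=0.1 -> shrinks toward 0; lr=1.1 -> grows without bound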
Stochastic Gradient Descent
It is a variant of GD that, instead of computing the gradient over the entire data set, updates the parameters using one training example (or a small batch) at a time. SGD therefore performs frequent updates with a high variance, which causes the objective function to fluctuate heavily.
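A minimal sketch of per-example SGD (names and the toy objective are illustrative assumptions, not a specific library API):

    import numpy as np

    def sgd(grad_fn, theta, data, lr=0.01, epochs=100):
        # Stochastic gradient descent: one update per (shuffled) training example.
        for _ in range(epochs):
            np.random.shuffle(data)
            for example in data:
                theta = theta - lr * grad_fn(theta, example)
        return theta

    # Toy: minimize the mean squared distance to the data (optimum = data mean).
    data = np.array([1.0, 2.0, 3.0, 4.0])
    print(sgd(lambda t, x: 2 * (t - x), theta=0.0, data=data))  # -> approx 2.5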
Points to remember about SGD:
Frequent updates of model parameters.
Requires less memory, since the gradient is computed on one example (or a small batch) at a time.
Allows the use of large data sets.
Frequent updates also produce noisy gradients, which may cause the error to increase instead of decrease.
High Variance.
Frequent updates are computationally costly, since every single example triggers a full update step.
Momentum
SGD struggles in some cases around local minima; the result can be noisy, and Momentum excels in exactly this situation. The method is named after the physical phenomenon of momentum: if we push a ball down a hill, it accumulates momentum as it rolls, becoming faster and faster on the way. The same thing happens to our parameter updates. The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
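A minimal sketch of one momentum update (beta, the momentum coefficient, is a common but assumed default):

    def momentum_step(theta, velocity, grad, lr=0.01, beta=0.9):
        # The velocity is an exponentially decaying accumulation of past gradients;
        # the parameters move along the velocity, not along the raw gradient.
        velocity = beta * velocity - lr * grad
        return theta + velocity, velocity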
Points to remember:
Momentum helps to reduce the noise.
An exponentially weighted average is used to smooth the curve.
An extra hyperparameter (the momentum coefficient) is added.
Nesterov accelerated gradient
Nesterov's momentum method is a modification of SGD with momentum that allows even faster convergence in practice. With Nesterov accelerated gradient (NAG) descent, the update term is derived from the gradient of the loss function evaluated at refined, "look-ahead" parameter values. These refined values are computed by applying an update step that uses the momentum history as the gradient term.
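A minimal sketch of one NAG step (illustrative names; beta is the assumed momentum coefficient), differing from plain momentum only in where the gradient is evaluated:

    def nag_step(theta, velocity, grad_fn, lr=0.01, beta=0.9):
        # Evaluate the gradient at the look-ahead point theta + beta*velocity,
        # i.e. where the momentum history is about to carry the parameters.
        lookahead_grad = grad_fn(theta + beta * velocity)
        velocity = beta * velocity - lr * lookahead_grad
        return theta + velocity, velocity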
AdaGrad (Adaptive Gradient Descent)
In all the algorithms discussed previously, the learning rate remains constant. The intuition behind AdaGrad is: can we use a different learning rate for every neuron in every hidden layer, that is, for each parameter, adapted across iterations? One of AdaGrad's main benefits is that it eliminates the need to manually tune the learning rate; most implementations use a default value of 0.01 and leave it at that.
AdaGrad's main weakness is its accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
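A minimal sketch of one AdaGrad update (eps is a small assumed constant for numerical stability):

    import numpy as np

    def adagrad_step(theta, accum, grad, lr=0.01, eps=1e-8):
        # Accumulate squared gradients per parameter; the effective step
        # lr / sqrt(accum) therefore shrinks monotonically over training.
        accum = accum + grad ** 2
        theta = theta - lr * grad / (np.sqrt(accum) + eps)
        return theta, accum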
Points to remember:
Learning Rate changes adaptively with iterations.
It works well with sparse data.
If the neural network is deep, the accumulated sum drives the effective learning rate down to a very small number, which can effectively halt learning (the dead-neuron problem).
AdaDelta
AdaDelta is an extension of AdaGrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to some fixed size, implemented as an exponentially decaying running average. With AdaDelta we do not need to set a default learning rate, because each update is scaled by the ratio of the running average of past parameter updates to the running average of past squared gradients.
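A minimal sketch of one AdaDelta update (rho and eps are common but assumed defaults):

    import numpy as np

    def adadelta_step(theta, sq_grad_avg, sq_update_avg, grad, rho=0.95, eps=1e-6):
        # Decaying average of squared gradients ("fixed window" via rho).
        sq_grad_avg = rho * sq_grad_avg + (1 - rho) * grad ** 2
        # Scale by the ratio of past-update RMS to gradient RMS:
        # no global learning rate appears anywhere.
        update = -(np.sqrt(sq_update_avg + eps) / np.sqrt(sq_grad_avg + eps)) * grad
        sq_update_avg = rho * sq_update_avg + (1 - rho) * update ** 2
        return theta + update, sq_grad_avg, sq_update_avg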
Points to remember:
The main advantage of AdaDelta is that we do not need to set a default learning rate.
Computationally more expensive, since it maintains extra running averages per parameter.
RMS-Prop (Root Mean Square Propagation)
RMSprop and Adadelta were both developed independently around the same time, stemming from the need to resolve AdaGrad's radically diminishing learning rates. RMSProp is a modified version of AdaGrad in which the cumulative sum of squared gradients is replaced by an exponentially decaying average of squared gradients, so the effective learning rate no longer shrinks toward zero.
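A minimal sketch of one RMSProp update (lr, rho and eps are typical but assumed defaults):

    import numpy as np

    def rmsprop_step(theta, sq_grad_avg, grad, lr=0.001, rho=0.9, eps=1e-8):
        # A decaying average of squared gradients replaces AdaGrad's growing sum,
        # so the per-parameter step size does not vanish over time.
        sq_grad_avg = rho * sq_grad_avg + (1 - rho) * grad ** 2
        theta = theta - lr * grad / (np.sqrt(sq_grad_avg) + eps)
        return theta, sq_grad_avg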
Points to remember:
In RMS-Prop learning rate gets adjusted automatically and it chooses a different learning rate for each parameter.
Learning can still be slow compared with momentum-based methods.
Comparison of all the above methods
In the figure below, we can see the optimizers settle for a moment at a local minimum and then descend towards the global minimum.