Deep Learning is a subset of Machine Learning that performs very well in many applications. It often outperforms classical ML in accuracy and in handling unstructured data. Deep Learning performs best when its parameters are set properly. These parameters are called "Hyper-Parameters", and the process of selecting good values for them is called "Hyper-Parameter Tuning".
If you have created a Deep Learning model as per the last post, "Do It Yourself Deep Learning Projects", this post will help you understand the parameters used in it. If you haven't seen the last post, it offers guidelines for creating your own DL project.
This post covers the important hyperparameters in DL.
Model Parameters and Hyper Parameters
Model parameters, such as weights and biases, are learned from the data during training. Hyperparameters are set before training and affect how the model learns. The hyperparameters that act as building blocks for the model include the learning rate, the choice of optimizer, the choice of activation functions, the batch size, the number of epochs, regularization techniques (such as Dropout) and loss functions.
Learning Rate
Probably the most important hyperparameter, the learning rate controls how fast your model “learns”.
So why not live life in the fast lane? It is not that simple. Remember, in deep learning, our goal is to minimize a loss function. If the learning rate is too high, the loss will start jumping all over the place and never converge, as can be seen in the figure below. If it is too low, training converges very slowly.
A common baseline learning rate is 0.01, which is the default value in many models.
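To make the effect of the learning rate concrete, here is a minimal sketch using plain gradient descent on a toy loss f(w) = w². The toy loss, the starting point and the two learning rates are assumptions chosen for illustration, not values from a real model.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize the toy loss f(w) = w**2 and return the loss after each step."""
    w = start
    losses = []
    for _ in range(steps):
        grad = 2.0 * w                    # derivative of w**2
        w = w - learning_rate * grad      # gradient descent update
        losses.append(w * w)
    return losses


small = gradient_descent(learning_rate=0.01)  # shrinks the loss steadily but slowly
large = gradient_descent(learning_rate=1.1)   # overshoots the minimum on every step

print(f"lr=0.01 final loss: {small[-1]:.4f}")
print(f"lr=1.1  final loss: {large[-1]:.4f}")
```

With the small rate the loss decreases on every step, while with the large rate each update jumps past the minimum and the loss keeps growing.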
Optimizers
We have already discussed the most common and popular optimization techniques in a past post, Optimization Methods in Deep Learning. To recap, here is a list of some optimization methods:
Stochastic Gradient Descent
Momentum
Nesterov accelerated gradient
AdaGrad (Adaptive Gradient Algorithm)
AdaDelta
RMS-Prop (Root Mean Square Propagation)
Adam
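As a rough illustration of how two of these methods differ, the sketch below compares a plain gradient descent step with a momentum step on the toy loss f(w) = w². The function names, the toy loss, and the values lr=0.01 and beta=0.9 are assumptions made for this example. Real frameworks implement these updates for you.

```python
def sgd_step(w, grad, lr=0.01):
    """Plain gradient descent update."""
    return w - lr * grad


def momentum_step(w, velocity, grad, lr=0.01, beta=0.9):
    """Momentum update: the velocity accumulates a decaying sum of past gradients."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity


w_sgd = 5.0
w_mom, velocity = 5.0, 0.0
for _ in range(50):
    w_sgd = sgd_step(w_sgd, grad=2.0 * w_sgd)                    # gradient of w**2 is 2w
    w_mom, velocity = momentum_step(w_mom, velocity, grad=2.0 * w_mom)

print(f"SGD      after 50 steps: w = {w_sgd:.4f}")
print(f"Momentum after 50 steps: w = {w_mom:.4f}")
```

On this toy problem, with the same learning rate, the momentum variant ends up closer to the minimum at w = 0 after 50 steps.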
Activation Functions
There is a separate article on this topic, "What are Activation Functions?", which gives a deeper dive. A sneak peek is as follows.
There are many activation functions, of which the following are the most popular:
Linear Function
Step Function
Sigmoid
Tanh (Hyperbolic Tangent)
ReLU (Rectified Linear Unit)
Leaky ReLU
Softmax
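The functions listed above can be sketched in a few lines of plain Python. These are the standard textbook definitions written for single values (and a list for softmax). The threshold of 0 for the step function and alpha=0.01 for Leaky ReLU are common choices assumed here.

```python
import math


def linear(x):
    return x


def step(x):
    return 1.0 if x >= 0 else 0.0       # threshold at 0 is an assumption


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # squashes x into (0, 1)


def tanh(x):
    return math.tanh(x)                  # squashes x into (-1, 1)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x     # small slope for negative inputs


def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0)])
print("softmax", [round(p, 4) for p in softmax([1.0, 2.0, 3.0])])
```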
Batch Size
Stochastic Training is when the minibatch size = 1 and Batch Training is when the minibatch size = Number of examples in the training set.
A larger batch size allows computational speedups from matrix multiplication in the training calculations, but that comes at the expense of needing more memory for the training process. A smaller batch size introduces more noise into the error calculations, which is often useful in preventing the training process from stopping at local minima.
A good value for the minibatch size is 32.
Recommended starting values: 1, 2, 4, 8, 16, 32, 64, 128, 256
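The following sketch shows how a training set might be split into minibatches and how the batch size changes the number of weight updates per epoch. The helper name `iterate_minibatches` and the 100-example dataset are hypothetical, used only for illustration.

```python
import random


def iterate_minibatches(examples, batch_size, shuffle=True, seed=0):
    """Yield successive minibatches; the last one may be smaller."""
    indices = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(indices)   # shuffle once per epoch
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


data = list(range(100))                         # stand-in for 100 training examples
for batch_size in (1, 32, 100):
    n_updates = sum(1 for _ in iterate_minibatches(data, batch_size))
    print(f"batch_size={batch_size:3d} -> {n_updates} updates per epoch")
```

A batch size of 1 corresponds to stochastic training, and a batch size equal to the dataset size corresponds to batch training.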
Number of Epochs
To choose the right number of epochs for our training step, the metric we should keep our eye on is the validation error. The intuitive manual approach is to let the model train for as many epochs as the validation error keeps decreasing.
There’s a technique called Early Stopping to determine when to stop training the model. It stops the training process if the validation error has not improved in the past 10 or 20 epochs.
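The early stopping rule can be sketched as a small function over a list of per-epoch validation errors. The function name, the `patience` parameter name and the sample error values are assumptions for this example. Deep learning frameworks usually offer this as a built-in callback.

```python
def early_stopping_epoch(val_errors, patience=10):
    """Return the 1-based epoch at which training stops.

    Training stops once the validation error has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors, start=1):
        if error < best:
            best = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_errors)   # never triggered: trained for all epochs


# Made-up validation errors: improvement stops after epoch 3.
errors = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63]
print(early_stopping_epoch(errors, patience=2))
```

In practice one would also keep the weights from the best epoch rather than the last one.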
Dropout
This technique is generally used in deep neural networks. During training, some number of layer outputs are randomly ignored or “dropped out.” This has the effect of making the layer look like, and be treated like, a layer with a different number of nodes and different connectivity to the prior layer. In effect, each update to a layer during training is performed with a different “view” of the configured layer. Dropout is a regularization technique proposed by Geoff Hinton that randomly sets activations in a neural network to 0 with a probability of p. This helps prevent neural nets from overfitting (memorizing) the data as opposed to learning it.
p is a hyperparameter.
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections.
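Here is a minimal sketch of dropout applied to a list of activations. It uses the "inverted dropout" variant, which rescales the surviving activations by 1 / (1 - p) so that their expected value is unchanged. That rescaling is an assumption not described above, though it is a common way to implement the technique.

```python
import random


def dropout(activations, p, training=True, rng=random):
    """Set each activation to 0 with probability p (training only).

    Surviving activations are scaled by 1 / (1 - p), the "inverted dropout"
    convention, so nothing needs to change at inference time.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


layer_output = [0.5, 1.2, 0.3, 0.9, 1.5, 0.7]
print(dropout(layer_output, p=0.5, rng=random.Random(42)))  # some entries zeroed
print(dropout(layer_output, p=0.5, training=False))          # unchanged at inference
```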
Loss Functions
Loss functions in Binary Classification-based problems
Binary Cross Entropy
Cross-entropy is a commonly used loss function for classification problems. It measures the difference between two probability distributions. If the cross entropy is small, it suggests that the two distributions are similar to each other.
In binary classification, the predicted probability is compared to the target/actual value (0 or 1). Binary cross-entropy calculates a score that summarizes the average difference between the actual and predicted probabilities for predicting class 1. The score penalizes a prediction more heavily the further the predicted probability is from the expected value. For a single example with actual label y and predicted probability ŷ, the loss function is defined as:
Loss = -y*log(ŷ) - (1-y)*log(1-ŷ)
i.e.,
Loss = -y * log(ŷ), if y = 1
or
Loss = -(1 - y) * log(1 - ŷ), if y = 0
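The formula above can be checked with a short sketch that averages the per-example loss over a batch. The clipping constant `eps`, which avoids taking log(0), and the sample predictions are assumptions added for this example.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average of -y*log(y_hat) - (1-y)*log(1-y_hat) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / len(y_true)


y = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]   # predictions close to the targets
bad = [0.2, 0.9, 0.3, 0.7]    # predictions far from the targets

print(f"good predictions: {binary_cross_entropy(y, good):.4f}")
print(f"bad predictions:  {binary_cross_entropy(y, bad):.4f}")
```

Predictions that are far from the targets receive a noticeably larger loss than predictions that are close.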
Loss functions in Multi-class Classification-based problems
a) Multi-class Cross Entropy
In multi-class classification, the predicted probability is compared to the target/actual value, where each class is assigned a unique integer value (0, 1, 2, …, n-1), assuming the data has "n" unique classes. It calculates a score that summarizes the average difference between the actual and predicted probability distributions across all classes. For a single example, the loss function is defined as:
Loss = -Σ y(k) * log(ŷ(k)), summed over all n classes k
Here, y(k) is 0 or 1, indicating whether class k is the correct classification, and ŷ(k) is the predicted probability for class k.
For the categorical cross entropy loss function, each target must be an n-dimensional vector in which all entries are 0 except the entry corresponding to the class, which is 1 (one-hot encoding).
For example, in a 3-class classification problem where the 1st observation belongs to the 3rd class, the 2nd observation belongs to the 1st class and the 3rd observation belongs to the 2nd class, the target (y) will be: y = [[0,0,1], [1,0,0], [0,1,0]]
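As a sketch, the one-hot targets from this example can be built from integer labels and passed to the usual categorical cross entropy, the negative sum over classes of y(k) * log(ŷ(k)), averaged over the observations. The labels are assumed to be zero-based here (so the 3rd class is index 2), and the predicted probabilities are made up for illustration.

```python
import math


def one_hot(label, num_classes):
    """Encode a zero-based integer class label as a one-hot vector."""
    return [1 if k == label else 0 for k in range(num_classes)]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average over examples of -sum_k y(k) * log(y_hat(k))."""
    total = 0.0
    for target, probs in zip(y_true, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(y_true)


labels = [2, 0, 1]                              # 3rd, 1st and 2nd class
y = [one_hot(label, 3) for label in labels]
y_hat = [[0.1, 0.2, 0.7],                       # made-up predicted probabilities
         [0.8, 0.1, 0.1],
         [0.3, 0.4, 0.3]]

print(y)
print(f"loss: {categorical_cross_entropy(y, y_hat):.4f}")
```

Only the predicted probability of the correct class contributes to each observation's loss, because every other target entry is 0.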
b) Sparse Multi-class Cross-Entropy Loss
Both multi-class cross entropy and sparse multi-class cross entropy use the same loss function, mentioned above. The only difference is the way the true labels (y) are defined. For sparse categorical cross entropy, one provides a single integer per observation rather than an n-dimensional vector. The integer represents the class of the observation.
For multi-class cross entropy, actual targets (y) are one-hot encoded. For the 3-class classification problem above: [[0,0,1], [1,0,0], [0,1,0]]
For sparse multi-class cross entropy, actual targets (y) are integers. For the same 3-class classification problem, using zero-based class indices: [2], [0], [1]
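To illustrate that the two losses differ only in how the targets are supplied, the sketch below computes both on the same made-up predictions. It assumes zero-based integer labels, so the 3rd class is written as 2. The function names are chosen for this example and are not taken from any particular library.

```python
import math


def categorical_cross_entropy(one_hot_targets, y_pred, eps=1e-12):
    """Targets are one-hot vectors; average of -sum_k y(k) * log(y_hat(k))."""
    total = 0.0
    for target, probs in zip(one_hot_targets, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(one_hot_targets)


def sparse_categorical_cross_entropy(labels, y_pred, eps=1e-12):
    """Targets are zero-based integer class indices; only one lookup per example."""
    total = 0.0
    for label, probs in zip(labels, y_pred):
        total += -math.log(max(probs[label], eps))
    return total / len(labels)


y_hat = [[0.1, 0.2, 0.7],      # made-up predicted probabilities
         [0.8, 0.1, 0.1],
         [0.3, 0.4, 0.3]]

dense = categorical_cross_entropy([[0, 0, 1], [1, 0, 0], [0, 1, 0]], y_hat)
sparse = sparse_categorical_cross_entropy([2, 0, 1], y_hat)

print(f"one-hot targets: {dense:.6f}")
print(f"integer targets: {sparse:.6f}")
```

Both calls give the same loss value. The sparse version simply indexes the predicted probability of the correct class instead of multiplying by a vector that is mostly zeros.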
Advantage compared to Multi-class Cross Entropy:
The example above shows that for multi-class cross entropy, the target must be a one-hot encoded vector that contains many zero values, which leads to a significant memory requirement. Sparse categorical cross entropy saves computation time and memory because it requires only a single integer per class label rather than a whole vector.
Disadvantage of Sparse Multi-class Cross Entropy:
Vector-valued targets can describe an input that belongs to more than one class, whereas sparse categorical cross entropy can only be used when each input belongs to exactly one class.
For example, if we have 3 classes (a, b, c) and an input belongs to both class b and class c, then the label can be represented as the vector [0,1,1], but it cannot be expressed as the single integer that sparse multi-class cross entropy requires.