Activation functions are like keys to a lock: they are the main elements that produce the required output from an equation like the one below.

y = f(w1*x1 + w2*x2 + ... + wn*xn + b)
This is the basic equation of a linear model, with an activation function f applied so the network can learn complex patterns in the data. The resulting values are then fed to the next layer of the neural network. This step is forward propagation, one of the two core learning steps of a NN. The other step, updating the weights and biases, is known as backward propagation. One full pass of forward and backward propagation over the training data completes one epoch.
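As a quick illustration, here is a minimal NumPy sketch of one forward-propagation step for a single layer; the names (`forward_layer`, `sigmoid`) and the toy layer sizes are just assumptions for the example, not part of any particular framework.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(x, W, b, activation=sigmoid):
    # Linear part z = W.x + b, then the activation adds non-linearity
    # before the result is passed on to the next layer.
    z = W @ x + b
    return activation(z)

# Toy example: 3 inputs feeding a layer of 2 neurons.
x = np.array([0.5, -1.2, 2.0])
W = np.random.randn(2, 3)
b = np.zeros(2)
print(forward_layer(x, W, b))
```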
There are many activation functions, of which the following are the most popular.
Linear Function
Step Function
Sigmoid
Tanh (Hyperbolic Tangent)
ReLU (Rectified Linear Unit)
Leaky ReLU
Softmax
Linear Function
This is also called the "no activation function", because a linear function returns its input essentially unchanged, exactly as in linear regression. It is largely outdated and rarely used in industry.
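For completeness, a tiny sketch of this "no activation" behaviour (the scaling constant `c` is just an illustrative parameter):

```python
def linear(z, c=1.0):
    # "No activation": the output is just a scaled copy of the input,
    # exactly like the prediction of a plain linear regression model.
    return c * z

print(linear(3.5))  # 3.5 -- the input passes straight through when c = 1
```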
Step Function
This is "Threshold Based" activation function which gets triggered if the threshold limit is crossed. This is very sensitive towards the threshold limit and can trigger even if the output surpasses the threshold even by a negligible amount. Thus it is not used in industry a lot. Small changes in weights and biases can cause very small changes in output and flip it from 0 to 1 and vice versa. This is also the base for working of a "Perceptron".
Sigmoid Function
Sigmoid is the solution to the problem with the step function: its curve is much smoother, so it carries information about the direction in which the weights should be changed when there is an error. It is the function used in logistic regression. However, sigmoid has a major problem known as the "vanishing gradient problem": because it squashes its output into the small range of 0 to 1, its gradient becomes close to zero for large positive or negative inputs, and these tiny gradients shrink further as they are propagated back through many layers.
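The sketch below (plain NumPy, names are illustrative) shows the sigmoid and its derivative; note how the derivative shrinks toward zero for large inputs, which is what drives the vanishing gradient problem:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: it peaks at 0.25 at z = 0 and shrinks
    # toward 0 for large |z|, which is what causes vanishing gradients.
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_grad(0.0))    # 0.5 0.25
print(sigmoid(10.0), sigmoid_grad(10.0))  # ~0.99995 ~0.0000454
```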
Hyperbolic Tangent Function (Tanh)
Tanh solves one problem of sigmoid by changing the output range to -1 to +1. Apart from that, the properties of the tanh function are the same as those of the sigmoid function. Like sigmoid, tanh is continuous and differentiable at all points. Tanh is usually preferred over sigmoid because it is zero-centered, so the gradients are not restricted to move in a single direction.
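A short NumPy illustration of tanh and its gradient (the printed values are approximate):

```python
import numpy as np

z = np.array([-2.0, 0.0, 2.0])
# tanh is zero-centered: outputs lie in (-1, +1) and are symmetric about 0.
print(np.tanh(z))           # approx. [-0.964  0.     0.964]
print(1 - np.tanh(z) ** 2)  # its gradient, which also vanishes for large |z|
```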
Rectified Linear Unit (ReLU)
ReLU is another activation function used in neural networks. It is NOT used in output nodes because it does not produce output in a bounded range; for example, if ReLU were used in the output of a Dog vs Cat classification problem, it could give an output greater than 1, which does not make sense. For negative input values the result is zero, meaning the neuron does not get activated. Since only a fraction of the neurons are activated at any time, ReLU is far more computationally efficient than sigmoid and tanh. On the negative side of the graph the gradient is zero, so during backpropagation the weights and biases of some neurons are never updated. This can create dead neurons that never get activated.
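A minimal NumPy sketch of ReLU; the example inputs are arbitrary:

```python
import numpy as np

def relu(z):
    # Positive values pass through unchanged; negative inputs become 0,
    # so those neurons do not activate and receive zero gradient.
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))  # [0. 0. 0. 2.]
```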
Leaky ReLU
Leaky ReLU is an improved version of the ReLU function. As we saw, the gradient of ReLU is 0 for x < 0, which deactivates the neurons in that region. Leaky ReLU is defined to address this problem: instead of setting the function to 0 for negative values of x, we define it as an extremely small linear component of x (for example 0.01x). With this small modification, the gradient on the left side of the graph becomes non-zero, so we no longer encounter dead neurons in that region.
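A minimal sketch of Leaky ReLU, assuming the commonly used slope of 0.01 on the negative side (the `alpha` parameter is illustrative):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Negative inputs keep a small slope (alpha) instead of being zeroed,
    # so the gradient on the left side is non-zero and neurons cannot "die".
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))  # [-0.03  -0.005  0.     2.   ]
```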
Softmax
The Softmax function is often described as a combination of multiple sigmoid functions. We know that sigmoid returns values between 0 and 1, which can be treated as the probability of a data point belonging to a particular class, so sigmoid is widely used for binary classification problems. Softmax extends this to multiclass classification: it returns, for any data point, the probability of belonging to each individual class. All the Softmax outputs together sum to 1, because they form a probability distribution: each value lies between 0 and 1, and together they account for 100% of the probability.
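A minimal NumPy sketch of Softmax on an arbitrary set of class scores; subtracting the maximum is a common numerical-stability trick, not something required by the definition:

```python
import numpy as np

def softmax(z):
    # Subtracting the max keeps the exponentials numerically stable;
    # the outputs are positive and sum to 1 -- a probability distribution.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw scores for 3 classes
probs = softmax(scores)
print(probs)        # approx. [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```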
Key Takeaways
Linear Function ==> No longer used ==> Also known as "No Activation Function".
Step Function ==> No longer used ==> Similar to Perceptron ==> Threshold Based.
Sigmoid Function ==> Used in classification problems ==> Output is between 0-1 ==> Can be adversely affected by Vanishing Gradient Problem.
Tanh Function ==> Preferred over sigmoid ==> Output is in range -1 to +1.
ReLU Function ==> Used in Hidden layers ==> Can cause Dead Neurons due to zero gradient in negative side of graph.
Leaky ReLU ==> Negative side is made non-zero, which solves the "Dead Neuron" problem.
Softmax ==> Combination of multiple sigmoid functions ==> Used in output node of NN ==> Used to solve multiclass classification problem.