After successfully fitting any ML model, we need to see how it performs, so we make predictions on unseen data. Two things can happen: the model performs well on both training and testing data, OR the model makes significant errors. This phase is called Model Evaluation. In this phase, we may encounter the problem of Underfitting or Overfitting, which brings us to BIAS and VARIANCE.
But What is Bias?
Bias is the difference between the average model prediction and the true value. High bias means the model is oversimplifying the data and will definitely score poorly on testing data. This is called Underfitting.
e.g. A model that has low training accuracy as well as low testing accuracy can be said to be underfit.
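Underfitting can be sketched with a tiny NumPy experiment (the data, noise level, and polynomial degree here are illustrative assumptions, not from the article): fitting a straight line to clearly quadratic data leaves a large error on the training set and the testing set alike.

```python
import numpy as np

rng = np.random.default_rng(0)

# True relationship is quadratic; a straight line is too simple for it.
x_train = np.linspace(-3, 3, 50)
y_train = x_train**2 + rng.normal(0, 0.5, size=x_train.shape)
x_test = np.linspace(-3, 3, 25)
y_test = x_test**2 + rng.normal(0, 0.5, size=x_test.shape)

# Degree-1 fit = high bias: it cannot capture the curvature.
coeffs = np.polyfit(x_train, y_train, deg=1)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

# Both errors stay large: the model underfits everywhere, not just on test data.
print(f"train MSE: {train_mse:.2f}, test MSE: {test_mse:.2f}")
```

Notice that the training error is nearly as bad as the testing error — that symmetry is the signature of underfitting.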
And what about Variance?
Variance is the spread of the model's predictions around their average. A high-variance model also picks up the noise in the data and learns it along with the signal, effectively memorizing the training set. Such a model might score around 95% to 99% on training data, but fail miserably on testing data. This is called Overfitting.
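Overfitting can be illustrated the same way (again with made-up synthetic data): a degree-11 polynomial pushed through 12 noisy points reproduces the training set almost exactly, yet fails on points it has not seen.

```python
import numpy as np

rng = np.random.default_rng(1)

# A few noisy samples from a simple linear trend.
x_train = np.linspace(0, 1, 12)
y_train = 2 * x_train + rng.normal(0, 0.3, size=x_train.shape)
x_test = np.linspace(0.02, 0.98, 12)
y_test = 2 * x_test + rng.normal(0, 0.3, size=x_test.shape)

# A degree-11 polynomial can pass through every training point: high variance.
coeffs = np.polyfit(x_train, y_train, deg=11)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

# Near-zero training error, much larger testing error: the model learned the noise.
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.2f}")
```

The near-perfect training score is exactly the 95%–99% trap described above: the model has memorized the noise rather than learned the trend.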
The Tradeoff
Tradeoff simply means BALANCE: the balance between Bias and Variance. A good data science practitioner knows how to strike this balance. It can be pursued through several methods, such as using ensemble learning, using a wider variety of data points, balancing imbalanced datasets, etc.
By these methods we can reduce the error but cannot eliminate it, because in general, if we reduce bias, variance increases, and vice versa.
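One way to see this balance numerically (a minimal sketch with assumed synthetic data, not from the article): sweep the model complexity, here the polynomial degree, and compare training and testing error at each setting.

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy samples from a quadratic truth.
x_train = np.linspace(-3, 3, 15)
y_train = x_train**2 + rng.normal(0, 0.5, size=x_train.shape)
x_test = np.linspace(-2.8, 2.8, 15)
y_test = x_test**2 + rng.normal(0, 0.5, size=x_test.shape)

# Try an underfit (deg=1), a balanced (deg=3), and an overfit (deg=12) model.
results = {}
for deg in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, deg)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[deg] = (train_mse, test_mse)

for deg, (tr, te) in results.items():
    print(f"deg={deg:2d}  train MSE: {tr:.3f}  test MSE: {te:.3f}")
```

Training error only ever goes down as complexity grows, but testing error is lowest at the moderate degree: that minimum is the balance the tradeoff refers to.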
The Mathematics
Y=f(X) + c
The above equation shows the relationship between the dependent variable 'Y' and the independent variable 'X', with the error term 'c'.
Now if we make a model f̂(X) of f(X) using any modeling technique, the expected squared error at a point x is:

Err(x) = E[(Y − f̂(x))²]

This error can be drilled down further into the TOTAL ERROR shown in the equation below. A perfect model has the minimum total error:

Err(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ²
       = Bias² + Variance + Irreducible Error

where σ² is the variance of the error term 'c'.
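The decomposition of the total error into bias², variance, and irreducible error can be checked numerically. Below is a Monte-Carlo sketch under assumed settings (quadratic truth, deliberately simple linear model, noise with σ = 0.5): we fit many models on fresh training sets and measure all the pieces at a single point.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5   # standard deviation of the irreducible noise term c
x0 = 2.0      # point at which we decompose the error

def f(x):
    # The true function f(X); the model below is too simple for it.
    return x**2

preds, errors = [], []
for _ in range(5000):
    # A fresh training set each round: Y = f(X) + c
    x = np.linspace(-3, 3, 30)
    y = f(x) + rng.normal(0, sigma, size=x.shape)
    coeffs = np.polyfit(x, y, deg=1)       # high-bias linear model
    pred = np.polyval(coeffs, x0)
    preds.append(pred)
    # Squared error against a fresh noisy observation at x0
    errors.append((f(x0) + rng.normal(0, sigma) - pred) ** 2)

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0)) ** 2     # (E[f̂(x0)] − f(x0))²
variance = preds.var()                    # E[(f̂(x0) − E[f̂(x0)])²]
total = np.mean(errors)                   # E[(Y − f̂(x0))²]

print(f"bias² + variance + σ² = {bias_sq + variance + sigma**2:.3f}")
print(f"measured total error  = {total:.3f}")
```

The two printed numbers agree up to Monte-Carlo noise, and for this underfit model the bias² term dominates the variance term, matching the discussion above.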
The best way to visualize all of the above is the classic plot of error against model complexity: as complexity grows, bias falls while variance rises, and the total error traces a U-shaped curve whose minimum marks the optimal balance.
End Notes
We cannot eliminate Error
Reduce the Bias -> Variance increases
Reduce the Variance -> Bias increases
Finally, we as data scientists have to build a model with an optimal balance of bias and variance, achieving the lowest possible error at an appropriate level of model complexity.