Well, not exactly in machine learning. Here, as the number of variables increases, the model is likely to underfit or overfit and fail at its task. The number of variables should be controlled smartly and statistically.
"A lot of people do not believe in curses, but Data Scientists do!"
(The Curse of Dimensionality)
Thus, we need to do something to lower the risk of overfitting or underfitting the model. There are many Dimensionality Reduction techniques available, as shown in the figure below. Among these, the most prominent is PCA, due to its ease of use and popularity amongst data scientists.
image credit: Dataaspirant.com
A larger number of features might sound like a lot of information is available about a certain topic, which is TRUE, but it also pushes a machine learning algorithm to fit the features too closely and probably overfit or underfit.
In this blog, we'll look at PCA and the math under the hood.
There are some steps involved in PCA, which can be seen in the figure below.
Now, we cannot just apply PCA directly to any dataset. We need to take some precautions beforehand to make PCA work. The assumptions involved in PCA are (a quick check of these is sketched after the list):
The dataset has NO NULL VALUES.
The DATA IS STANDARDIZED in the dataset.
ADEQUATE AMOUNT OF DATA is available.
NO SIGNIFICANT OUTLIERS are present.
The data is SUITABLE FOR DATA REDUCTION without much information loss.
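Below is a minimal sketch of how these assumptions can be sanity-checked before running PCA. It assumes the data sits in a numeric-only pandas DataFrame; the DataFrame `df` and the |z| > 3 outlier rule are my own illustrative choices, not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric-only DataFrame standing in for your dataset
df = pd.DataFrame({"A": [1, 4, 5], "B": [2, 1, 4], "C": [4, 2, 8]})

# 1. The dataset has no null values
assert df.isnull().sum().sum() == 0, "Dataset contains null values"

# 2. Standardize every column to mean 0 and standard deviation 1
df_std = (df - df.mean()) / df.std()

# 3. Flag significant outliers with a simple |z| > 3 rule of thumb
has_outliers = bool((np.abs(df_std) > 3).any().any())
print("Significant outliers present:", has_outliers)
```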
Here, I've taken a dummy dataset to show the various steps in PCA. Let's have a look at PCA step by step with the help of an example:
Dataset Standardization
Matrix_OD (Original Data)
|          | Feature A | Feature B | Feature C |
|----------|-----------|-----------|-----------|
| Sample 1 | 1         | 2         | 4         |
| Sample 2 | 4         | 1         | 2         |
| Sample 3 | 5         | 4         | 8         |
| Mean     | 3.33      | 2.33      | 4.67      |
| Std Dev  | 2.08      | 1.53      | 3.06      |
The formula for standardization is z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
Matrix_SD (Standardized Data)
|          | Feature A | Feature B | Feature C |
|----------|-----------|-----------|-----------|
| Sample 1 | -1.12     | -0.22     | -0.22     |
| Sample 2 | 0.32      | -0.87     | -0.87     |
| Sample 3 | 0.80      | 1.09      | 1.09      |
| Mean     | 0         | 0         | 0         |
| Std Dev  | 1         | 1         | 1         |
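As a quick check, here is a small NumPy sketch (an illustrative aside, not part of the original post) that reproduces Matrix_SD from Matrix_OD:

```python
import numpy as np

# Original data (Matrix_OD): rows are samples, columns are Features A, B, C
X = np.array([[1, 2, 4],
              [4, 1, 2],
              [5, 4, 8]], dtype=float)

# Standardize each column: z = (x - mean) / std, using the sample std (ddof=1)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.round(X_std, 2))
# [[-1.12 -0.22 -0.22]
#  [ 0.32 -0.87 -0.87]
#  [ 0.8   1.09  1.09]]
```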
Calculating the Covariance Matrix
A covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector. Any covariance matrix is symmetric and positive semi-definite and its main diagonal contains variances (i.e., the covariance of each element with itself).
Each entry is computed as cov(X, Y) = (1/n) · Σ (xᵢ − x̄)(yᵢ − ȳ). Applying this to our standardized data gives the following covariance matrix.
|           | Feature A | Feature B | Feature C |
|-----------|-----------|-----------|-----------|
| Feature A | 0.67      | 0.28      | 0.28      |
| Feature B | 0.28      | 0.67      | 0.67      |
| Feature C | 0.28      | 0.67      | 0.67      |
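A short NumPy sketch of this step (my own illustration; `bias=True` makes np.cov divide by n rather than n − 1, which matches the numbers above):

```python
import numpy as np

# Recompute the standardized data exactly (same as Matrix_SD)
X = np.array([[1, 2, 4], [4, 1, 2], [5, 4, 8]], dtype=float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix of the standardized features (columns are the variables)
cov_matrix = np.cov(X_std, rowvar=False, bias=True)
print(np.round(cov_matrix, 2))
# [[0.67 0.28 0.28]
#  [0.28 0.67 0.67]
#  [0.28 0.67 0.67]]
```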
Calculating Eigenvalues and Eigenvectors
An eigenvector or characteristic vector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue, often denoted by λ , is the factor by which the eigenvector is scaled.
An eigenvector, corresponding to a real nonzero eigenvalue, points in a direction in which it is stretched by the transformation and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed.
Each eigenvalue λ is obtained by solving the characteristic equation det(A − λI) = 0 of the covariance matrix A, and the corresponding eigenvector ν is then found by solving (A − λI)ν = 0 for that value of λ.
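A small NumPy sketch of this step (my own aside): np.linalg.eigh is well suited to a symmetric matrix such as the covariance matrix and returns the eigenvalues in ascending order.

```python
import numpy as np

# Rebuild the covariance matrix exactly from the standardized data
X = np.array([[1, 2, 4], [4, 1, 2], [5, 4, 8]], dtype=float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_matrix = np.cov(X_std, rowvar=False, bias=True)

# eigh returns (eigenvalues, eigenvectors); each column of `eigenvectors`
# is the unit eigenvector for the eigenvalue in the same position
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
print(np.round(eigenvalues, 2))  # approximately [0, 0.48, 1.52], ascending
```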
Sort Eigenvalues to their corresponding Eigenvectors
Here, the eigenvalues are sorted in descending order and each eigenvector stays paired with its eigenvalue: every column in the table below lists the eigenvector ν for the eigenvalue λ in its header. A short sorting sketch follows the table.
| λ = 1.52 | λ = 0.48 | λ = 0 |
|----------|----------|-------|
| 0.42     | 0.91     | 0     |
| 0.64     | -0.30    | -0.71 |
| 0.64     | -0.30    | 0.71  |
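Sorting is just an argsort over the eigenvalues; a sketch (my own, hard-coding the values found above for readability):

```python
import numpy as np

# Eigenvalues (ascending, as returned by eigh) and eigenvectors as columns
eigenvalues = np.array([0.0, 0.48, 1.52])
eigenvectors = np.array([[ 0.00,  0.91, 0.42],
                         [-0.71, -0.30, 0.64],
                         [ 0.71, -0.30, 0.64]])

# Reorder both in descending order of eigenvalue, keeping each pair intact
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]    # [1.52, 0.48, 0.0]
eigenvectors = eigenvectors[:, order]
print(eigenvectors[:, 0])           # eigenvector of the largest eigenvalue: [0.42 0.64 0.64]
```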
Pick k Eigenvalues and form a matrix of Eigenvectors
Matrix_EV (Matrix of Eigenvectors)
| v1   | v2    | v3    |
|------|-------|-------|
| 0.42 | 0.91  | 0     |
| 0.64 | -0.30 | -0.71 |
| 0.64 | -0.30 | 0.71  |
Now, Matrix_SD * Matrix_EV = Matrix_PCA
Thus, Matrix_PCA=
|          | PC1   | PC2   | PC3 |
|----------|-------|-------|-----|
| Sample 1 | -0.75 | -0.89 | 0   |
| Sample 2 | -0.98 | 0.82  | 0   |
| Sample 3 | 1.73  | 0.07  | 0   |
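A sketch of the projection step (illustrative, reusing the rounded Matrix_SD and Matrix_EV from above):

```python
import numpy as np

# Standardized data (Matrix_SD) and sorted eigenvector matrix (Matrix_EV)
X_std = np.array([[-1.12, -0.22, -0.22],
                  [ 0.32, -0.87, -0.87],
                  [ 0.80,  1.09,  1.09]])
eigvecs = np.array([[0.42,  0.91,  0.00],
                    [0.64, -0.30, -0.71],
                    [0.64, -0.30,  0.71]])

# Matrix_PCA = Matrix_SD * Matrix_EV (matrix multiplication)
X_pca = X_std @ eigvecs
print(np.round(X_pca, 2))   # matches the Matrix_PCA table above, up to ±0.01 rounding

# Keeping only the first k = 1 column gives the reduced, one-dimensional dataset
X_reduced = X_pca[:, :1]
```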
Transform the Original Matrix
If we only consider the first Principal Component, the original matrix reduces from 3 variables to just one while preserving most of the meaning of the data: PC1 alone carries about 76% of the total variance (1.52 out of 1.52 + 0.48 + 0), so very little information is lost and calculations become easier for any Machine Learning Algorithm. A short check of this ratio follows the table below.
Final Matrix
|          | Principal Component 1 |
|----------|-----------------------|
| Sample 1 | -0.75                 |
| Sample 2 | -0.98                 |
| Sample 3 | 1.73                  |
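A quick check of how much variance PC1 retains (my own aside, using the eigenvalues found earlier):

```python
import numpy as np

eigenvalues = np.array([1.52, 0.48, 0.0])   # sorted eigenvalues from above

# Each eigenvalue measures the variance captured by its principal component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(np.round(explained_variance_ratio, 2))  # [0.76 0.24 0.  ]
```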
PCA using Sk-learn Library in Python
In Sk-learn, we have to provide only two things: the standardized data and n_components.
Here, n_components can be at most min(n_samples, n_features); if it is not specified, all min(n_samples, n_features) components are kept by default.
It is a lot easier using Sk-learn, which is why many do not know the math behind it. You can find the code snippet on my GitHub; a rough sketch is shown below.
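This is not the exact snippet from the repo, just a minimal sketch of what the Sk-learn version typically looks like for the same dummy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Same dummy dataset as above: rows are samples, columns are Features A, B, C
X = np.array([[1, 2, 4],
              [4, 1, 2],
              [5, 4, 8]], dtype=float)

# Step 1: standardize the data
X_std = StandardScaler().fit_transform(X)

# Step 2: fit PCA, keeping only the first principal component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_std)

print(X_reduced)                      # one column per sample; sign and scale may differ slightly from the manual table
print(pca.explained_variance_ratio_)  # roughly [0.76]
```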