Advantages & Disadvantages of Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a statistical method that allows us to reduce the complexity of our data: a large number of features can be summarized by a small number of principal components. Nevertheless, this procedure has its pros and its cons.
In this tutorial you’ll learn about the advantages and disadvantages of the PCA method.
Let’s see what the advantages and the disadvantages of the PCA are.
Advantages of the PCA
Performing a PCA can be a very good idea if we are aiming to extract the important features from our large data set. Take a look at some of the advantages of PCA:
The PCA can counteract the issues of a high-dimensional data set
One of the main issues when analyzing a high-dimensional data set is overfitting, which becomes more likely when the number of variables is large relative to the number of observations. Using PCA to lower the number of dimensions of the data set can reduce the risk of such an overfit.
Correlated features removed
This is the main characteristic of PCA: it replaces correlated features with a smaller set of components and thereby reduces a very large data set. This is very useful when we need to run an algorithm on our data and/or visualize it, because visualizing all of our original features would be very difficult.
This could be done manually, but it would take a lot of time and effort: we would have to inspect the correlations between all pairs of features ourselves, which is often practically impossible.
When we apply PCA to our data set, we get principal components that are uncorrelated with one another.
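As a rough illustration of this point, the following sketch compares the correlations of some original features with the correlations of their principal component scores. The data are simulated and purely hypothetical (the variables x1, x2 and x3 are assumptions made for this example), and the PCA is computed with a plain NumPy singular value decomposition rather than a dedicated PCA function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations, 3 features; x2 is almost a copy of x1.
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Standardize each feature (mean 0, standard deviation 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The rows of Vt are the principal axes; projecting Z onto them gives the scores.
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T

corr_features = np.corrcoef(X, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)

print(np.round(corr_features, 2))  # large off-diagonal entry for x1 and x2
print(np.round(corr_scores, 2))    # off-diagonal entries are zero up to rounding
```

The first matrix shows the strong correlation between x1 and x2, while the correlation matrix of the scores is the identity matrix up to rounding error.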
Speeds up other machine learning algorithms
When we use the principal components of the data set instead of all the original variables as input for a machine learning algorithm, the algorithm has fewer features to process. As a result, the training time usually decreases, and some algorithms converge faster.
Improves visualization
Trying to understand and visualize a high-dimensional data set can be difficult. PCA helps us to transform our high-dimensional data into a low-dimensional representation, typically two or three components, which is much easier to plot.
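The following sketch shows what this reduction might look like in practice. It assumes a simulated data set with 10 features that are driven by 2 hidden factors (the data, the names factors and mixing, and the noise level are all assumptions for this example) and projects it onto the first two principal components with NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 150 observations of 10 features driven by 2 hidden factors.
factors = rng.normal(size=(150, 2))
mixing = np.repeat(np.eye(2), 5, axis=1)  # features 0-4 follow factor 1, 5-9 factor 2
X = factors @ mixing + 0.1 * rng.normal(size=(150, 10))

# Standardize, then compute the principal axes via SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Keep the first two principal components: one (x, y) point per observation.
scores_2d = Z @ Vt[:2].T

# Share of the total variance that the two plotted dimensions retain.
share_2d = (s[:2] ** 2).sum() / (s ** 2).sum()

print(scores_2d.shape)
print(round(share_2d, 3))
```

The two columns of scores_2d can be passed to any scatter plot function. How much of the structure survives in the plot depends on share_2d, which is high here only because the simulated data were built from two factors.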
Disadvantages of the PCA
Using the Principal Component Analysis method can also have some disadvantages in our analysis:
Data normalization required before performing the PCA
The PCA algorithm identifies the directions in which the variance of the data is largest. Because the variance of a variable is measured in the squared units of that variable, it depends on the variable's scale. Therefore, all variables should be standardized to a mean of 0 and a standard deviation of 1 before the principal components are calculated. Otherwise, the variables with the largest scales would dominate the PCA.
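To make the effect of the scale visible, the sketch below computes the first principal axis of two simulated features, once on the raw data and once after standardization. The data and the helper function first_axis are hypothetical and were written only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: two related features on very different scales.
small = rng.normal(size=500)                          # values around -3 to 3
large = 1000 * (0.5 * small + rng.normal(size=500))   # values in the thousands
X = np.column_stack([small, large])

def first_axis(data):
    """Return the first principal axis (a unit vector) of the data."""
    centered = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[0]

pc1_raw = first_axis(X)                   # PCA on the raw data
pc1_std = first_axis(X / X.std(axis=0))   # PCA after scaling to standard deviation 1

print(np.round(pc1_raw, 3))  # almost all weight on the large-scale feature
print(np.round(pc1_std, 3))  # both features contribute equally
```

On the raw data, the first component is essentially the large-scale feature alone. After standardization, both features receive the same absolute weight. Note that the sign of a principal axis is arbitrary, so the printed weights may be positive or negative.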
We may lose some valuable information
Keeping only some of the principal components discards the variance that is captured by the remaining ones. Therefore, using the Principal Component Analysis can lead to a loss of valuable information if we select too few principal components for our data set.
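One common way to quantify this loss is the share of variance explained by each component. The sketch below uses simulated data and an assumed target of 95% explained variance (both are assumptions for this example, not general recommendations) to pick the smallest number of components that reaches the target.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 300 observations of 8 features driven by 3 hidden factors.
factors = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X = factors @ mixing + 0.3 * rng.normal(size=(300, 8))

# Standardize and compute the singular values only.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
s = np.linalg.svd(Z, compute_uv=False)

# Squared singular values are proportional to the variance of each component.
explained = s ** 2 / np.sum(s ** 2)
cumulative = np.cumsum(explained)

threshold = 0.95  # assumed target, not a universal rule
k = int(np.searchsorted(cumulative, threshold)) + 1  # smallest k reaching the target
lost = 1.0 - cumulative[k - 1]                       # share of variance discarded

print(np.round(cumulative, 3))
print(k, round(lost, 3))
```

The value lost is the share of the total variance that is dropped when only the first k components are kept. Whether that share contains valuable information depends on the data and the task.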
Principal components may be difficult to interpret
When we apply the Principal Component Analysis to our data set, the original features are transformed into principal components: linear combinations of the features of the original data. But which features, variables or characteristics are the most significant in the data set? This question can be difficult to answer after computing the PCA.
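A possible starting point for interpretation is to look at the weights (often called loadings) that each original feature receives in a component. The sketch below does this for the first principal component of a simulated data set. The feature names a, b, c and d and the data themselves are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
names = ["a", "b", "c", "d"]  # hypothetical feature names

f = rng.normal(size=300)  # hidden factor shared by some of the features
X = np.column_stack([
    f + 0.1 * rng.normal(size=300),   # a: follows the hidden factor closely
    f + 0.1 * rng.normal(size=300),   # b: follows the hidden factor closely
    f + 2.0 * rng.normal(size=300),   # c: follows it only weakly
    rng.normal(size=300),             # d: unrelated noise
])

Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)

# Vt[0] holds the weight of each original feature in the first component.
loadings_pc1 = dict(zip(names, Vt[0]))

# Rank the features by the absolute size of their weight.
ranked = sorted(names, key=lambda n: abs(loadings_pc1[n]), reverse=True)
for n in ranked:
    print(f"{n}: {loadings_pc1[n]:+.2f}")
```

Even in this small example, the first component mixes several features, and the sign of the weights is arbitrary. With many features and several components, reading the weights becomes considerably harder.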
Video, Further Resources & Summary
Do you need more explanations on the advantages and disadvantages of the PCA? Then you should have a look at the following YouTube video of the Statistics Globe YouTube channel.
The YouTube video will be added soon.
Furthermore, you could have a look at some of the other tutorials on Statistics Globe:
- Learn R Programming (Tutorial & Examples)
- R NA – What are <NA> Values?
- Cross-Validation Explained (Example)
This post has shown the pros and cons of the Principal Component Analysis. In case you have further questions, you may leave a comment below.
This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.