Advantages & Disadvantages of Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a statistical method that allows us to simplify the complexity of our data: a large number of features (variables) can be summarized by a small number of principal components. Nevertheless, this procedure has its pros and its cons.
In this tutorial, you’ll learn about the advantages and disadvantages of the PCA method.
Let’s see what the advantages and disadvantages of PCA are!
Advantages of PCA
Performing a PCA can be a very good idea if we aim to extract the important features from our large data set. Take a look at some of the advantages of PCA:
Prevents Overfitting
One of the main issues when analyzing a high-dimensional data set is overfitting: the risk is highest when the number of variables is large relative to the number of observations. Using PCA to lower the dimensions of the data set can reduce the risk of such an overfit.
Removes Correlated Features
Multicollinearity, or high correlation between independent variables, can make it difficult to determine the effect of individual variables on the predicted outcome. Principal components are uncorrelated with each other by construction, so using them as predictors instead of the original variables removes the multicollinearity from the model.
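The following is a minimal sketch of this property, not a definitive implementation. It uses NumPy only and simulated data: the variables x1, x2 and x3 are hypothetical, and x2 is deliberately built as a noisy copy of x1. The PCA is computed by an eigendecomposition of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x2 is almost a copy of x1, x3 is independent noise.
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Center the data and eigendecompose its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reorder to descending.
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Principal component scores: the data expressed in the new axes.
scores = Xc @ eigvecs

corr_features = np.corrcoef(X, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)

print("Correlation of original features:")
print(np.round(corr_features, 2))
print("Correlation of principal component scores:")
print(np.round(corr_scores, 2))
```

The first matrix should show a strong correlation between x1 and x2, while the second should be the identity matrix up to rounding, because the scores are uncorrelated.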
Speeds Up Other Machine Learning Algorithms
When we feed a few principal components into a machine learning algorithm instead of all the original variables, the algorithm has fewer features to process. As a result, its training time will usually decrease, and iterative algorithms may converge faster.
Improves Visualization
Trying to understand and visualize a high-dimensional data set can be difficult. PCA helps us project high-dimensional data onto two or three dimensions, so we can visualize it much more easily. You can check our visualization tutorials: Visualisation of PCA in R and Visualisation of PCA in Python to see some examples.
Disadvantages of PCA
Using the Principal Component Analysis method can also have some disadvantages:
Data Standardization Is Needed
The PCA algorithm identifies the directions of largest variance. Because the variance of a variable is expressed in the squared units of that variable, variables measured on different scales should be standardized to a mean of 0 and a standard deviation of 1 before calculating the principal components. Otherwise, the variables with the largest scale would dominate the PCA. For further information, see PCA Using Correlation & Covariance Matrix.
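The effect of scale can be illustrated with a small sketch, assuming NumPy and simulated data. The variables height_m and income_eur and the helper first_pc are hypothetical and only serve the illustration. The two variables are generated independently, so neither should dominate once both are standardized.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two independent variables on very different scales.
n = 300
height_m = rng.normal(loc=1.7, scale=0.1, size=n)
income_eur = rng.normal(loc=30000, scale=8000, size=n)
X = np.column_stack([height_m, income_eur])


def first_pc(data):
    """Return the first principal direction and its share of the total variance."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0], s[0] ** 2 / np.sum(s ** 2)


# PCA on the raw data (equivalent to using the covariance matrix).
raw_loadings, raw_share = first_pc(X)

# PCA on standardized data (equivalent to using the correlation matrix).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_loadings, std_share = first_pc(Z)

print("Raw data, PC1 loadings:", np.round(raw_loadings, 4), "share:", round(raw_share, 4))
print("Standardized, PC1 loadings:", np.round(std_loadings, 4), "share:", round(std_share, 4))
```

On the raw data, the first component should point almost entirely along income_eur and absorb nearly all of the variance, simply because of its units. After standardization, both variables enter the first component with equal weight in absolute value.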
Loss of Information
Every principal component that we drop takes part of the variation in the data with it. If we keep too few components, so that the retained ones do not explain enough of the variation in the dataset, the loss of information can be substantial.
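One common way to quantify this trade-off is the cumulative share of explained variance. The sketch below assumes NumPy and simulated data with two hypothetical latent factors. The 90% threshold is an arbitrary choice for illustration, not a recommendation from this tutorial.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 6 observed variables driven by 2 latent factors plus noise.
n, p = 400, 6
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, p))
X = latent @ mixing + 0.3 * rng.normal(size=(n, p))

# PCA via the singular value decomposition of the centered data.
Xc = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(Xc, full_matrices=False)

# Share of the total variance explained by each component.
explained_ratio = s ** 2 / np.sum(s ** 2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.90  # assumed value, chosen only for this example
k = int(np.searchsorted(cumulative, threshold) + 1)

# Reconstruct the data from the first k components and measure what is lost.
X_hat = (u[:, :k] * s[:k]) @ vt[:k]
lost_share = np.sum((Xc - X_hat) ** 2) / np.sum(Xc ** 2)

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components kept:", k)
print("Share of variance lost:", round(float(lost_share), 3))
```

The share of variance lost by the reconstruction equals one minus the cumulative explained variance of the retained components, which makes the information loss explicit.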
Interpretation of Components
When we apply Principal Component Analysis to our data set, the original features are transformed into principal components, which are linear combinations of the original features. But which features are the most significant in the data set? This question can be difficult to answer after computing the PCA, because each component mixes several features. Biplots are usually helpful for this interpretation.
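Besides biplots, the loadings (the weights of each feature in each component) can be inspected directly. The following sketch assumes NumPy and simulated data. The feature names x1 to x4 are hypothetical: x1, x2 and x3 share a common signal, while x4 is generated independently.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: x1, x2, x3 share one signal, x4 is driven by another.
feature_names = ["x1", "x2", "x3", "x4"]
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
X = np.column_stack([
    a + 0.2 * rng.normal(size=n),
    a + 0.2 * rng.normal(size=n),
    a + 0.2 * rng.normal(size=n),
    b + 0.2 * rng.normal(size=n),
])

# Standardize, then run the PCA via the singular value decomposition.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, s, vt = np.linalg.svd(Z, full_matrices=False)

# Loadings: one row per feature, one column per principal component.
loadings = vt.T

# For the first two components, list the features by absolute weight.
for j in range(2):
    ranked = sorted(zip(feature_names, loadings[:, j]), key=lambda t: -abs(t[1]))
    summary = ", ".join(f"{name}: {weight:+.2f}" for name, weight in ranked)
    print(f"PC{j + 1} -> {summary}")
```

In this constructed example, the first component should be dominated by x1, x2 and x3, and the second by x4. The sign of a loading vector is arbitrary, so only the relative signs and sizes within a component are meaningful. With real data, the weights are rarely this clean, which is exactly why the interpretation can be difficult.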
Video, Further Resources & Summary
Do you need more explanations on the advantages and disadvantages of PCA? Then you should have a look at the following YouTube video on the Statistics Globe YouTube channel.
Furthermore, you could have a look at some of the other tutorials on Statistics Globe:
- What is PCA?
- PCA Using Correlation & Covariance Matrix
- Choose Optimal Number of Components for PCA
- Biplot for PCA Explained
- Visualization of PCA in Python
- Visualization of PCA in R
This post has shown the pros and cons of the Principal Component Analysis. In case you have further questions, you may leave a comment below.
This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.