Choose Optimal Number of Components for PCA (Theory & R Example)

 

In this tutorial, you’ll learn how to choose the number of components for a Principal Component Analysis (PCA).

I’ll explain theoretically why a certain number of components might be the best, and I’ll demonstrate how to apply this theory based on a reproducible example using the R programming language.

The tutorial is structured as follows:

Let’s take a look.

 

Add-On Libraries, Sample Data and PCA

For this tutorial, we will use two add-on packages that you may need to install first: factoextra and syndRomics. You can install them with the following code:

install.packages("factoextra")
install.packages("remotes")
remotes::install_github("ucsf-ferguson-lab/syndRomics")

Next, load the libraries:

library(factoextra)
library(syndRomics)

For this example, we will be using the mtcars dataset. We will perform a PCA on it first:

data(mtcars)
pca_mtcars <- prcomp(mtcars, 
                     center = TRUE, 
                     scale. = TRUE)

We can inspect the results of our PCA using the summary() function:

summary(pca_mtcars)
 
# Importance of components:
#                           PC1    PC2     PC3     PC4     PC5
# Standard deviation     2.5707 1.6280 0.79196 0.51923 0.47271
# Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031
# Cumulative Proportion  0.6008 0.8417 0.89873 0.92324 0.94356
#                            PC6    PC7     PC8    PC9    PC10
# Standard deviation     0.46000 0.3678 0.35057 0.2776 0.22811
# Proportion of Variance 0.01924 0.0123 0.01117 0.0070 0.00473
# Cumulative Proportion  0.96279 0.9751 0.98626 0.9933 0.99800
#                          PC11
# Standard deviation     0.1485
# Proportion of Variance 0.0020
# Cumulative Proportion  1.0000

Before getting started, let's take a closer look at what principal components actually are.

 

What are the Principal Components?

When running a PCA on a data frame, the number of variables in the data is reduced. To achieve this, new variables are constructed as linear combinations of the initial variables: the principal components.

Thus, we can define the principal components as normalized linear combinations of the original variables in a data frame. Most of the information in the original variables is compressed into the first components, and the principal components are uncorrelated with each other.
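To make this concrete, here is a small base-R check (using only prcomp() and the mtcars data from our example) that the component scores really are linear combinations of the scaled original variables, and that the components are uncorrelated:

```r
# The scores returned by prcomp() equal the centered/scaled data
# multiplied by the rotation (loadings) matrix, i.e. each component
# is a linear combination of the scaled original variables.
data(mtcars)
pca_mtcars <- prcomp(mtcars, center = TRUE, scale. = TRUE)

scores_manual <- scale(mtcars) %*% pca_mtcars$rotation
all.equal(scores_manual, pca_mtcars$x, check.attributes = FALSE)  # TRUE

# The principal components are uncorrelated: all off-diagonal
# correlations between the score columns are numerically zero.
cors <- cor(pca_mtcars$x)
max(abs(cors[upper.tri(cors)]))  # practically 0
```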

 

How to Choose the Principal Components

There isn’t a general method that works in every situation to choose the principal components for a PCA. In reality, the right way to select the principal components depends on the situation and on the reason the PCA is being computed. Let’s go through the most common scenarios:

1. You Want to Visualize the Data

Computing a PCA can be very useful to visualize our data, especially when it is applied to a high-dimensional data frame.

Thus, if visualization is your main intention, you can select the first two or the first three principal components, depending on the type of plot you would like to create: two principal components are enough to visualize your data in a 2D plot, while a 3D plot requires the first three principal components.
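As a minimal illustration in base R (no plotting packages assumed), the first two score columns of our example PCA can be plotted directly; the percentages in the axis labels are taken from the summary() output shown earlier:

```r
# Scatterplot of the first two principal component scores,
# labeled with the car names (base R graphics only).
data(mtcars)
pca_mtcars <- prcomp(mtcars, center = TRUE, scale. = TRUE)

plot(pca_mtcars$x[, 1], pca_mtcars$x[, 2],
     xlab = "PC1 (60.1%)", ylab = "PC2 (24.1%)",
     main = "mtcars: first two principal components")
text(pca_mtcars$x[, 1], pca_mtcars$x[, 2],
     labels = rownames(mtcars), pos = 3, cex = 0.6)
```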

2. You Would like to Reduce the Number of Features in Your Data Frame

If your aim when running the PCA is to reduce the number of features in your data frame, you have different possibilities. Let’s see:

2.1. Set a Threshold of Explained Variance

A common way to select the number of principal components is to set a threshold of explained variance, for example, 80%, and then select the smallest number of principal components that reaches this cumulative explained variance. For our example PCA, we can take a look at the cumulative explained variance using the get_eig() function from the factoextra package:

get_eig(pca_mtcars)

[Figure: cumulative explained variance table for the mtcars PCA]

In the “cumulative variance percent” column, we can see the cumulative sum of explained variance: the first two principal components already exceed the 80 percent threshold (84.17%). Therefore, we would pick the first two components using this method.

This is a subjective method of selecting the principal components, since the threshold itself is arbitrary, so it may not be the most effective way to remove noise.
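The same threshold selection can also be computed directly in base R, without factoextra, from the component standard deviations:

```r
# Cumulative explained variance from the prcomp() output and the
# smallest number of components reaching an 80% threshold.
data(mtcars)
pca_mtcars <- prcomp(mtcars, center = TRUE, scale. = TRUE)

explained     <- pca_mtcars$sdev^2 / sum(pca_mtcars$sdev^2)
cum_explained <- cumsum(explained)
n_components  <- which(cum_explained >= 0.8)[1]
n_components  # 2: the first two components explain 84.17%
```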

2.2. Use Kaiser’s Rule

Kaiser’s rule keeps all the components with eigenvalues greater than 1. Looking at the “eigenvalue” column of the same get_eig() output we obtained previously, we can determine how many eigenvalues are greater than 1:

get_eig(pca_mtcars)

[Figure: eigenvalue table for the mtcars PCA]

We can see that only the first two eigenvalues are greater than 1, so according to this rule we would keep the first two components.
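Kaiser’s rule can likewise be applied in base R: the eigenvalues of a scaled PCA are the squared standard deviations of the components.

```r
# Eigenvalues from the prcomp() output and the number of components
# retained by Kaiser's rule (eigenvalue > 1).
data(mtcars)
pca_mtcars <- prcomp(mtcars, center = TRUE, scale. = TRUE)

eigenvalues <- pca_mtcars$sdev^2
sum(eigenvalues > 1)  # 2 components retained
```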

2.3. Make a Scree Plot

To decide on the best number of principal components to select, we can create a scree plot: a plot of the eigenvalues (i.e., the variance explained by each component) in decreasing order.

fviz_eig(pca_mtcars, 
         addlabels = TRUE) + ylim(0, 70)

[Figure: scree plot for the mtcars PCA]

Usually, we would select all the components up to the point where the bend (the “elbow”) occurs in the scree plot. Here, this happens at the third component, so we would select the first three components.
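If factoextra is not available, base R ships a screeplot() function that produces an equivalent, if less polished, plot:

```r
# Base-R scree plot of the component variances (eigenvalues).
data(mtcars)
pca_mtcars <- prcomp(mtcars, center = TRUE, scale. = TRUE)
screeplot(pca_mtcars, type = "lines", main = "Scree plot, mtcars PCA")
```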

2.4. Use a Permutation Test

A permutation test is a statistical tool for constructing sampling distributions. It consists of generating n new datasets by randomly shuffling our original dataset.

The process would be something like this:

1. Once you’ve run the PCA on your data, save the explained variance by each component.

2. Define a number of permutations (e.g., 1000) and, for each permutation, shuffle the values within the columns of your dataset.

3. Run the PCA again on each permuted dataset and save the explained variance of each component.

4. Compare the explained variance from the permuted datasets with that of the original dataset.
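The steps above can be sketched in base R. Note that the column-wise shuffling and the simple p-value formula below are our own illustrative choices, not necessarily the exact procedure implemented by the syndRomics package used later:

```r
# Permutation test sketch: shuffle each column independently to break
# the correlation structure, redo the PCA, and compare the explained
# variance of each component with the observed one.
data(mtcars)
pca_mtcars <- prcomp(mtcars, center = TRUE, scale. = TRUE)
observed <- pca_mtcars$sdev^2 / sum(pca_mtcars$sdev^2)

set.seed(123)
P <- 100  # 1000 in the tutorial; fewer here to keep it fast
perm_var <- replicate(P, {
  perm_data <- apply(mtcars, 2, sample)  # shuffle within each column
  perm_pca  <- prcomp(perm_data, center = TRUE, scale. = TRUE)
  perm_pca$sdev^2 / sum(perm_pca$sdev^2)
})

# Per-component p-value: share of permutations whose explained
# variance is at least as large as the observed one.
p_values <- rowMeans(perm_var >= observed)
round(p_values[1:3], 3)
```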

We expect the explained variance of the original data to be greater than the explained variances obtained from the permuted data. Let’s see an example of this method in R using the example PCA for the mtcars dataset. We will do 1000 permutations:

Permutation Test Using R

To perform a permutation test in the R programming language, we will use the permut_pc_test() function from the syndRomics package. This function implements a non-parametric permutation test for loadings:

pca_mtcars_perm <- permut_pc_test(pca = pca_mtcars, 
                                  pca_data = mtcars, 
                                  ndim = 3, 
                                  P = 1000)

Let’s see the results of this permutation test:

pca_mtcars_perm$results

[Output: permutation test results table for the mtcars PCA]

The permutation test results show that the first two principal components are significant, with p-values smaller than 0.05.

We can also plot the results and see how they look:

plot(pca_mtcars_perm, 
     plot_resample = TRUE)

[Figure: permutation test plot for the mtcars PCA]

This graphic also shows that the first two principal components in our PCA explain more variance than the ones generated from the permuted data. Thus, if we chose this method to select the optimal number of principal components, we would keep two components.

As shown, there are multiple ways to choose the optimal number of components to explain the variance in a dataset. It’s important to try different methods and select the one that best fits the specific case at hand.

 

Video, Further Resources & Summary

Do you need more explanations on how to choose the number of components for a PCA? Then you should have a look at the following YouTube video from the Statistics Globe YouTube channel.

 

The YouTube video will be added soon.

 

Moreover, there are some other tutorials on Statistics Globe you might be interested in:

This post has shown how to choose the optimal number of components for a PCA. In case you have further questions, you may leave a comment below.

 

Paula Villasante Soriano Statistician & R Programmer

This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get further information about her academic background and the other articles she has written for Statistics Globe.

 
