Principal Component Analysis (PCA) in R

 

In this tutorial you’ll learn how to perform a Principal Component Analysis (PCA) in R.

The table of content is structured as follows:

Let’s take a look at the procedure in R.

 

Example Data & Add-On Packages

For this tutorial, we will use the biopsy data from the MASS package. This is a breast cancer database obtained from the University of Wisconsin Hospitals, Dr. William H. Wolberg. He assessed biopsies of breast tumors for 699 patients.

In order to use this database, we will need to install the MASS package first, as follows:

install.packages("MASS")

In order to visualize our data, we will use the factoextra and the ggfortify packages. You can do it using this code below:

install.packages("factoextra")
install.packages("ggfortify")

Next, load the packages:

library(MASS)
library(factoextra)
library(ggfortify)

As shown below, the biopsy data contains 699 observations of 11 variables:

data(biopsy)
str(biopsy)
 
#'data.frame':	699 obs. of  11 variables:
# $ ID   : chr  "1000025" "1002945" "1015425" "1016277" ...
# $ V1   : int  5 5 3 6 4 8 1 2 2 4 ...
# $ V2   : int  1 4 1 8 1 10 1 1 1 2 ...
# $ V3   : int  1 4 1 8 1 10 1 2 1 1 ...
# $ V4   : int  1 5 1 1 3 8 1 1 1 1 ...
# $ V5   : int  2 7 2 3 2 7 2 2 2 2 ...
# $ V6   : int  1 10 2 4 1 10 10 1 1 1 ...
# $ V7   : int  3 3 3 3 3 9 3 3 1 2 ...
# $ V8   : int  1 2 1 7 1 7 1 1 1 1 ...
# $ V9   : int  1 1 1 1 1 1 1 1 5 1 ...
# $ class: Factor w/ 2 levels "benign",
# "malignant": 1 1 1 1 1 2 1 1 1 1 ...

We can also see that there’s a categorical variable, “class”, with two levels, and a variable defining the IDs. As PCA works better with only numerical data, we will exclude these variables from the data set before we start. We will also need to exclude the NA values by using the na.omit() function:

data_biopsy <- na.omit(biopsy[,-c(1,11)])
head(data_biopsy)

Data_for_PCA_R

Note that we may also substitute the missing values using missing data imputation techniques. However, for the sake of this tutorial it should be sufficient to simply remove them from our data.

Now, we’re ready to start.

 

Step 1: Calculate the Principal Components

Once we have loaded our data, the first step is to calculate the principal components in our data set. To accomplish this, we will use the prcomp() function in R:

biopsy_pca <- prcomp(data_biopsy, 
                     scale = TRUE)

The “scale = TRUE” argument allows us to make sure that each of the variables in the biopsy data is scaled to have a mean of 0 and a standard deviation of 1 before calculating the principal components. Let’s see the elements of our PCA:

names(biopsy_pca)
 
# [1] "sdev"     "rotation" "center"   "scale"    "x"

The “sdev” element corresponds to the standard deviation of the principal components, and the “center” and “scale” elements provide information about the original means and standard deviations of the variables before the transformation to principal components. The “$x” element allows us to see the principal component scores.

The columns of the “rotation” element show the weights that are used in the linear transformation of the original variables to the principal components. In other words, each column represents an eigenvector.

biopsy_pca$rotation
 
#          PC1         PC2          PC3         PC4         PC5         PC6          PC7
# V1 -0.3020626 -0.14080053  0.866372452 -0.10782844  0.08032124 -0.24251752 -0.008515668
# V2 -0.3807930 -0.04664031 -0.019937801  0.20425540 -0.14565287 -0.13903168 -0.205434260
# V3 -0.3775825 -0.08242247  0.033510871  0.17586560 -0.10839155 -0.07452713 -0.127209198
# V4 -0.3327236 -0.05209438 -0.412647341 -0.49317257 -0.01956898 -0.65462877  0.123830400
# V5 -0.3362340  0.16440439 -0.087742529  0.42738358 -0.63669325  0.06930891  0.211018210
# V6 -0.3350675 -0.26126062  0.000691478 -0.49861767 -0.12477294  0.60922054  0.402790095
# V7 -0.3457474 -0.22807676 -0.213071845 -0.01304734  0.22766572  0.29889733 -0.700417365
# V8 -0.3355914  0.03396582 -0.134248356  0.41711347  0.69021015  0.02151820  0.459782742
# V9 -0.2302064  0.90555729  0.080492170 -0.25898781  0.10504168  0.14834515 -0.132116994
#          PC8          PC9
# V1  0.24770729 -0.002747438
# V2 -0.43629981 -0.733210938
# V3 -0.58272674  0.667480798
# V4  0.16343403  0.046019211
# V5  0.45866910  0.066890623
# V6 -0.12665288 -0.076510293
# V7  0.38371888  0.062241047
# V8  0.07401187 -0.022078692
# V9 -0.05353693  0.007496101

To see how much variance is accounted for by each principal component, the summary() function is employed as follows:

summary(biopsy_pca)
 
# Importance of components:
#                          PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9
# Standard deviation     2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259 0.51062 0.29729
# Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271 0.02897 0.00982
# Cumulative Proportion  0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121 0.99018 1.00000

As given in the Proportion of Variance row, the first principal component explains around 65% of the total variance, which corresponds to two-thirds of the information in the data set.

The same variance information can also be retrieved by using the standard deviations of the principal components. Check the code below and compared it with the previous table.

biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2)
 
# [1] 0.655499928 0.086216321 0.059916916 0.051069717 0.042252870
# [6] 0.033541828 0.032711413 0.028970651 0.009820358

 

Step 2: Visualize the Explained Variance

We can visualize the percentage of explained variance (i.e. the eigenvalues) to see how this first principal component explains a considerable proportion of variance:

fviz_eig(biopsy_pca, 
         addlabels = TRUE, 
         ylim = c(0, 70))

Principal Component Analysis

 

Step 3: Plot the PCA Results

In order to visualize the principal component scores and loading vectors, we can create a biplot, which is a kind of scatterplot. In this plot, the first and the second principal components are shown on the axes:

autoplot(biopsy_pca, 
         data = data_biopsy, 
         loadings = TRUE)

Principal Component Analysis

We can visualize how the observations (named as individuals on the plot) with similar (original) variable values are grouped together:

fviz_pca_ind(biopsy_pca,
             col.ind = "cos2",
             gradient.cols = c("red","yellow","green3","royalblue"),
             repel = TRUE)

Principal Component Analysis

The col.ind = "cos2" argument indicates the colors used to represent the quality of representation by the principal components. The argument "repel = TRUE" avoids overlaid text labels.

Next, we can visualize the loading vectors and see how the variables are correlated:

fviz_pca_var(biopsy_pca,
             col.var = "contrib", 
             gradient.cols = c("red","gold","green3","royalblue"))

Principal Component Analysis

The col.var = “contrib” argument indicates the colors used to represent the level of contribution of the variables to the principal components. The variables that are highly correlated are closely located on the plot. See our tutorial for further information on interpreting the loading vectors.

 

Video, Further Resources & Summary

Do you need more explanations on how to perform a PCA in R? Then you should have a look at the following YouTube video of the Statistics Globe YouTube channel.

 

The YouTube video will be added soon.

 

Furthermore, you could have a look at some of the other tutorials on Statistics Globe:

This post has shown how to perform a PCA in R. In case you have further questions, you may leave a comment below.

 

Paula Villasante Soriano Statistician & R Programmer

This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.

 

Subscribe to the Statistics Globe Newsletter

Get regular updates on the latest tutorials, offers & news at Statistics Globe.
I hate spam & you may opt out anytime: Privacy Policy.


4 Comments. Leave new

  • Very nice!

    Reply
  • Dr.Khurram Shahzad
    January 13, 2023 5:34 pm

    Sir, my question is that how we can create the data set with no column name of the first column as in the below data set, and second what should be the structure of data set for PCA analysis?

    mpg cyl disp hp drat wt qsec vs am gear carb
    Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
    Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
    Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
    Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
    Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

    Reply
    • Hello Khurram,

      If you would like to ignore the column names, you can write rownames(df) <- NULL. Only the continuous variables should include in the analysis and the data should be in a wide format. If you have any further questions, do not hesitate to contact us. Regards, Cansu

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
Top