Principal Component Analysis (PCA) in R
In this tutorial you’ll learn how to perform a Principal Component Analysis (PCA) in R.
The table of content is structured as follows:
Let’s take a look at the procedure in R.
Example Data & Add-On Packages
For this tutorial, we will use the biopsy data from the MASS package. This is a breast cancer database obtained from the University of Wisconsin Hospitals, Dr. William H. Wolberg. He assessed biopsies of breast tumors for 699 patients.
In order to use this database, we will need to install the MASS package first, as follows:
install.packages("MASS") |
install.packages("MASS")
In order to visualize our data, we will use the factoextra and the ggfortify packages. You can do it using this code below:
install.packages("factoextra") install.packages("ggfortify") |
install.packages("factoextra") install.packages("ggfortify")
Next, load the packages:
library(MASS) library(factoextra) library(ggfortify) |
library(MASS) library(factoextra) library(ggfortify)
As shown below, the biopsy data contains 699 observations of 11 variables:
data(biopsy) str(biopsy) #'data.frame': 699 obs. of 11 variables: # $ ID : chr "1000025" "1002945" "1015425" "1016277" ... # $ V1 : int 5 5 3 6 4 8 1 2 2 4 ... # $ V2 : int 1 4 1 8 1 10 1 1 1 2 ... # $ V3 : int 1 4 1 8 1 10 1 2 1 1 ... # $ V4 : int 1 5 1 1 3 8 1 1 1 1 ... # $ V5 : int 2 7 2 3 2 7 2 2 2 2 ... # $ V6 : int 1 10 2 4 1 10 10 1 1 1 ... # $ V7 : int 3 3 3 3 3 9 3 3 1 2 ... # $ V8 : int 1 2 1 7 1 7 1 1 1 1 ... # $ V9 : int 1 1 1 1 1 1 1 1 5 1 ... # $ class: Factor w/ 2 levels "benign", # "malignant": 1 1 1 1 1 2 1 1 1 1 ... |
data(biopsy) str(biopsy) #'data.frame': 699 obs. of 11 variables: # $ ID : chr "1000025" "1002945" "1015425" "1016277" ... # $ V1 : int 5 5 3 6 4 8 1 2 2 4 ... # $ V2 : int 1 4 1 8 1 10 1 1 1 2 ... # $ V3 : int 1 4 1 8 1 10 1 2 1 1 ... # $ V4 : int 1 5 1 1 3 8 1 1 1 1 ... # $ V5 : int 2 7 2 3 2 7 2 2 2 2 ... # $ V6 : int 1 10 2 4 1 10 10 1 1 1 ... # $ V7 : int 3 3 3 3 3 9 3 3 1 2 ... # $ V8 : int 1 2 1 7 1 7 1 1 1 1 ... # $ V9 : int 1 1 1 1 1 1 1 1 5 1 ... # $ class: Factor w/ 2 levels "benign", # "malignant": 1 1 1 1 1 2 1 1 1 1 ...
We can also see that there’s a categorical variable, “class”, with two levels, and a variable defining the IDs. As PCA works better with only numerical data, we will exclude these variables from the data set before we start. We will also need to exclude the NA values by using the na.omit() function:
data_biopsy <- na.omit(biopsy[,-c(1,11)]) head(data_biopsy) |
data_biopsy <- na.omit(biopsy[,-c(1,11)]) head(data_biopsy)
Note that we may also substitute the missing values using missing data imputation techniques. However, for the sake of this tutorial it should be sufficient to simply remove them from our data.
Now, we’re ready to start.
Step 1: Calculate the Principal Components
Once we have loaded our data, the first step is to calculate the principal components in our data set. To accomplish this, we will use the prcomp() function in R:
biopsy_pca <- prcomp(data_biopsy, scale = TRUE) |
biopsy_pca <- prcomp(data_biopsy, scale = TRUE)
The “scale = TRUE” argument allows us to make sure that each of the variables in the biopsy data is scaled to have a mean of 0 and a standard deviation of 1 before calculating the principal components. Let’s see the elements of our PCA:
names(biopsy_pca) # [1] "sdev" "rotation" "center" "scale" "x" |
names(biopsy_pca) # [1] "sdev" "rotation" "center" "scale" "x"
The “sdev” element corresponds to the standard deviation of the principal components, and the “center” and “scale” elements provide information about the original means and standard deviations of the variables before the transformation to principal components. The “$x” element allows us to see the principal component scores.
The columns of the “rotation” element show the weights that are used in the linear transformation of the original variables to the principal components. In other words, each column represents an eigenvector.
biopsy_pca$rotation # PC1 PC2 PC3 PC4 PC5 PC6 PC7 # V1 -0.3020626 -0.14080053 0.866372452 -0.10782844 0.08032124 -0.24251752 -0.008515668 # V2 -0.3807930 -0.04664031 -0.019937801 0.20425540 -0.14565287 -0.13903168 -0.205434260 # V3 -0.3775825 -0.08242247 0.033510871 0.17586560 -0.10839155 -0.07452713 -0.127209198 # V4 -0.3327236 -0.05209438 -0.412647341 -0.49317257 -0.01956898 -0.65462877 0.123830400 # V5 -0.3362340 0.16440439 -0.087742529 0.42738358 -0.63669325 0.06930891 0.211018210 # V6 -0.3350675 -0.26126062 0.000691478 -0.49861767 -0.12477294 0.60922054 0.402790095 # V7 -0.3457474 -0.22807676 -0.213071845 -0.01304734 0.22766572 0.29889733 -0.700417365 # V8 -0.3355914 0.03396582 -0.134248356 0.41711347 0.69021015 0.02151820 0.459782742 # V9 -0.2302064 0.90555729 0.080492170 -0.25898781 0.10504168 0.14834515 -0.132116994 # PC8 PC9 # V1 0.24770729 -0.002747438 # V2 -0.43629981 -0.733210938 # V3 -0.58272674 0.667480798 # V4 0.16343403 0.046019211 # V5 0.45866910 0.066890623 # V6 -0.12665288 -0.076510293 # V7 0.38371888 0.062241047 # V8 0.07401187 -0.022078692 # V9 -0.05353693 0.007496101 |
biopsy_pca$rotation # PC1 PC2 PC3 PC4 PC5 PC6 PC7 # V1 -0.3020626 -0.14080053 0.866372452 -0.10782844 0.08032124 -0.24251752 -0.008515668 # V2 -0.3807930 -0.04664031 -0.019937801 0.20425540 -0.14565287 -0.13903168 -0.205434260 # V3 -0.3775825 -0.08242247 0.033510871 0.17586560 -0.10839155 -0.07452713 -0.127209198 # V4 -0.3327236 -0.05209438 -0.412647341 -0.49317257 -0.01956898 -0.65462877 0.123830400 # V5 -0.3362340 0.16440439 -0.087742529 0.42738358 -0.63669325 0.06930891 0.211018210 # V6 -0.3350675 -0.26126062 0.000691478 -0.49861767 -0.12477294 0.60922054 0.402790095 # V7 -0.3457474 -0.22807676 -0.213071845 -0.01304734 0.22766572 0.29889733 -0.700417365 # V8 -0.3355914 0.03396582 -0.134248356 0.41711347 0.69021015 0.02151820 0.459782742 # V9 -0.2302064 0.90555729 0.080492170 -0.25898781 0.10504168 0.14834515 -0.132116994 # PC8 PC9 # V1 0.24770729 -0.002747438 # V2 -0.43629981 -0.733210938 # V3 -0.58272674 0.667480798 # V4 0.16343403 0.046019211 # V5 0.45866910 0.066890623 # V6 -0.12665288 -0.076510293 # V7 0.38371888 0.062241047 # V8 0.07401187 -0.022078692 # V9 -0.05353693 0.007496101
To see how much variance is accounted for by each principal component, the summary() function is employed as follows:
summary(biopsy_pca) # Importance of components: # PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 # Standard deviation 2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259 0.51062 0.29729 # Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271 0.02897 0.00982 # Cumulative Proportion 0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121 0.99018 1.00000 |
summary(biopsy_pca) # Importance of components: # PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 # Standard deviation 2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259 0.51062 0.29729 # Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271 0.02897 0.00982 # Cumulative Proportion 0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121 0.99018 1.00000
As given in the Proportion of Variance row, the first principal component explains around 65% of the total variance, which corresponds to two-thirds of the information in the data set.
The same variance information can also be retrieved by using the standard deviations of the principal components. Check the code below and compared it with the previous table.
biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2) # [1] 0.655499928 0.086216321 0.059916916 0.051069717 0.042252870 # [6] 0.033541828 0.032711413 0.028970651 0.009820358 |
biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2) # [1] 0.655499928 0.086216321 0.059916916 0.051069717 0.042252870 # [6] 0.033541828 0.032711413 0.028970651 0.009820358
Step 2: Visualize the Explained Variance
We can visualize the percentage of explained variance (i.e. the eigenvalues) to see how this first principal component explains a considerable proportion of variance:
fviz_eig(biopsy_pca, addlabels = TRUE, ylim = c(0, 70)) |
fviz_eig(biopsy_pca, addlabels = TRUE, ylim = c(0, 70))
Step 3: Plot the PCA Results
In order to visualize the principal component scores and loading vectors, we can create a biplot, which is a kind of scatterplot. In this plot, the first and the second principal components are shown on the axes:
autoplot(biopsy_pca, data = data_biopsy, loadings = TRUE) |
autoplot(biopsy_pca, data = data_biopsy, loadings = TRUE)
We can visualize how the observations (named as individuals on the plot) with similar (original) variable values are grouped together:
fviz_pca_ind(biopsy_pca, col.ind = "cos2", gradient.cols = c("red","yellow","green3","royalblue"), repel = TRUE) |
fviz_pca_ind(biopsy_pca, col.ind = "cos2", gradient.cols = c("red","yellow","green3","royalblue"), repel = TRUE)
The col.ind = "cos2"
argument indicates the colors used to represent the quality of representation by the principal components. The argument "repel = TRUE"
avoids overlaid text labels.
Next, we can visualize the loading vectors and see how the variables are correlated:
fviz_pca_var(biopsy_pca, col.var = "contrib", gradient.cols = c("red","gold","green3","royalblue")) |
fviz_pca_var(biopsy_pca, col.var = "contrib", gradient.cols = c("red","gold","green3","royalblue"))
The col.var = “contrib” argument indicates the colors used to represent the level of contribution of the variables to the principal components. The variables that are highly correlated are closely located on the plot. See our tutorial for further information on interpreting the loading vectors.
Video, Further Resources & Summary
Do you need more explanations on how to perform a PCA in R? Then you should have a look at the following YouTube video of the Statistics Globe YouTube channel.
The YouTube video will be added soon.
Furthermore, you could have a look at some of the other tutorials on Statistics Globe:
- Simulate Bivariate & Multivariate Normal Distribution in R
- Small Area Estimation Techniques
- Statistical Methods for Data Analysis
This post has shown how to perform a PCA in R. In case you have further questions, you may leave a comment below.
This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.
Statistics Globe Newsletter
4 Comments. Leave new
Very nice!
Hi Barry,
Thanks for the kind feedback, hope the tutorial was helpful!
Regards,
Matthias
Sir, my question is that how we can create the data set with no column name of the first column as in the below data set, and second what should be the structure of data set for PCA analysis?
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Hello Khurram,
If you would like to ignore the column names, you can write rownames(df) <- NULL. Only the continuous variables should include in the analysis and the data should be in a wide format. If you have any further questions, do not hesitate to contact us. Regards, Cansu