Biplot for PCA Explained

In the context of a Principal Component Analysis (PCA), a biplot is a scatterplot that shows the observations as points and the original variables as vectors in the space of two principal components, usually the first two. But how do you interpret it?

 

In this tutorial you’ll learn how to understand the biplot for PCA using the R programming language.

Let’s dive right in.

 

Load the Libraries and the Example Data

Before we start, we need the two packages used in this tutorial: factoextra and ggfortify. If you haven’t installed them yet, please install them now:

install.packages("factoextra")
install.packages("ggfortify")

Next, load the libraries:

library(factoextra)
library(ggfortify)

Now we can start. For this tutorial, we will be using the iris dataset:

data(iris)
head(iris)

[Table: first six rows of the iris dataset]

The full dataset contains 150 rows and 5 variables, but head() shows only the first six rows.

 

Perform the PCA and Get the Loadings

PCA requires numeric input, so we separate the four numeric measurements from the Species column before performing the PCA:

iris_data <- iris[,-5]
iris_species <- iris[,5]
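
If the position of the non-numeric column is not known in advance, the same split can be sketched by selecting columns by type instead of by index. The object names numeric_cols and iris_numeric are illustrative and not part of the tutorial's own code:

```r
data(iris)

# Flag the numeric columns
# (assumption: every numeric column should enter the PCA)
numeric_cols <- sapply(iris, is.numeric)

# Keep only the numeric measurements
iris_numeric <- iris[, numeric_cols]

# Compare with dropping the fifth column by index
all.equal(iris_numeric, iris[, -5])
```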

Now we can use the prcomp() function to perform the PCA on the numerical dataset:

iris_pca <- prcomp(iris_data,
             center = TRUE,
             scale. = TRUE)
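
With center = TRUE and scale. = TRUE, each variable is standardized to mean 0 and standard deviation 1 before the components are computed, so the PCA is effectively run on the correlation matrix. A minimal check of this equivalence, using only base R (the names eig_values and pc_variances are illustrative):

```r
data(iris)
iris_data <- iris[, -5]
iris_pca <- prcomp(iris_data, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix of the original variables
eig_values <- eigen(cor(iris_data))$values

# Variances of the principal components (squared standard deviations)
pc_variances <- iris_pca$sdev^2

# The two should agree up to numerical tolerance
all.equal(pc_variances, eig_values)
```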

We can see a summary for our PCA:

summary(iris_pca)
 
# Importance of components:
#                           PC1    PC2     PC3     PC4
# Standard deviation     1.7084 0.9560 0.38309 0.14393
# Proportion of Variance 0.7296 0.2285 0.03669 0.00518
# Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
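
The loadings themselves are stored in the rotation element of the prcomp object, and the proportions in the summary can be reproduced from the standard deviations. A short sketch (prop_var is an illustrative name):

```r
data(iris)
iris_pca <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)

# Loadings: one row per variable, one column per principal component
iris_pca$rotation

# Proportion of variance explained, computed by hand
prop_var <- iris_pca$sdev^2 / sum(iris_pca$sdev^2)
round(prop_var, 4)

# Cumulative proportion of variance explained
round(cumsum(prop_var), 4)
```

According to the summary above, the first two components together account for about 95.8 % of the variance, which is why a two-dimensional biplot represents this dataset well.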

 

Visualize the PCA and Understand the Biplot

To understand the biplot, we can divide it into two parts: the points and the vectors, or, in other words, the score plot and the loadings plot. The biplot is the combination of these two plots. We start with the score plot:

autoplot(iris_pca,
         data = iris,
         colour = 'Species',
         main = "Score plot")

[Figure: score plot of the iris PCA, points coloured by species]

In the score plot, points that lie close to each other correspond to observations of the iris dataset that have similar scores on the principal components. When the components fit the data well, nearby points also correspond to observations that have similar values on the original variables.
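
The scores behind these points are stored in the x element of the prcomp object. As a sketch of where they come from, they can be recomputed by projecting the standardized data onto the loadings (manual_scores is an illustrative name):

```r
data(iris)
iris_data <- iris[, -5]
iris_pca <- prcomp(iris_data, center = TRUE, scale. = TRUE)

# Scores as stored by prcomp(): one row per flower, one column per component
head(iris_pca$x[, 1:2])

# Recompute the scores: standardize the data, then multiply by the loadings
manual_scores <- scale(iris_data) %*% iris_pca$rotation

# Both versions should agree up to numerical tolerance
all.equal(unname(manual_scores), unname(iris_pca$x))
```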

In our example, points that lie close together tend to belong to the same species, and the loadings help us find out which variables produce these differences between species.

fviz_pca_var(iris_pca,
             title = "Loadings plot")

[Figure: loadings plot of the four iris variables]

The loadings describe how much each original variable contributes to each principal component. In the loadings plot, we can see the contribution of each variable to the two components, as well as the direction in which each variable influences them.

Also, variables that are positively correlated point in similar directions, as happens with petal length and petal width in our example. Moreover, a positive loading indicates that a variable and a principal component are positively correlated, whereas a negative loading indicates a negative correlation.

Two further properties of the loading vectors are worth mentioning:

– The longer the vector, the more of this variable’s variability is represented by the two principal components.
– The more parallel a vector is to a principal component axis, the more it contributes only to that principal component.
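
A rough way to put numbers on these properties is to compute the correlation between each variable and the first two components. For a standardized PCA, these correlations are the coordinates typically drawn in a correlation circle, so the length of each vector can be derived from them. The names var_cor and vector_length are illustrative:

```r
data(iris)
iris_data <- iris[, -5]
iris_pca <- prcomp(iris_data, center = TRUE, scale. = TRUE)

# Correlation of each original variable with the first two components
var_cor <- cor(iris_data, iris_pca$x[, 1:2])
var_cor

# Length of each vector in the PC1-PC2 plane: values near 1 mean the
# variable is well represented by the two components
vector_length <- sqrt(rowSums(var_cor^2))
round(vector_length, 3)

# Correlation between petal length and petal width; a strong positive
# value goes with vectors pointing in almost the same direction
cor(iris$Petal.Length, iris$Petal.Width)
```

A variable whose correlation is large for one component and close to zero for the other has a vector that runs almost parallel to that component's axis.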

If we overlay these two plots, we get a biplot, which shows the observations and the variables in the same principal component space:

fviz_pca_biplot(iris_pca, 
                col.ind = iris_species)

[Figure: biplot combining the scores and the loading vectors]

In summary, a biplot shows which individuals have high values for which variables in our dataset. If an individual lies in the direction in which a variable’s vector points, this individual tends to have a high value for this variable; if it lies on the opposite side, it tends to have a low value.

Thus, for the iris dataset, where the points represent individual iris flowers and the vectors correspond to the measured variables (petal length, petal width, sepal length and sepal width), we can see that the individuals belonging to the virginica species have high values for petal length, petal width and sepal length.
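
This reading of the biplot can be cross-checked against the raw data. The following sketch computes the mean of each measurement per species and reports which species has the highest mean for each variable (species_means and top_species are illustrative names):

```r
data(iris)

# Mean of each measurement per species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means

# Species with the highest mean for each variable
top_species <- sapply(
  species_means[, -1],
  function(x) as.character(species_means$Species[which.max(x)])
)
top_species
```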

 

Summary

This post has shown how to understand the biplot of a PCA. In case you have questions or comments, you can write them below.

 

Paula Villasante Soriano Statistician & R Programmer

This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.

 
