Module 4 – Introduction to PCA in R [Course Preview]

This page introduces learners to the basics of Principal Component Analysis (PCA) using R programming. Through video lectures and interactive exercises, participants will work with various data sets, learning to modify and analyze them to understand PCA’s practical applications.

This module covers the fundamentals of PCA application, visualization of eigenvalues through scree plots, and regression modeling using PCA-filtered data, laying the groundwork for understanding PCA’s role in data analysis.

The module offers essential theoretical insights, but its focus is primarily on hands-on R coding. This foundational knowledge sets the stage for deeper exploration of these concepts in subsequent modules.

Video Lecture

Exercises

We will use a modified version of the mtcars data set for the exercises of this module. This data set provides information on various aspects of 32 automobiles, such as miles per gallon, horsepower, and weight.

Please execute the R code below in preparation for the exercises:

data(mtcars)                                              # Load mtcars data
my_mtcars <- mtcars[ , c(1, 3:7)]                         # Modify mtcars data
head(my_mtcars)                                           # First rows of modified data
#                    mpg disp  hp drat    wt  qsec
# Mazda RX4         21.0  160 110 3.90 2.620 16.46
# Mazda RX4 Wag     21.0  160 110 3.90 2.875 17.02
# Datsun 710        22.8  108  93 3.85 2.320 18.61
# Hornet 4 Drive    21.4  258 110 3.08 3.215 19.44
# Hornet Sportabout 18.7  360 175 3.15 3.440 17.02
# Valiant           18.1  225 105 2.76 3.460 20.22
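
Selecting columns by position works here, but it depends on the column order of mtcars. As a small sketch, the same subset can be selected by column name instead. The object names keep_cols and my_mtcars_by_name are only used for illustration and are not part of the course code.

```r
data(mtcars)                                              # Load mtcars data

# Same six columns as mtcars[ , c(1, 3:7)], selected by name
keep_cols <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
my_mtcars_by_name <- mtcars[ , keep_cols]

# Check that both selections give the same data frame
identical(my_mtcars_by_name, mtcars[ , c(1, 3:7)])
```

Either version can be used for the exercises; selecting by name simply makes it easier to see which variables enter the PCA.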

Now that we’re all set up, let’s proceed with the exercises.

1. Estimate a linear regression model using mpg as the target variable and the other variables in my_mtcars as predictors. Display the summary of the model.
2. Apply PCA to the my_mtcars data set, excluding the mpg column. Standardize the data by scaling it as part of the PCA.
3. Load the factoextra package and use it to generate a scree plot visualizing the eigenvalues from the PCA. Highlight the eigenvalue cutoff of 1 with a dashed red line.
4. Calculate the number of principal components to retain based on their standard deviations (sdev) from the PCA output. The eigenvalue of a component is its squared standard deviation; retain a component if its eigenvalue is greater than 1.
5. Construct a new data frame that includes only the principal components identified as important in Exercise 4. Add mpg from my_mtcars as the target variable.
6. Estimate a linear regression model using the PCA-filtered data frame. Use mpg as the target variable and the principal components as predictors. Summarize the results.
7. Compare the adjusted R-squared values between the regression model using the original my_mtcars data set and the model using the PCA-filtered data. Discuss how the introduction of PCA to the modeling process impacts the explanatory power of the model. Note: Adjusted R-squared measures how much of the variation in the target variable the model explains, adjusted for the number of predictors. The higher it is, the better. You can find it at the bottom right of the summary() output.
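
The eigenvalue rule used in the exercises relies on the fact that, for scaled data, the eigenvalues of a PCA are the squared standard deviations reported by prcomp(). The following sketch illustrates this on the built-in USArrests data rather than my_mtcars, so that the exercises themselves remain open. All object names (pca_demo, eig_from_sdev, eig_from_cor, n_keep) are assumptions chosen for this illustration.

```r
# Illustration on the built-in USArrests data (not the exercise data)
data(USArrests)

pca_demo <- prcomp(USArrests, scale. = TRUE)   # PCA on standardized variables

eig_from_sdev <- pca_demo$sdev^2               # Eigenvalues as squared standard deviations
eig_from_cor  <- eigen(cor(USArrests))$values  # Eigenvalues of the correlation matrix

all.equal(eig_from_sdev, eig_from_cor)         # Both routes should agree
sum(eig_from_sdev)                             # Sums to the number of variables when scaled

n_keep <- sum(eig_from_sdev > 1)               # Count components with eigenvalue above 1
n_keep
```

Because the eigenvalues of scaled data sum to the number of variables, a component with an eigenvalue above 1 carries more variance than a single standardized variable. This is the usual motivation for the cutoff of 1.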

The solutions to these exercises will be published at the bottom of this page after they have been discussed in the group chat.

R Code of This Lecture

Before executing the R code, please download the Introduction to PCA Synthetic Data.csv file by clicking here and save it to a suitable location on your computer. Remember to update the file path in the R code provided below so that it points to the folder where you saved the file.

library(factoextra)                                       # Load factoextra package
 
my_path <- "D:/Dropbox/Jock/Data Sets/PCA Course/"        # Import data
my_df <- read.csv2(paste0(my_path, "Introduction to PCA Synthetic Data.csv"))
 
dim(my_df)                                                # Dimensions of data
 
head(my_df)                                               # View data
 
summary(lm(y ~ ., my_df))                                 # Regression with entire data
 
my_pca <- prcomp(my_df[ , colnames(my_df) != "y"],        # Apply PCA
                 scale. = TRUE)                           # Standardize variables
 
fviz_eig(my_pca,                                          # Draw scree plot
         addlabels = TRUE,
         choice = "eigenvalue",
         ncp = ncol(my_df)) + 
  geom_hline(yintercept = 1,
             linetype = "dashed",
             color = "red")
 
my_pc_numb <- sum(my_pca$sdev^2 > 1)                      # Get number of components
my_pc_numb                                                # Print number of components
 
my_df_pca <- as.data.frame(my_pca$x)[ , seq_len(my_pc_numb), drop = FALSE]  # Important components
 
my_df_pca$y <- my_df$y                                    # Add target variable to data
 
head(my_df_pca)                                           # View PCA data
 
summary(lm(y ~ ., my_df_pca))                             # Regression with PCA data
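
The lecture code above needs the downloaded CSV file. As a rough, self-contained sketch of the same workflow, the following code runs the steps on simulated data using base R only. The simulated data frame sim_df, its variables, and all other object names are assumptions made for this illustration; they are not the course data, and the numbers they produce will differ from those in the lecture.

```r
set.seed(123)                                   # Reproducible simulated data (assumption)

n  <- 200
f1 <- rnorm(n)                                  # Two hidden factors drive the predictors
f2 <- rnorm(n)
sim_df <- data.frame(x1 = f1 + rnorm(n, sd = 0.3),
                     x2 = f1 + rnorm(n, sd = 0.3),
                     x3 = f1 + rnorm(n, sd = 0.3),
                     x4 = f2 + rnorm(n, sd = 0.3),
                     x5 = f2 + rnorm(n, sd = 0.3))
sim_df$y <- 2 * f1 - f2 + rnorm(n)              # Target depends on both factors

fit_full <- lm(y ~ ., data = sim_df)            # Regression with all predictors

sim_pca <- prcomp(sim_df[ , colnames(sim_df) != "y"],  # PCA on scaled predictors
                  scale. = TRUE)

n_pc <- sum(sim_pca$sdev^2 > 1)                 # Components with eigenvalue above 1

sim_df_pca <- as.data.frame(sim_pca$x)[ , seq_len(n_pc), drop = FALSE]
sim_df_pca$y <- sim_df$y                        # Add target variable

fit_pca <- lm(y ~ ., data = sim_df_pca)         # Regression with retained components

c(full = summary(fit_full)$adj.r.squared,       # Compare adjusted R-squared values
  pca  = summary(fit_pca)$adj.r.squared)
```

Extracting adj.r.squared from the summary object makes it possible to compare the two models directly instead of reading the values off the printed output.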

Exercise Solutions

The solutions to these exercises will be published here after they have been discussed in the group chat.

Further Resources

You can access the course overview page, timetable, and table of contents by clicking here.