Module 4 – Introduction to PCA in R [Course Preview]
This page introduces learners to the basics of Principal Component Analysis (PCA) using R programming. Through video lectures and interactive exercises, participants will work with various data sets, learning to modify and analyze them to understand PCA’s practical applications.
This module covers the fundamentals of PCA application, visualization of eigenvalues through scree plots, and regression modeling using PCA-filtered data, laying the groundwork for understanding PCA’s role in data analysis.
While offering essential theoretical insights, the focus is primarily on hands-on R coding experiences. This foundational knowledge sets the stage for deeper exploration of these concepts in subsequent modules.
Video Lecture
Exercises
We will use a modified version of the mtcars data set for the exercises of this module. This data set provides information on various aspects of 32 automobiles, such as miles per gallon, horsepower, and weight.
Please execute the R code below in preparation for the exercises:
    data(mtcars)                                  # Load mtcars data

    my_mtcars <- mtcars[ , c(1, 3:7)]             # Modify mtcars data

    head(my_mtcars)                               # First rows of modified data
    #                    mpg disp  hp drat    wt  qsec
    # Mazda RX4         21.0  160 110 3.90 2.620 16.46
    # Mazda RX4 Wag     21.0  160 110 3.90 2.875 17.02
    # Datsun 710        22.8  108  93 3.85 2.320 18.61
    # Hornet 4 Drive    21.4  258 110 3.08 3.215 19.44
    # Hornet Sportabout 18.7  360 175 3.15 3.440 17.02
    # Valiant           18.1  225 105 2.76 3.460 20.22
Now that we’re all set up, let’s proceed with the exercises.
1. Estimate a linear regression model using mpg as the target variable and the other variables in my_mtcars as predictors. Display the summary of the model.
2. Apply PCA to the my_mtcars data set, excluding the mpg column. Scale the data to standardize it before applying PCA.
3. Load the factoextra package and use it to generate a scree plot visualizing the eigenvalues from the PCA. Highlight the eigenvalue cutoff of 1 with a dashed red line.
4. Calculate the number of principal components to retain based on their standard deviations (sdev) from the PCA output. Principal components should be retained if their associated eigenvalue is greater than 1.
5. Construct a new data frame that includes only the principal components identified as important in Exercise 4. Add mpg from my_mtcars as the target variable.
6. Estimate a linear regression model using the PCA-filtered data frame. Use mpg as the target variable and the principal components as predictors. Summarize the results.
7. Compare the adjusted R-squared values between the regression model using the original my_mtcars data set and the model using the PCA-filtered data. Discuss how the introduction of PCA to the modeling process impacts the explanatory power of the model. Note: Adjusted R-squared shows how well your model explains the data; the higher, the better. You can find it at the bottom right of the summary() output.
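The link between the sdev values and the eigenvalue cutoff of 1 can be illustrated with a minimal sketch. It assumes the built-in USArrests data as a stand-in (it is not the exercise data), and the object names pca, eig, and n_keep are chosen only for this illustration. With scaled data, the squared standard deviations returned by prcomp() should match the eigenvalues of the correlation matrix:

```r
# Illustration on the built-in USArrests data (an assumption; not the exercise data)
pca <- prcomp(USArrests, scale. = TRUE)     # PCA on standardized variables

# Eigenvalues of the correlation matrix, returned in decreasing order
eig <- eigen(cor(USArrests))$values

# The squared component standard deviations equal these eigenvalues
all.equal(pca$sdev^2, eig)

# For standardized data the eigenvalues sum to the number of variables,
# so an eigenvalue above 1 marks a component that carries more variance
# than a single original variable
sum(eig)

# Number of components passing the eigenvalue > 1 rule
n_keep <- sum(pca$sdev^2 > 1)
n_keep
```

The same comparison can be run on any numeric data frame; only the data set name would change.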
The solutions to these exercises will be published at the bottom of this page after they have been discussed in the group chat.
R Code of This Lecture
Before executing the R code, please download the file Introduction to PCA Synthetic Data.csv by clicking here. Once downloaded, save it to a suitable location on your computer. Remember to update the file path in the R code provided below to reflect the new location where you’ve saved the file.
    library(factoextra)                                     # Load factoextra package

    my_path <- "D:/Dropbox/Jock/Data Sets/PCA Course/"      # Import data
    my_df <- read.csv2(paste0(my_path, "Introduction to PCA Synthetic Data.csv"))

    dim(my_df)                                              # Dimensions of data

    head(my_df)                                             # View data

    summary(lm(y ~ ., my_df))                               # Regression with entire data

    my_pca <- prcomp(my_df[ , colnames(my_df) != "y"],      # Apply PCA
                     scale = TRUE)

    fviz_eig(my_pca,                                        # Draw scree plot
             addlabels = TRUE,
             choice = "eigenvalue",
             ncp = ncol(my_df)) +
      geom_hline(yintercept = 1,
                 linetype = "dashed",
                 color = "red")

    my_pc_numb <- sum(my_pca$sdev^2 > 1)                    # Get number of components
    my_pc_numb                                              # Print number of components

    my_df_pca <- as.data.frame(my_pca$x)[ , 1:my_pc_numb]   # Important components
    my_df_pca$y <- my_df$y                                  # Add target variable to data

    head(my_df_pca)                                         # View PCA data

    summary(lm(y ~ ., my_df_pca))                           # Regression with PCA data
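The adjusted R-squared values of the two regressions can also be extracted and compared directly instead of being read off the printed summaries. The following is a minimal sketch of that workflow, assuming the built-in swiss data with Fertility as the target in place of the course CSV; the names df, target, fit_full, fit_pca, and adj_r2 are hypothetical. It also uses drop = FALSE when selecting components, since selecting a single column with [ , 1:my_pc_numb] would otherwise return a plain vector rather than a data frame:

```r
# Sketch on the built-in swiss data (an assumption; the lecture uses its own CSV)
df <- swiss
target <- "Fertility"

# Regression on all original predictors
fit_full <- lm(Fertility ~ ., data = df)

# PCA on the standardized predictors, keeping components with eigenvalue > 1
pca <- prcomp(df[ , colnames(df) != target], scale. = TRUE)
n_keep <- sum(pca$sdev^2 > 1)

# drop = FALSE keeps a data frame even if only one component is retained
df_pca <- as.data.frame(pca$x)[ , seq_len(n_keep), drop = FALSE]
df_pca[[target]] <- df[[target]]

# Regression on the retained principal components
fit_pca <- lm(Fertility ~ ., data = df_pca)

# Collect both adjusted R-squared values in one named vector
adj_r2 <- c(full = summary(fit_full)$adj.r.squared,
            pca  = summary(fit_pca)$adj.r.squared)
adj_r2
```

Because the retained components are linear combinations of the original predictors, the unadjusted R-squared of the PCA model cannot exceed that of the full model; the adjusted value can go either way, which is why the comparison is worth making.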
Exercise Solutions
The solutions to these exercises will be published here after they have been discussed in the group chat.
Further Resources
- PCA in R (Statistics Globe Article)
- PCA in R (Statistics Globe Video)
- PCA in R by Free University of Berlin
You can access the course overview page, timetable, and table of contents by clicking here.