# Module 4 – Introduction to PCA in R [Course Preview]

This page introduces learners to the basics of Principal Component Analysis (PCA) using R programming. Through video lectures and interactive exercises, participants will work with various data sets, learning to modify and analyze them to understand PCA’s practical applications.

This module covers the fundamentals of PCA application, visualization of eigenvalues through scree plots, and regression modeling using PCA-filtered data, laying the groundwork for understanding PCA’s role in data analysis.

Although the module offers essential theoretical insights, its focus is primarily on hands-on R coding. This foundational knowledge sets the stage for deeper exploration of these concepts in subsequent modules.

## Video Lecture

## Exercises

We will use a modified version of the mtcars data set for the exercises of this module. This data set provides information on various aspects of 32 automobiles, such as miles per gallon, horsepower, and weight.

Please execute the R code below in preparation for the exercises:

    data(mtcars)                                   # Load mtcars data

    my_mtcars <- mtcars[ , c(1, 3:7)]              # Modify mtcars data

    head(my_mtcars)                                # First rows of modified data
    #                    mpg disp  hp drat    wt  qsec
    # Mazda RX4         21.0  160 110 3.90 2.620 16.46
    # Mazda RX4 Wag     21.0  160 110 3.90 2.875 17.02
    # Datsun 710        22.8  108  93 3.85 2.320 18.61
    # Hornet 4 Drive    21.4  258 110 3.08 3.215 19.44
    # Hornet Sportabout 18.7  360 175 3.15 3.440 17.02
    # Valiant           18.1  225 105 2.76 3.460 20.22
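The setup code selects columns by position. As a small optional sketch, the same six columns can also be selected by name, which some readers find easier to check at a glance. The object names `by_position`, `by_name`, and `keep_cols` are made up for this illustration and are not part of the course code.

```r
data(mtcars)                                   # Load mtcars data

# Positional selection, as in the setup code
by_position <- mtcars[ , c(1, 3:7)]

# Name-based selection of the same six columns (illustrative alternative)
keep_cols <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
by_name <- mtcars[ , keep_cols]

# Check that both approaches return the same data frame
identical(by_position, by_name)
```

Either form works for the exercises; the name-based version simply keeps working if the column order of the source data ever changes.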

Now that we’re all set up, let’s proceed with the exercises.

1. Estimate a linear regression model using `mpg` as the target variable and the other variables in `my_mtcars` as predictors. Display the summary of the model.
2. Apply PCA to the `my_mtcars` data set, excluding the `mpg` column. Scale the data to standardize it before applying PCA.
3. Load the `factoextra` package and use it to generate a scree plot visualizing the eigenvalues from the PCA. Highlight the eigenvalue cutoff of 1 with a dashed red line.
4. Calculate the number of principal components to retain based on their standard deviations (`sdev`) from the PCA output. Principal components should be retained if their associated eigenvalue is greater than 1.
5. Construct a new data frame that includes only the principal components identified as important in Exercise 4. Add `mpg` from `my_mtcars` as the target variable.
6. Estimate a linear regression model using the PCA-filtered data frame. Use `mpg` as the target variable and the principal components as predictors. Summarize the results.
7. Compare the adjusted R-squared values between the regression model using the original `my_mtcars` data set and the model using the PCA-filtered data. Discuss how the introduction of PCA to the modeling process impacts the explanatory power of the model.

Note: Adjusted R-squared shows how well your model explains the data while accounting for the number of predictors; the higher, the better. You can find it at the bottom right of the `summary()` output.
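The link between `sdev` and the eigenvalue cutoff can be sketched on a different data set, so that the exercises stay unsolved. The example below uses the built-in `USArrests` data purely as an illustration; the object names are hypothetical. It shows that the squared standard deviations returned by `prcomp()` match the eigenvalues of the correlation matrix when the data are scaled.

```r
# Illustration on USArrests (not the exercise data)
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig_from_sdev <- pca_demo$sdev^2

# The same values result from an eigen-decomposition of the correlation matrix
eig_from_cor <- eigen(cor(USArrests))$values
all.equal(eig_from_sdev, eig_from_cor)

# With standardized variables, the eigenvalues sum to the number of variables,
# so an eigenvalue of 1 corresponds to the variance of one original variable
sum(eig_from_sdev)

# Count the components whose eigenvalue exceeds the cutoff of 1
n_keep <- sum(eig_from_sdev > 1)
n_keep
```

This is why a cutoff of 1 is a common rule of thumb: a component below that value explains less variance than a single standardized input variable.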

*The solutions to these exercises will be published at the bottom of this page after they have been discussed in the group chat.*

## R Code of This Lecture

Before executing the R code, please download the `Introduction to PCA Synthetic Data.csv` file by clicking here. Once downloaded, save it to a suitable location on your computer. Remember to update the file path in the R code provided below to reflect the new location where you’ve saved the file.
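As a brief aside ahead of the lecture code: the import step there uses `read.csv2()`, which expects semicolons as field separators and commas as decimal marks. The sketch below demonstrates this behavior on a tiny made-up file written to a temporary folder. The file name `demo_semicolon.csv` and its contents are hypothetical and only stand in for the downloaded course file.

```r
# A temporary folder stands in for your own download location (assumption)
demo_path <- tempdir()
demo_file <- file.path(demo_path, "demo_semicolon.csv")

# Write a tiny semicolon-separated file with decimal commas
writeLines(c("x1;x2;y",
             "1,5;2,0;3,2",
             "0,7;1,1;2,4"),
           demo_file)

# read.csv2() uses sep = ";" and dec = "," by default
demo_df <- read.csv2(demo_file)
str(demo_df)                                   # All three columns are numeric
```

`file.path()` joins folder and file name with a separator, so it does not matter whether the folder string ends with a slash.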

    library(factoextra)                            # Load factoextra package

    my_path <- "D:/Dropbox/Jock/Data Sets/PCA Course/"  # Import data
    my_df <- read.csv2(paste0(my_path, "Introduction to PCA Synthetic Data.csv"))

    dim(my_df)                                     # Dimensions of data

    head(my_df)                                    # View data

    summary(lm(y ~ ., my_df))                      # Regression with entire data

    my_pca <- prcomp(my_df[ , colnames(my_df) != "y"],  # Apply PCA
                     scale = TRUE)

    fviz_eig(my_pca,                               # Draw scree plot
             addlabels = TRUE,
             choice = "eigenvalue",
             ncp = ncol(my_df)) +
      geom_hline(yintercept = 1,
                 linetype = "dashed",
                 color = "red")

    my_pc_numb <- sum(my_pca$sdev^2 > 1)           # Get number of components
    my_pc_numb                                     # Print number of components

    my_df_pca <- as.data.frame(my_pca$x)[ , 1:my_pc_numb]  # Important components
    my_df_pca$y <- my_df$y                         # Add target variable to data
    head(my_df_pca)                                # View PCA data

    summary(lm(y ~ ., my_df_pca))                  # Regression with PCA data
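The two `summary()` calls in the lecture code print the adjusted R-squared values, which are then compared by eye. As a minimal sketch, the values can also be extracted and placed side by side. The example below runs on simulated data generated inside the block, so every variable name and coefficient in it is an assumption and not taken from the course file.

```r
set.seed(123)                                  # Reproducible simulated data (hypothetical)
n <- 200
z <- rnorm(n)                                  # Shared latent factor
sim_df <- data.frame(x1 = z + rnorm(n, sd = 0.3),
                     x2 = z + rnorm(n, sd = 0.3),
                     x3 = rnorm(n),
                     x4 = rnorm(n))
sim_df$y <- 2 * z + sim_df$x3 + rnorm(n)       # Target variable

fit_full <- lm(y ~ ., data = sim_df)           # Regression with all predictors

sim_pca <- prcomp(sim_df[ , colnames(sim_df) != "y"],  # PCA on the predictors
                  scale. = TRUE)
n_comp <- sum(sim_pca$sdev^2 > 1)              # Components with eigenvalue > 1

# drop = FALSE keeps a data frame even if only one component is retained
sim_df_pca <- as.data.frame(sim_pca$x)[ , seq_len(n_comp), drop = FALSE]
sim_df_pca$y <- sim_df$y

fit_pca <- lm(y ~ ., data = sim_df_pca)        # Regression with retained components

# Adjusted R-squared of both models side by side
c(full = summary(fit_full)$adj.r.squared,
  pca  = summary(fit_pca)$adj.r.squared)
```

The `drop = FALSE` argument is a defensive choice in this sketch: without it, selecting a single column returns a plain vector instead of a data frame, and adding the target column would then fail.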

## Exercise Solutions

*The solutions to these exercises will be published here after they have been discussed in the group chat.*

## Further Resources

- PCA in R (Statistics Globe Article)
- PCA in R (Statistics Globe Video)
- PCA in R by Free University of Berlin


You can access the course overview page, timetable, and table of contents by clicking here.