# Module 4 – Introduction to PCA in R [Course Preview]

This page introduces learners to the basics of Principal Component Analysis (PCA) using R programming. Through video lectures and interactive exercises, participants will work with various data sets, learning to modify and analyze them to understand PCA’s practical applications.

This module covers the fundamentals of PCA application, visualization of eigenvalues through scree plots, and regression modeling using PCA-filtered data, laying the groundwork for understanding PCA’s role in data analysis.

While the module offers essential theoretical insights, its primary focus is on hands-on R coding. This foundational knowledge sets the stage for deeper exploration of these concepts in subsequent modules.

## Exercises

We will use a modified version of the mtcars data set for the exercises of this module. This data set provides information on various aspects of 32 automobiles, such as miles per gallon, horsepower, and weight.

Please execute the R code below in preparation for the exercises:

```r
data(mtcars)                                              # Load mtcars data
my_mtcars <- mtcars[ , c(1, 3:7)]                         # Modify mtcars data
head(my_mtcars)                                           # First rows of modified data
#                    mpg disp  hp drat    wt  qsec
# Mazda RX4         21.0  160 110 3.90 2.620 16.46
# Mazda RX4 Wag     21.0  160 110 3.90 2.875 17.02
# Datsun 710        22.8  108  93 3.85 2.320 18.61
# Hornet 4 Drive    21.4  258 110 3.08 3.215 19.44
# Hornet Sportabout 18.7  360 175 3.15 3.440 17.02
# Valiant           18.1  225 105 2.76 3.460 20.22
```

Now that we’re all set up, let’s proceed with the exercises.

1. Estimate a linear regression model using `mpg` as the target variable and the other variables in `my_mtcars` as predictors. Display the summary of the model.
2. Apply PCA to the `my_mtcars` data set, excluding the `mpg` column. Scale the data to standardize it before applying PCA.
3. Load the `factoextra` package and use it to generate a scree plot visualizing the eigenvalues from the PCA. Highlight the eigenvalue cutoff of 1 with a dashed red line.
4. Calculate the number of principal components to retain based on their standard deviations (`sdev`) from the PCA output. Principal components should be retained if their associated eigenvalue (the squared `sdev`) is greater than 1.
5. Construct a new data frame that includes only the principal components identified as important in Exercise 4. Add `mpg` from `my_mtcars` as the target variable.
6. Estimate a linear regression model using the PCA-filtered data frame. Use `mpg` as the target variable and the principal components as predictors. Summarize the results.
7. Compare the adjusted R-squared values between the regression model using the original `my_mtcars` data set and the model using the PCA-filtered data. Discuss how introducing PCA into the modeling process affects the model's explanatory power. Note: adjusted R-squared indicates how well your model explains the data; the higher, the better. You can find it at the bottom right of the `summary()` output.
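
For Exercise 7, the adjusted R-squared can also be extracted programmatically rather than read off the printed summary. The sketch below is an illustration of where the value lives in the `summary()` object, not a solution to the exercises; the object name `fit` is just a placeholder:

```r
# Sketch: extracting adjusted R-squared from a fitted lm object.
# Uses the built-in mtcars data prepared as in the setup code above;
# the object name `fit` is illustrative.
data(mtcars)
my_mtcars <- mtcars[ , c(1, 3:7)]
fit <- lm(mpg ~ ., data = my_mtcars)
summary(fit)$adj.r.squared                                # Adjusted R-squared as a plain number
```

This is convenient when comparing several models in code instead of eyeballing two printed summaries.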

The solutions to these exercises will be published at the bottom of this page after they have been discussed in the group chat.

## R Code of This Lecture

Before executing the R code, please download the `Introduction to PCA Synthetic Data.csv` file by clicking here. Once downloaded, save it to a suitable location on your computer. Remember to update the file path in the R code provided below to reflect the new location where you’ve saved the file.

```r
library(factoextra)                                       # Load factoextra package

my_path <- "D:/Dropbox/Jock/Data Sets/PCA Course/"        # Import data
my_df <- read.csv2(paste0(my_path, "Introduction to PCA Synthetic Data.csv"))

dim(my_df)                                                # Dimensions of data

summary(lm(y ~ ., my_df))                                 # Regression with entire data

my_pca <- prcomp(my_df[ , colnames(my_df) != "y"],        # Apply PCA
                 scale = TRUE)

fviz_eig(my_pca,                                          # Draw scree plot
         choice = "eigenvalue",
         ncp = ncol(my_df)) +
  geom_hline(yintercept = 1,
             linetype = "dashed",
             color = "red")

my_pc_numb <- sum(my_pca$sdev^2 > 1)                      # Get number of components
my_pc_numb                                                # Print number of components

my_df_pca <- as.data.frame(my_pca$x)[ , 1:my_pc_numb]     # Important components

my_df_pca$y <- my_df$y                                    # Add target variable to data

summary(lm(y ~ ., my_df_pca))                             # Regression with PCA data
```
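
A brief aside on why the code squares `sdev` before applying the cutoff of 1: with `scale = TRUE`, the squared standard deviations returned by `prcomp()` are exactly the eigenvalues of the correlation matrix, which is what the eigenvalue-greater-than-1 rule (the Kaiser criterion) refers to. A minimal check of this relationship, using the built-in `mtcars` columns since the course CSV is not bundled with this page:

```r
# Sketch: prcomp()'s sdev^2 equal the eigenvalues of the correlation
# matrix when scale = TRUE. Columns 3:7 of mtcars stand in for the
# course data here.
data(mtcars)
pca <- prcomp(mtcars[ , 3:7], scale = TRUE)
eig_pca <- pca$sdev^2                                     # Eigenvalues via PCA
eig_cor <- eigen(cor(mtcars[ , 3:7]))$values              # Eigenvalues directly
all.equal(eig_pca, eig_cor)                               # TRUE (up to numerical tolerance)
```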

## Exercise Solutions

The solutions to these exercises will be published here after they have been discussed in the group chat.
