Exploratory Factor Analysis in R (Example)

 

In this tutorial, I’ll explain how to perform exploratory factor analysis (EFA) in the R programming language.

We’ll work with a built-in R dataset containing personality assessment data. This will serve as a hands-on example to showcase how EFA can help discover the underlying patterns in personality traits.


Let’s dive into it!

Packages & Sample Dataset

Let us first install the packages needed for the data manipulation and analysis steps.

install.packages("psych") # install packages
install.packages("nFactors")
install.packages("reshape2")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("GPArotation")
install.packages("corrplot")
install.packages("Hmisc")

Next, we need to load the respective libraries.

library(psych) # load libraries
library(nFactors)
library(reshape2)
library(ggplot2)
library(dplyr)
library(GPArotation)
library(corrplot)
library(Hmisc)

Once all libraries are set, we can load our sample data. We will use a built-in dataset called bfi, which includes 25 personality-related self-report items taken from the International Personality Item Pool as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project.

As seen in the following code, we will use the data() function to load the data into our environment. Note that in recent versions of psych, the bfi dataset ships with the companion psychTools package, so you may need to load that package first.

data(bfi) # load data

In bfi, the first 25 columns represent personality self-report items (6-point response scale from 1: Very Inaccurate to 6: Very Accurate), and the last three are demographic variables.

Since we are only interested in discovering underlying personality factors, I will omit the last three columns as follows.

mydata <- bfi[, 1:25]  # use only 25 personality items

The columns in the original bfi dataset have predefined names that relate directly to specific personality traits. However, to simulate a real-world scenario where you work with an unstructured dataset, I will rename the columns to have a more generic format: Q1, Q2, … Q25.

names(mydata) <- paste0("Q", 1:25) # rename columns
 
head(mydata) # print mydata
#       Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25
# 61617  2  4  3  4  4  2  3  3  4   4   3   3   3   4   4   3   4   2   2   3   3   6   3   4   3
# 61618  2  4  5  2  5  5  4  4  3   4   1   1   6   4   3   3   3   3   5   5   4   2   4   3   3
# 61620  5  4  5  4  4  4  5  4  2   5   2   4   4   4   5   4   5   4   2   3   4   2   5   5   2
# 61621  4  4  6  5  5  4  4  3  5   5   5   3   4   4   4   2   5   2   4   1   3   3   4   3   5
# 61622  2  3  3  4  5  4  4  5  3   2   2   2   5   4   5   2   3   4   4   3   3   3   4   3   3
# 61623  6  6  5  6  5  6  6  6  1   3   2   1   6   5   6   3   5   2   2   3   4   3   5   6   1

The final view of the dataset is as shown above. Let’s now explore our data and prepare it for EFA!

 

Data Exploration & Preparation

We will use the describe() function of the psych package to get an overview of our data. Since the Hmisc package provides a function with the same name and was loaded afterwards, we call it with the fully qualified name psych::describe() to avoid masking.

psych::describe(mydata) # descriptive statistics
#     vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
# Q1     1 2784 2.41 1.41      2    2.23 1.48   1   6     5  0.83    -0.31 0.03
# Q2     2 2773 4.80 1.17      5    4.98 1.48   1   6     5 -1.12     1.05 0.02
# Q3     3 2774 4.60 1.30      5    4.79 1.48   1   6     5 -1.00     0.44 0.02
# Q4     4 2781 4.70 1.48      5    4.93 1.48   1   6     5 -1.03     0.04 0.03
# Q5     5 2784 4.56 1.26      5    4.71 1.48   1   6     5 -0.85     0.16 0.02
# Q6     6 2779 4.50 1.24      5    4.64 1.48   1   6     5 -0.85     0.30 0.02
# Q7     7 2776 4.37 1.32      5    4.50 1.48   1   6     5 -0.74    -0.14 0.03
# Q8     8 2780 4.30 1.29      5    4.42 1.48   1   6     5 -0.69    -0.13 0.02
# Q9     9 2774 2.55 1.38      2    2.41 1.48   1   6     5  0.60    -0.62 0.03
# Q10   10 2784 3.30 1.63      3    3.25 1.48   1   6     5  0.07    -1.22 0.03
# Q11   11 2777 2.97 1.63      3    2.86 1.48   1   6     5  0.37    -1.09 0.03
# Q12   12 2784 3.14 1.61      3    3.06 1.48   1   6     5  0.22    -1.15 0.03
# Q13   13 2775 4.00 1.35      4    4.07 1.48   1   6     5 -0.47    -0.47 0.03
# Q14   14 2791 4.42 1.46      5    4.59 1.48   1   6     5 -0.82    -0.30 0.03
# Q15   15 2779 4.42 1.33      5    4.56 1.48   1   6     5 -0.78    -0.09 0.03
# Q16   16 2778 2.93 1.57      3    2.82 1.48   1   6     5  0.37    -1.01 0.03
# Q17   17 2779 3.51 1.53      4    3.51 1.48   1   6     5 -0.08    -1.05 0.03
# Q18   18 2789 3.22 1.60      3    3.16 1.48   1   6     5  0.15    -1.18 0.03
# Q19   19 2764 3.19 1.57      3    3.12 1.48   1   6     5  0.20    -1.09 0.03
# Q20   20 2771 2.97 1.62      3    2.85 1.48   1   6     5  0.37    -1.06 0.03
# Q21   21 2778 4.82 1.13      5    4.96 1.48   1   6     5 -0.90     0.43 0.02
# Q22   22 2800 2.71 1.57      2    2.56 1.48   1   6     5  0.59    -0.81 0.03
# Q23   23 2772 4.44 1.22      5    4.56 1.48   1   6     5 -0.77     0.30 0.02
# Q24   24 2786 4.89 1.22      5    5.10 1.48   1   6     5 -1.22     1.08 0.02
# Q25   25 2780 2.49 1.33      2    2.34 1.48   1   6     5  0.74    -0.24 0.03

As seen, the bfi dataset is relatively clean: all items use the same 6-point scale, and there are no obvious inconsistencies (e.g., no out-of-range values).

Furthermore, outliers do not seem to be an issue, given the small differences between the means and trimmed means. Nonetheless, we can revisit them later if the analysis findings suggest so.

However, you can see that there is some missingness; note the varying sample sizes n across the variables. This is definitely worth considering. Let’s dive into it!

Exploring & Handling Missingness

We can start by analyzing the column-wise missingness. First, we will use the is.na() function to convert the data to logical (TRUE/FALSE) values indicating missingness. Then we will sum these values by column via colSums() to find the column-wise missingness. In the final step, we will divide the output by the number of rows using nrow() and multiply it by 100 to obtain percentages.

col_miss <- round(colSums(is.na(mydata))/nrow(mydata) * 100, 2) # column-wise missingness
col_miss
#   Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9  Q10  Q11  Q12  Q13  Q14  Q15  Q16  Q17  Q18  Q19  Q20  Q21  Q22  Q23  Q24  Q25 
# 0.57 0.96 0.93 0.68 0.57 0.75 0.86 0.71 0.93 0.57 0.82 0.57 0.89 0.32 0.75 0.79 0.75 0.39 1.29 1.04 0.79 0.00 1.00 0.50 0.71

Above, we see the percentage of missingness per variable; all values are low, less than or around 1%. This indicates that none of the variables exhibits an anomalous amount of missingness.

It’s also crucial to consider case-wise missingness to get a comprehensive view of data completeness. Let’s delve into that aspect next!

First, we will create a data frame where all entries in mydata are recorded as logical values based on whether the entry is missing. Then, we will group the data by the missingness pattern and calculate the missingness percentage per pattern.

df_miss <- as.data.frame(is.na(mydata)) # create missingness data
 
df_agg <- df_miss %>% # aggregate data by pattern  & calculate frequency
  group_by(across(everything())) %>%
  tally() %>%
  ungroup() %>%
  mutate(perc = n / sum(n) * 100) %>%
  arrange(-perc)

Above, group_by(across(everything())) groups the data by all unique combinations of values (missingness patterns) in df_miss. The tally() function counts the number of occurrences of each pattern and stores it as n. The percentages perc are then calculated from n and sorted in descending order via the arrange() function.

First, let’s see how many different patterns were found by checking the number of rows in the previously created data frame df_agg.

dim(df_agg)[1] # number of rows in df_agg
# [1] 87

Here, we see there are 87 missingness patterns in total.

Let’s now check the largest missingness pattern percentages by printing the first six rows of df_agg.

head(df_agg) # print df_agg
# # A tibble: 6 × 27
#   Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10   Q11   Q12   Q13   Q14   Q15   Q16   Q17   Q18   Q19   Q20   Q21   Q22   Q23   Q24   Q25       n   perc
# <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>    <int>  <dbl>
# 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  2436    87    
# 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE    25  0.893
# 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE FALSE    17  0.607
# 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    17  0.607
# 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    16  0.571
# 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    16  0.571

As observed, the highest percentage, 87%, belongs to the pattern of complete cases (all FALSE). The remaining percentages are negligible, which provides some evidence that the missingness could be completely at random. Therefore, as a handling method, we will simply omit the incomplete cases for simplicity.

mydata_nm <- na.omit(mydata)  # remove cases with missingness

If you have a limited sample size, you can also opt for multiple imputation, which requires evaluating multiple EFA outputs, one per imputed dataset.
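
As a rough sketch of how that could look, assuming the mice package (which is not installed or loaded in this tutorial), you could create several imputed datasets and fit an EFA to each of them:

install.packages("mice") # hypothetical additional package, not part of this tutorial's setup
library(mice)
 
imp <- mice(mydata, m = 5, seed = 1234, printFlag = FALSE) # create 5 imputed datasets
 
efa_list <- lapply(1:5, function(i) { # fit an EFA to each completed dataset
  # e.g., the five-factor solution chosen later in this tutorial; Pearson correlations used here for brevity
  fa(complete(imp, i), nfactors = 5, rotate = "promax", fm = "pa")
})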

It is also worth noting that the missingness exploration can be enhanced with graphical representations. Here, we excluded them since the numerical outputs were straightforward to interpret.
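
If you would nevertheless like a quick visual overview, here is a minimal sketch using the already loaded ggplot2 package and the col_miss object created above:

miss_df <- data.frame(item = names(col_miss), # column-wise missingness as a data frame
                      perc = as.numeric(col_miss))
 
ggplot(miss_df, aes(x = reorder(item, perc), y = perc)) + # bar chart of missingness percentages
  geom_col() +
  coord_flip() +
  labs(x = "Item", y = "Missingness (%)")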
 

Correlation Matrix Computation

In order to understand how the variables relate to each other, it is crucial to compute the correlations. We will calculate polychoric correlations, considering the ordinal (Likert-type) nature of our data.

First, we will use the polychoric() function of the psych package for the computation. After that, we will retrieve the correlation coefficients to visualize the results using the corrplot() function of the corrplot package.

polychoric_corr <- polychoric(mydata_nm)$rho # compute correlations
 
corrplot(polychoric_corr, # visualize correlations
         type = "lower",
         method = "number",
         tl.col = "darkblue",
         number.cex = 0.6,
         tl.cex = 0.6,
         col = COL2('RdBu'))

As seen, the rho element storing the correlations was retrieved and passed to the corrplot() function. In the function call, certain visual elements, like the matrix type type, were also specified. If you are interested in different visual designs, see this article.

 
polychoric correlation matrix

Consecutive questions tend to have either strong positive or strong negative correlations with each other. Also, the question group Q2-Q5 correlates considerably with the question group Q12-Q15. These pronounced associations are indicators of potential hidden factors in the dataset.
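
To complement the visual inspection, you could also list the most strongly correlated item pairs programmatically. A minimal sketch, using an arbitrary threshold of 0.5 for illustration:

strong <- which(abs(polychoric_corr) > 0.5 & upper.tri(polychoric_corr), # find strongly correlated pairs
                arr.ind = TRUE)
 
data.frame(item1 = rownames(polychoric_corr)[strong[, 1]], # item pairs and their correlations
           item2 = colnames(polychoric_corr)[strong[, 2]],
           rho   = round(polychoric_corr[strong], 2))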

Let’s take this analysis to the next level by computing the KMO statistic!

KMO(polychoric_corr) # check factorability
# Kaiser-Meyer-Olkin factor adequacy
# Call: KMO(r = polychoric_corr)
# Overall MSA =  0.86
# MSA for each item = 
# Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9  Q10  Q11  Q12  Q13  Q14  Q15  Q16  Q17  Q18  Q19  Q20  Q21  Q22  Q23 
# 0.77 0.84 0.88 0.88 0.91 0.86 0.80 0.86 0.84 0.87 0.84 0.89 0.90 0.88 0.90 0.78 0.78 0.87 0.89 0.86 0.88 0.80 0.86 
# Q24  Q25 
# 0.78 0.79

The KMO statistic assesses the sampling adequacy of the data for factor analysis. With an overall MSA of 0.86 and item-level MSAs all close to or above 0.8, the data appears well suited for this analysis.
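
As an additional, commonly used factorability check that is not part of the output above, you could run Bartlett's test of sphericity, which tests whether the correlation matrix differs from an identity matrix. A minimal sketch with the psych package:

cortest.bartlett(polychoric_corr,     # Bartlett's test of sphericity
                 n = nrow(mydata_nm)) # number of complete cases used for the correlations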

We are on the right track! Let’s start with deciding the number of factors to extract!

 

Factor Number Estimation

There are different methods for selecting the number of factors. In this tutorial, we will use two relatively sophisticated ones: parallel analysis (PA) and the minimum average partial (MAP) test.

Let’s start with PA, which compares the eigenvalues of randomly generated and resampled data with the eigenvalues of the actual data. According to this method, the factors of the actual data whose eigenvalues are larger than the corresponding random/resampled eigenvalues are retained.

We will use the fa.parallel() function of the psych package to return a PA solution with a graphical representation.

set.seed(1234) # set seed for reproducibility
 
fa.parallel(mydata_nm, # perform parallel analysis
            fa = "fa",
            fm = "pa",
            show.legend = TRUE,
            main = "Scree Plot and Parallel Analysis")
# Parallel analysis suggests that the number of factors =  6  and the number of components =  NA

As you can see, the analysis suggests six factors. Since we requested only the factor computation by specifying fa = "fa", there is no suggestion regarding the principal component number.

Let’s also check the visual output!

Parallel Analysis Graph

Here, we observe the visual representation of the six-factor solution. The simulated and resampled data eigenvalues overlap, which makes the corresponding lines appear as a single line. This overlap is a result of using a polychoric matrix; see the function documentation.

To make a more informed decision about the number of factors, it’s advisable to use alternative methods as well. As mentioned earlier, we will utilize the MAP test to confirm our initial findings.

This method examines the average squared partial correlation after successively partialling out different numbers of factors. According to this method, the ideal number of factors is the one associated with the smallest average squared partial correlation.

VSS(mydata_nm) # perform MAP
# Very Simple Structure
# Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
#           n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
# VSS complexity 1 achieves a maximimum of 0.59  with  4  factors
# VSS complexity 2 achieves a maximimum of 0.75  with  5  factors
# 
# The Velicer MAP achieves a minimum of 0.01  with  5  factors 
# BIC achieves a minimum of  -550.01  with  8  factors
# Sample Size adjusted BIC achieves a minimum of  -143.33  with  8  factors
# 
# Statistics by number of factors 
#   vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC SABIC complex eChisq  SRMR eCRMS  eBIC
# 1 0.50 0.00 0.025 275 10656  0.0e+00    26.0 0.50 0.124 8512  9386     1.0  21181 0.120 0.126 19037
# 2 0.55 0.64 0.019 251  6669  0.0e+00    18.7 0.64 0.102 4712  5509     1.2  11015 0.087 0.095  9058
# 3 0.57 0.71 0.018 228  4585  0.0e+00    14.5 0.72 0.089 2807  3531     1.3   6268 0.065 0.075  4490
# 4 0.59 0.75 0.016 206  3033  0.0e+00    11.3 0.78 0.075 1426  2081     1.4   3129 0.046 0.056  1523
# 5 0.54 0.75 0.015 185  1539 8.6e-212     9.1 0.82 0.055   96   684     1.5   1132 0.028 0.035  -311
# 6 0.54 0.73 0.016 165   917 1.6e-104     8.1 0.84 0.043 -370   154     1.7    553 0.019 0.026  -734
# 7 0.51 0.69 0.019 146   631  1.8e-61     7.7 0.85 0.037 -508   -44     1.9    372 0.016 0.023  -767
# 8 0.51 0.69 0.022 128   448  4.0e-37     7.2 0.86 0.032 -550  -143     1.9    238 0.013 0.020  -760

I have preferred to use the VSS() function since it provides other statistics that one might be interested in. Nevertheless, within the scope of this tutorial, we will only focus on the map column above.

Accordingly, the five-factor solution, associated with the smallest MAP value of 0.015, looks optimal. Remember, EFA is inherently an exploratory technique, so it’s essential to remain flexible when evaluating potential solutions.
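
If you prefer to identify the MAP-optimal factor number programmatically instead of reading it off the printed table, a small sketch is shown below; it assumes that the object returned by VSS() stores the MAP values in its map element.

vss_out <- VSS(mydata_nm, plot = FALSE) # store the VSS/MAP results
 
which.min(vss_out$map)                  # number of factors with the smallest MAP value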

In light of these results, we will perform our EFA considering both the six-factor and the five-factor solutions in the next section.

Without further ado, let’s jump into EFA!

 

Exploratory Factor Analysis

Let’s start with the six-factor solution. All we need is to call the fa() function, passing the correlation matrix, the rotation and extraction methods, and the number of factors.

efa_promax6 <- fa(polychoric_corr, # perform EFA with 6 factors
                  nfactors = 6,
                  rotate = "promax",
                  fm = "pa")

As seen, we input our polychoric correlation matrix polychoric_corr and number of factors 6, and set the rotation to "promax" and the extraction method to principal axis factoring "pa".

We’ve opted for the promax rotation, a type of oblique rotation; oblique rotations are generally preferred over their orthogonal counterparts in practice because they allow the factors to correlate.

Let’s print the result showing factor loadings above 0.40, which is an arbitrary yet common threshold. You can also print the result without applying any threshold; it’s just for the sake of interpretation.

print(efa_promax6, # print efa_promax6
      cut = 0.4)
 
# Factor Analysis using method =  pa
# Call: fa(r = polychoric_corr, nfactors = 6, rotate = "promax", fm = "pa")
# Standardized loadings (pattern matrix) based upon correlation matrix
#     PA1   PA2   PA3   PA4   PA5   PA6   h2   u2 com
# Q1                          -0.69       0.40 0.60 1.3
# Q2                           0.70       0.60 0.40 1.2
# Q3                           0.53       0.59 0.41 1.9
# Q4                                      0.35 0.65 3.1
# Q5                                      0.55 0.45 3.5
# Q6               0.63                   0.41 0.59 1.2
# Q7               0.79                   0.55 0.45 1.1
# Q8               0.65                   0.35 0.65 1.1
# Q9              -0.72              0.48 0.66 0.34 1.9
# Q10             -0.59                   0.48 0.52 1.6
# Q11 -0.73                               0.44 0.56 1.4
# Q12 -0.78                               0.62 0.38 1.1
# Q13  0.55                               0.54 0.46 2.1
# Q14  0.70                               0.63 0.37 1.4
# Q15  0.54                               0.46 0.54 1.9
# Q16        0.94                         0.76 0.24 1.2
# Q17        0.93                         0.71 0.29 1.2
# Q18        0.76                         0.59 0.41 1.0
# Q19        0.49                         0.55 0.45 2.2
# Q20        0.51                         0.40 0.60 1.8
# Q21                   -0.49             0.42 0.58 2.2
# Q22                    0.54             0.34 0.66 1.2
# Q23                   -0.56             0.54 0.46 2.1
# Q24                   -0.41             0.35 0.65 3.1
# Q25                    0.65             0.43 0.57 1.1
# 
#                        PA1  PA2  PA3  PA4  PA5  PA6
# SS loadings           2.95 2.92 2.34 1.76 1.75 1.01
# Proportion Var        0.12 0.12 0.09 0.07 0.07 0.04
# Cumulative Var        0.12 0.23 0.33 0.40 0.47 0.51
# Proportion Explained  0.23 0.23 0.18 0.14 0.14 0.08
# Cumulative Proportion 0.23 0.46 0.65 0.78 0.92 1.00
# 
# With factor correlations of 
#       PA1   PA2   PA3   PA4   PA5   PA6
# PA1  1.00 -0.38  0.43 -0.11  0.31  0.12
# PA2 -0.38  1.00 -0.22  0.04 -0.18  0.25
# PA3  0.43 -0.22  1.00 -0.19  0.31  0.25
# PA4 -0.11  0.04 -0.19  1.00 -0.02 -0.07
# PA5  0.31 -0.18  0.31 -0.02  1.00  0.34
# PA6  0.12  0.25  0.25 -0.07  0.34  1.00
# 
# Mean item complexity =  1.7
# Test of the hypothesis that 6 factors are sufficient.
# 
# df null model =  300  with the objective function =  9.59
# df of  the model are 165  and the objective function was  0.57 
# 
# The root mean square of the residuals (RMSR) is  0.02 
# The df corrected root mean square of the residuals is  0.03 
# 
# Fit based upon off diagonal values = 0.99
# Measures of factor score adequacy             
#                                                   PA1  PA2  PA3  PA4  PA5  PA6
# Correlation of (regression) scores with factors   0.93 0.95 0.91 0.87 0.89 0.83
# Multiple R square of scores with factors          0.87 0.90 0.83 0.75 0.78 0.69
# Minimum correlation of possible factor scores     0.74 0.80 0.67 0.51 0.57 0.39

As seen, the factor analysis object provides an extensive output. Even so, the factor loadings are usually the main interest in EFA.

Based on the loadings table, most questions load sufficiently on a single factor; the exceptions are Q4 and Q5, which do not reach the 0.40 threshold on any factor, and Q9, which cross-loads on two factors (PA3 and PA6). Overall, this implies that some groups of questions may measure a similar concept. For example, Q1, Q2, and Q3 might all pertain to a specific personality trait.

However, it is important to note that the choice of factor extraction and rotation methods has an impact on the outcome. Therefore, it is advisable to try out multiple extraction and rotation methods to get the most interpretable result, ideally one where each variable loads highly onto one and only one factor, also called simple structure.
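
For instance, a sketch (results not shown here) refitting the six-factor model with an alternative oblique rotation and an alternative extraction method could look as follows:

efa_oblimin6 <- fa(polychoric_corr, # oblimin rotation instead of promax
                   nfactors = 6,
                   rotate = "oblimin",
                   fm = "pa")
 
efa_ml6 <- fa(polychoric_corr,      # maximum likelihood extraction instead of principal axis
              nfactors = 6,
              rotate = "promax",
              fm = "ml")
 
print(efa_oblimin6, cut = 0.4)      # compare the loading patterns
print(efa_ml6, cut = 0.4)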

Additionally, you can check the explained variance statistics per factor in the second table. For instance, the six factors together explain 51% (0.51) of the variance in the dataset. However, unlike in PCA, maximizing the explained variance is not the main goal in EFA.

You can observe the correlations of the factors in the third table, where some moderate-level correlations appear. For instance, the correlation between PA1 and PA2 is -0.38, which implies that these two personality traits are inversely associated.

Finally, you can go over the factor score adequacy statistics in the fourth table to check the reliability of the estimated factor scores, which are the scores of individuals on the extracted factors. As in regression, the multiple R square (R²) refers to the proportion of variance of the factor scores explained by the question items; it ranges between 0 and 1, with higher values being preferable.
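
Note that, since we fit the model on the correlation matrix rather than on the raw data, the fa() object does not contain individual factor scores. A minimal sketch for estimating them afterwards with the factor.scores() function of the psych package (scoring the complete cases; small discrepancies with the polychoric solution are possible because the raw items are used):

fs <- factor.scores(mydata_nm, efa_promax6) # estimate factor scores from the raw items and the EFA solution
 
head(fs$scores)                             # scores of the first respondents on the six factors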

Let’s now check the other potential solution, extracting five factors!

efa_promax5 <- fa(polychoric_corr, # perform EFA with 5 factors
                   nfactors = 5,
                   rotate = "promax",
                   fm = "pa")
 
print(efa_promax5, # print efa_promax5
      cut = 0.4)
# Factor Analysis using method =  pa
# Call: fa(r = polychoric_corr, nfactors = 5, rotate = "promax", fm = "pa")
# Standardized loadings (pattern matrix) based upon correlation matrix
#      PA2   PA1   PA3   PA5   PA4   h2   u2 com
# Q1                    -0.53       0.26 0.74 1.4
# Q2                     0.66       0.54 0.46 1.1
# Q3                     0.64       0.61 0.39 1.3
# Q4                     0.43       0.35 0.65 2.2
# Q5                     0.51       0.53 0.47 1.9
# Q6               0.60             0.40 0.60 1.2
# Q7               0.73             0.51 0.49 1.2
# Q8               0.63             0.36 0.64 1.1
# Q9              -0.70             0.55 0.45 1.2
# Q10             -0.60             0.49 0.51 1.4
# Q11       -0.67                   0.39 0.61 1.1
# Q12       -0.75                   0.61 0.39 1.1
# Q13        0.57                   0.50 0.50 1.7
# Q14        0.71                   0.60 0.40 1.2
# Q15        0.53                   0.46 0.54 1.9
# Q16  0.87                         0.74 0.26 1.3
# Q17  0.83                         0.66 0.34 1.2
# Q18  0.77                         0.59 0.41 1.0
# Q19  0.56                         0.55 0.45 1.8
# Q20  0.56                         0.40 0.60 1.6
# Q21                          0.55 0.39 0.61 1.3
# Q22                         -0.51 0.30 0.70 1.5
# Q23                          0.62 0.54 0.46 1.5
# Q24                          0.45 0.35 0.65 2.6
# Q25                         -0.60 0.36 0.64 1.1
# 
#                       PA2  PA1  PA3  PA5  PA4
# SS loadings           2.95 2.91 2.29 2.04 1.84
# Proportion Var        0.12 0.12 0.09 0.08 0.07
# Cumulative Var        0.12 0.23 0.33 0.41 0.48
# Proportion Explained  0.25 0.24 0.19 0.17 0.15
# Cumulative Proportion 0.25 0.49 0.68 0.85 1.00
# 
# With factor correlations of 
#      PA2   PA1   PA3   PA5  PA4
# PA2  1.00 -0.26 -0.23 -0.02 0.04
# PA1 -0.26  1.00  0.41  0.35 0.16
# PA3 -0.23  0.41  1.00  0.25 0.21
# PA5 -0.02  0.35  0.25  1.00 0.16
# PA4  0.04  0.16  0.21  0.16 1.00
# 
# Mean item complexity =  1.4
# Test of the hypothesis that 5 factors are sufficient.
# 
# df null model =  300  with the objective function =  9.59
# df of  the model are 185  and the objective function was  0.92 
# 
# The root mean square of the residuals (RMSR) is  0.03 
# The df corrected root mean square of the residuals is  0.04 
# 
# Fit based upon off diagonal values = 0.98
# Measures of factor score adequacy             
#                                                   PA2  PA1  PA3  PA5  PA4
# Correlation of (regression) scores with factors   0.94 0.93 0.91 0.89 0.87
# Multiple R square of scores with factors          0.88 0.86 0.82 0.80 0.76
# Minimum correlation of possible factor scores     0.77 0.72 0.64 0.59 0.52

In this solution, we have obtained the ideal simple structure where each question item loads on a unique factor. Perfect!

The next step is reviewing these question groups and trying to find a common concept that these questions seem to measure. This step is more of an art than a science and requires a combination of intuition, domain knowledge, and context.

In a scenario where we are scientists analyzing this data, it’s imperative to apply our knowledge of psychology to interpret the meaning behind these groupings. For instance, if the first five questions revolve around themes like being cooperative, kind to others, and displaying trustworthiness, we might conclude that they relate to the Agreeableness trait of the Big Five personality traits.

Imagine that, after this evaluation step, we concluded that the factors PA2, PA1, PA3, PA5, and PA4 measure Neuroticism, Extraversion, Conscientiousness, Agreeableness, and Openness, respectively. You can run the code below and look up the item descriptions in the dataset documentation to confirm this partitioning.

names(bfi[, 16:20]) # original variable names per factor
# [1] "N1" "N2" "N3" "N4" "N5"
names(bfi[, 11:15])
# [1] "E1" "E2" "E3" "E4" "E5"
names(bfi[, 6:10])
# [1] "C1" "C2" "C3" "C4" "C5"
names(bfi[, 1:5])
# [1] "A1" "A2" "A3" "A4" "A5"
names(bfi[, 21:25])
# [1] "O1" "O2" "O3" "O4" "O5"

It looks like we discovered the underlying themes in our dataset! In the next section, we will check if the selected item groups indeed measure the same concept, also called internal consistency or reliability.

 

Reliability Check of Item Groups

One common way to test internal consistency is to calculate Cronbach’s alpha. Using the alpha() function of the psych package, we can calculate the statistic for each extracted factor. Values over 0.70 are generally considered to indicate acceptable reliability. See the code below.

alpha1 <- psych::alpha(mydata_nm[ ,16:20], check.keys=TRUE) # cronbach alpha per factor
alpha2 <- psych::alpha(mydata_nm[ ,11:15], check.keys=TRUE)
alpha3 <- psych::alpha(mydata_nm[ ,6:10], check.keys=TRUE)
alpha4 <- psych::alpha(mydata_nm[ ,1:5], check.keys=TRUE)
alpha5 <- psych::alpha(mydata_nm[ ,21:25], check.keys=TRUE)

As seen, we have used the fully qualified name psych::alpha to avoid clashes between function names of the loaded packages. Besides, we set the check.keys argument to TRUE to reverse-score the items that are negatively associated with the factor; such reversed items are indicated by a negative sign appended to the variable name in the output.
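
To get a quick overview of all five factors at once, you could extract the raw alpha values from the objects just created, as in this small sketch:

sapply(list(Neuroticism       = alpha1,  # raw Cronbach's alpha per factor
            Extraversion      = alpha2,
            Conscientiousness = alpha3,
            Agreeableness     = alpha4,
            Openness          = alpha5),
       function(a) a$total$raw_alpha)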

Let’s now print the first alpha statistic to get an overview of the stored information!

alpha1 # print alpha1
# Reliability analysis   
# Call: psych::alpha(x = mydata_nm[, 16:20], check.keys = TRUE)
# 
# raw_alpha std.alpha G6(smc) average_r S/N    ase mean  sd median_r
#      0.82      0.82     0.8      0.47 4.5 0.0059  3.2 1.2     0.42
# 
# 95% confidence boundaries 
# lower alpha upper
# Feldt     0.81  0.82  0.83
# Duhachek  0.81  0.82  0.83
# 
# Reliability if an item is dropped:
#     raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
# Q16      0.76      0.76    0.71      0.44 3.2   0.0080 0.0061  0.42
# Q17      0.77      0.77    0.72      0.45 3.3   0.0077 0.0056  0.42
# Q18      0.76      0.76    0.74      0.44 3.2   0.0081 0.0186  0.40
# Q19      0.80      0.80    0.77      0.50 4.0   0.0068 0.0192  0.49
# Q20      0.82      0.82    0.79      0.53 4.5   0.0061 0.0140  0.54
# 
# Item statistics 
#        n raw.r std.r r.cor r.drop mean  sd
# Q16 2436  0.81  0.81  0.78   0.68  2.9 1.6
# Q17 2436  0.79  0.79  0.75   0.65  3.5 1.5
# Q18 2436  0.81  0.81  0.74   0.68  3.2 1.6
# Q19 2436  0.72  0.72  0.60   0.55  3.2 1.6
# Q20 2436  0.68  0.67  0.53   0.49  3.0 1.6
# 
# Non missing response frequency for each item
#        1    2    3    4    5    6 miss
# Q16 0.23 0.24 0.15 0.18 0.12 0.07    0
# Q17 0.12 0.19 0.15 0.25 0.18 0.11    0
# Q18 0.17 0.23 0.13 0.22 0.16 0.09    0
# Q19 0.17 0.24 0.15 0.22 0.14 0.09    0
# Q20 0.24 0.24 0.14 0.18 0.12 0.09    0

Above, you first see the raw alpha statistic for this factor, which is 0.82, and the alpha based on standardized items, std.alpha, which is also 0.82. Other statistics, such as the average inter-item correlation average_r and the signal-to-noise ratio S/N, are provided as well.

Next, you see the confidence intervals and the alpha statistics obtained if a particular item is dropped. The fact that raw_alpha never exceeds the overall raw alpha of 0.82 after any drop provides evidence that none of the items negatively impacts the factor’s internal consistency.

In the last two tables, you see item-level statistics, such as the correlations of each item with the total score (raw.r, std.r, etc.), and the response frequencies, including the proportion of missing cases. Remember that we already removed the cases with missingness at the beginning; therefore, no missingness is detected.

Everything looks fine for the first factor Neuroticism. If the same applies to the rest, you can visualize your factors and corresponding measuring items. Let’s check how to do it in the next section!

 

Visualisation of Results

The psych package provides the fa.diagram() function for visualizing the factor-item relations in a diagram. It is enough to pass our EFA object to the function.

fa.diagram(efa_promax5) # draw factor diagram

By default, only loadings above 0.3 are shown, and only the highest loading per item is drawn. Since our final model has a simple structure and our cutoff is 0.40, we don’t need to change the default settings. See the corresponding visual below.

 

As seen, the negative associations are shown with red dashed lines, while the solid black lines represent the positive ones. The factor partitioning is clearly represented, and the factor correlations above 0.30 are also shown, which gives further insight into the relations in the data.

If you want to change the cutoff value, set the cut argument accordingly. Also, you can set the simple argument to FALSE if you are interested in showing the cross-loadings (items loading sufficiently on more than one factor).
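
For example, a sketch matching our 0.40 cutoff and additionally displaying cross-loadings:

fa.diagram(efa_promax5,    # draw factor diagram with custom settings
           cut = 0.4,      # only show loadings above 0.4
           simple = FALSE) # also show cross-loadings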

While fa.diagram() offers a quick way to produce a diagram, it doesn’t allow for fine-grained control over the placement of items and factors. Also, when the model is more complicated, it may not deliver the optimal visual design. Therefore, sometimes using other visualization tools like Lucidchart could be more beneficial for customized designs.

 

Video, Further Resources & Summary

Do you need more explanations on how to conduct EFA in R? Then you might check out the following video of the Statistics Globe YouTube channel.

In the video tutorial, we explain how to employ EFA and interpret the results.

 

The YouTube video will be added soon.

 

Furthermore, you could have a look at some of the other tutorials on Statistics Globe.

This article has demonstrated how to perform EFA in R. If you have further questions, you may leave a comment below.

 

Rana Cansu Kebabci, Statistician & Data Scientist

This page was created in collaboration with Cansu Kebabci. You might have a look at Cansu’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.

 
