Can PCA be Used for Categorical Variables? (Alternatives & Example)

 

If you want to reduce the dimensionality of your data frame, you might have thought of using Principal Component Analysis (PCA). But can PCA be used on a data frame that contains categorical variables?

In this tutorial you’ll learn about alternatives to Principal Component Analysis (PCA) for data frames that contain categorical variables. You will also learn how to implement these alternatives using the R programming language.

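We will talk about the following:

1) Can PCA be Performed on a Data Frame that Contains Categorical Variables?
2) Add-On Libraries
3) Alternative to PCA for Categorical Variables: Factor Analysis of Mixed Data (FAMD)
4) Alternative to PCA for Categorical Variables: Multiple Correspondence Analysis (MCA)
5) Video, Further Resources & Summary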

Continue reading if you are looking for a way to adapt the PCA for categorical variables.

 

Can PCA be Performed on a Data Frame that Contains Categorical Variables?

The answer isn’t a simple yes or no: it is possible to perform a PCA on a data frame that contains categorical variables, but it is usually not the best option.

The main reason is that PCA is designed for numerical variables: it works by decomposing the variance (covariance) structure of the data. Categorical variables have no natural numeric scale, so their variance and covariance are not meaningfully defined, and PCA tends to give results that are hard to interpret.

One way to run PCA on a data set that contains categorical variables is to convert each of them into a series of binary (0/1) indicator variables, one per category. However, especially when all the variables in the data set end up being binary, this usually doesn’t make much sense: there are alternatives better suited to analyzing data that contains categorical variables.
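To make the binary-encoding idea concrete, here is a minimal sketch in base R. The toy data frame and its column names (height, color) are made up for illustration and are not part of the tutorial’s data. It only shows that the recoding is mechanically possible, not that the resulting components are easy to interpret.

```r
# Hypothetical toy data: one numeric and one categorical column
toy <- data.frame(
  height = c(170, 165, 180, 175, 160, 185),
  color  = factor(c("red", "blue", "red", "green", "blue", "green"))
)

# Expand the factor into 0/1 indicator columns
# ("- 1" drops the intercept so every level gets its own column)
dummies <- model.matrix(~ color - 1, data = toy)

# Combine with the numeric column and run a standard PCA
toy_encoded <- cbind(height = toy$height, dummies)
toy_pca <- prcomp(toy_encoded, center = TRUE, scale. = TRUE)

summary(toy_pca)
```

Note that the indicator columns of one factor always sum to 1 in every row, so they are linearly dependent. This is one reason why the dedicated methods described below are usually preferable.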

 

Add-On Libraries

For this tutorial, we will need to use the FactoMineR, the vcd and the factoextra packages. If needed, you can install these packages using the code below:

install.packages("FactoMineR")
install.packages("vcd")
install.packages("factoextra")

Next, load the libraries:

library(FactoMineR)
library(vcd)
library(factoextra)

 

Alternative to PCA for Categorical Variables: Factor Analysis of Mixed Data (FAMD)

Factor Analysis of Mixed Data (FAMD) is also a principal component method. It makes it possible to analyze the similarity between individuals while taking into account both quantitative and qualitative variables.

The algorithm has two parts: first, it encodes the data appropriately and, second, it searches iteratively for the K principal components of the encoded data set. This search for principal components works the same way as in PCA.

During FAMD, both quantitative and qualitative variables are normalized. This balances the influence of the two sets of variables, so that neither type dominates the analysis.
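The encoding and normalization step can be sketched in base R. This is an illustration under stated assumptions, not FactoMineR’s actual implementation. It follows the commonly described FAMD scheme: standardize the numeric columns, divide each 0/1 indicator column by the square root of its category proportion, then center. It uses the built-in iris data rather than the tutorial’s data, and scale() divides by n - 1, so the numbers may differ slightly from what FAMD() reports.

```r
# Toy mixed data: four numeric columns and one factor (built-in iris data)
df <- iris
num_cols <- sapply(df, is.numeric)

# Quantitative part: center and scale to unit variance
Z_num <- scale(df[, num_cols])

# Qualitative part: one 0/1 indicator column per category
ind <- model.matrix(~ Species - 1, data = df)
p_k <- colMeans(ind)                      # proportion of rows in each category
Z_cat <- sweep(ind, 2, sqrt(p_k), "/")    # divide each column by sqrt(proportion)
Z_cat <- scale(Z_cat, center = TRUE, scale = FALSE)

# Combined table, analysed with an ordinary PCA without further scaling
Z <- cbind(Z_num, Z_cat)
famd_sketch <- prcomp(Z, center = FALSE, scale. = FALSE)

summary(famd_sketch)
```

With this weighting, a rare category is not drowned out by frequent ones, and the categorical variable as a whole carries an influence comparable to that of a standardized numeric variable.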

An Example of Factor Analysis of Mixed Data (FAMD) Using the R Programming Language

In R, this analysis can be run with the FAMD() function from the FactoMineR package.

In order to explain this example, we will be using a subset of the wine data set from the FactoMineR package:

data(wine)
wine_data <- wine[,c(1,2,13,22,24,28,30)]
head(wine_data)

Our data set consists of 21 rows and 7 columns: the first two columns are categorical variables (Label and Soil) and the other five columns are numeric. We can see this structure by using the str() function:

str(wine_data)
 
# 'data.frame':	21 obs. of  7 variables:
#  $ Label          : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
#  $ Soil           : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
#  $ Fruity         : num  2.88 2.56 2.77 2.39 3.16 ...
#  $ Acidity        : num  2.11 2.11 2.18 3.18 2.57 ...
#  $ Alcohol        : num  2.5 2.65 2.64 2.5 2.79 ...
#  $ Intensity      : num  2.86 2.89 3.07 2.46 3.64 ...
#  $ Overall.quality: num  3.39 3.21 3.54 2.46 3.74 ...

Now, we can run the Factor Analysis of Mixed Data (FAMD) on our data frame:

wine_famd <- FAMD(wine_data, 
                  graph=FALSE)
wine_famd
 
# *The results are available in the following objects:
 
#   name          description                             
# 1 "$eig"        "eigenvalues and inertia"               
# 2 "$var"        "Results for the variables"             
# 3 "$ind"        "results for the individuals"           
# 4 "$quali.var"  "Results for the qualitative variables" 
# 5 "$quanti.var" "Results for the quantitative variables"

We’ve set graph = FALSE here. With graph = TRUE, FAMD() also draws the individuals factor map, the graph of the variables, the graph of the categories, and the graph of the quantitative variables.
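Before plotting, it can help to check how much of the total variability each dimension captures. The sketch below simply reads the $eig element listed in the output above. It repeats the data preparation so that it runs on its own, and it assumes the FactoMineR package is installed.

```r
library(FactoMineR)

# Same subset of the wine data as in the example above
data(wine)
wine_data <- wine[, c(1, 2, 13, 22, 24, 28, 30)]
wine_famd <- FAMD(wine_data, graph = FALSE)

# Eigenvalue, percentage of variance and cumulative percentage per dimension
round(wine_famd$eig, 3)
```

Dimensions with a small percentage of variance contribute little to the factor maps, so this table is a reasonable guide for deciding how many dimensions to interpret.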

We can look at the individuals factor map using the fviz_famd_ind() function from the factoextra package, coloring the individuals by their cos2 (squared cosine) values, which indicate how well each individual is represented on the map:

fviz_famd_ind(wine_famd,
              col.ind = "cos2",
              gradient.cols = c("blue", "orange", "red"),
              repel = TRUE)

 

Alternative to PCA for Categorical Variables: Multiple Correspondence Analysis (MCA)

Another alternative to Principal Component Analysis for reducing the dimensions of a data set that contains categorical variables is Multiple Correspondence Analysis (MCA). This technique is one of the best-known methods for dimension reduction of categorical data.

MCA is convenient when our data set is composed of categorical variables. It helps us to identify groups of individuals with similar profiles as well as the associations between the categorical variables.
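MCA is usually described as working on an indicator table with one 0/1 column per category of every variable. The following base R sketch builds such a table for a made-up data frame. The column names and values are hypothetical, and the code only illustrates the encoding, not the MCA computation itself.

```r
# Hypothetical toy data with two categorical variables
toy <- data.frame(
  treatment = factor(c("placebo", "treated", "treated", "placebo")),
  sex       = factor(c("female", "male", "female", "male"))
)

# Build one 0/1 column per category of every factor
indicator <- do.call(cbind, lapply(names(toy), function(v) {
  m <- model.matrix(~ toy[[v]] - 1)
  colnames(m) <- paste(v, levels(toy[[v]]), sep = "_")
  m
}))

indicator

# Each row contains exactly one 1 per variable
rowSums(indicator)
```

Individuals with the same pattern of 1s have the same profile, which is what places them close together on an MCA factor map.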

An Example of Multiple Correspondence Analysis (MCA) Using the R Programming Language

We can run a Multiple Correspondence Analysis in R by using the MCA() function from the FactoMineR package.

For this example, we will be using a subset of the Arthritis data set from the vcd package:

data(Arthritis)
arthritis_data <- Arthritis[,c(2,3,5)]
head(arthritis_data)

Now, we can apply the MCA() function to our data frame:

arthritis_mca <- MCA(arthritis_data, 
    ncp = 3, 
    graph = FALSE)
 
arthritis_mca
# **Results of the Multiple Correspondence Analysis (MCA)**
# The analysis was performed on 84 individuals, described by 3 variables
# *The results are available in the following objects:
 
#    name              description                       
# 1  "$eig"            "eigenvalues"                     
# 2  "$var"            "results for the variables"       
# 3  "$var$coord"      "coord. of the categories"        
# 4  "$var$cos2"       "cos2 for the categories"         
# 5  "$var$contrib"    "contributions of the categories" 
# 6  "$var$v.test"     "v-test for the categories"       
# 7  "$ind"            "results for the individuals"     
# 8  "$ind$coord"      "coord. for the individuals"      
# 9  "$ind$cos2"       "cos2 for the individuals"        
# 10 "$ind$contrib"    "contributions of the individuals"
# 11 "$call"           "intermediate results"            
# 12 "$call$marge.col" "weights of columns"              
# 13 "$call$marge.li"  "weights of rows"

In this function, the ncp argument specifies the number of dimensions to keep in the results; here we’ve chosen three. As with the FAMD() function, we can set graph = TRUE to also display the MCA factor maps for the individuals and for the variables.
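To see what ncp affects, we can inspect two of the elements listed in the output above. The sketch repeats the data preparation so that it runs on its own, and it assumes the FactoMineR and vcd packages are installed.

```r
library(FactoMineR)
library(vcd)

# Same subset of the Arthritis data as in the example above
data(Arthritis)
arthritis_data <- Arthritis[, c(2, 3, 5)]
arthritis_mca <- MCA(arthritis_data, ncp = 3, graph = FALSE)

# Eigenvalues and percentage of variance per dimension
round(arthritis_mca$eig, 3)

# Coordinates of the individuals: one row per individual, one column per kept dimension
dim(arthritis_mca$ind$coord)
```

The coordinates stored in $ind$coord are the reduced representation of the data and can be passed on to further analyses such as clustering.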

Below we can see the biplot of individuals and variables using the fviz_mca_biplot() function from the factoextra package:

fviz_mca_biplot(arthritis_mca, 
               repel = TRUE, 
               ggtheme = theme_minimal())

As shown, when reducing the dimensions of a data set that contains both numerical and categorical data, we can use FAMD instead of PCA. If our data set contains only categorical data, MCA is the better choice.

 

Video, Further Resources & Summary

In case you need more explanations on how to reduce the dimensions in a data set that contains categorical data, then you can have a look at the following YouTube video of the Statistics Globe YouTube channel.

 

The YouTube video will be added soon.

 

In this tutorial you’ve seen which alternatives to PCA can be applied to data that contains categorical variables. Leave a comment below if you have any questions.

 

Paula Villasante Soriano Statistician & R Programmer

This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.

 
