What is a Principal Component Analysis (PCA)? – Tutorial & Example
High dimensional data is hard to explore and visualize. Thus, the main idea of the PCA (Principal Component Analysis) is to reduce the number of variables in a data set while preserving as much information as possible.
In this tutorial, you’ll learn about the steps and application of the Principal Component Analysis.
The table of contents is structured as follows:
What is a Principal Component Analysis (PCA)?
Principal Component Analysis (PCA) is a mathematical algorithm whose objective is to reduce the dimensionality of a data set while explaining as much of its variation as possible.
This variable reduction is accomplished through a linear transformation of the original variables into new components, which are smaller in number and account for most of the variation in the data.
Step by Step PCA (Example)
There are several steps involved in conducting a PCA. Let’s dive in.
Step 1: Standardize the Data Set
Let’s say we have a data set with 4 variables and 4 observations, as shown below:
The first step is to standardize, which means transforming all variables so that they have means of zero and standard deviations of one, hence variances of one.
This is done to ensure that there is no imbalance in the contribution of the variables due to unit differences. Otherwise, the variables with higher variances would contribute more than those with lower variances in identifying the principal components, even though this does not reflect their actual importance. For further explanation, see the PCA Using Correlation & Covariance Matrix tutorial.
The standardization formula is given below: z = (x − x̄) / s, where x̄ is the mean and s is the standard deviation of the variable.
For each feature, the mean and the standard deviation before standardization were as follows.
As you see, the variability does not differ much across features in this sample; hence standardization is not strictly necessary here. However, for the sake of illustration, the variables were standardized and took the following values.
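The standardization step can be sketched in a few lines of NumPy. Note that the values below are hypothetical, since the original table is not reproduced here; only the mechanics are illustrated.

```python
import numpy as np

# Hypothetical 4 x 4 data set (4 observations, 4 variables); the values
# are illustrative only, not the ones from the tutorial's table.
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.0, 6.0, 7.0],
    [1.0, 4.0, 2.0, 3.0],
    [5.0, 3.0, 2.0, 1.0],
])

# Standardize each column: subtract its mean and divide by its
# sample standard deviation (ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(Z.mean(axis=0))          # numerically zero for every column
print(Z.std(axis=0, ddof=1))   # one for every column
```

After this transformation, every column contributes on the same scale, regardless of its original units.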
Step 2: Calculate the Covariance Matrix
In this step, we want to observe the association between the variables in our data set. Therefore, we calculate the covariance matrix, which here corresponds to the correlation matrix because the variables are standardized. The following formula is used for the computation: cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1).
The resulting covariance matrix is given below.
Positive covariance implies that the variable pair is positively related. In other words, when the magnitude of one variable tends to increase (or decrease), the other does too.
Negative covariance implies that the variable pair is inversely related. In other words, when the magnitude of one variable tends to increase, the other tends to decrease or vice versa.
In our case, only the variable pair x3 and x4 is negatively correlated, whereas all other pairs are positively correlated.
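The equality between the covariance matrix of standardized variables and the correlation matrix of the raw data can be checked directly. Again, the data below is a hypothetical stand-in, not the tutorial's table:

```python
import numpy as np

# Hypothetical raw data (columns are variables)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))

# Standardize each column
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Sample covariance matrix of the standardized variables
# (rowvar=False: columns are variables)
C = np.cov(Z, rowvar=False)

# For standardized variables, the covariance matrix equals the
# correlation matrix of the raw data, with unit variances on the diagonal
print(np.allclose(C, np.corrcoef(X, rowvar=False)))  # True
print(np.allclose(np.diag(C), 1))                    # True
```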
Step 3: Calculate the Eigenvalues and Eigenvectors of the Covariance Matrix
To determine principal components, we need eigenvectors and eigenvalues, which inform us about the directions and the magnitude of the spread of our data. The first thing we need to understand is that they always come in pairs: every eigenvector has an eigenvalue to describe its magnitude.
As stated earlier, the “principal components” are new variables formed via linear transformations of the original variables. The eigenvectors give the weights used in this linear transformation, and the eigenvalues tell how much variance is explained by the newly transformed variables.
Ranking our eigenvectors based on their eigenvalues, from the highest to the lowest, allows us to select the principal components, which explain most of the variation in the dataset.
In our case, solving the characteristic equation det(C − λI) = 0 for the covariance matrix C leads to the result below:
λ1 = 1.6698239685
λ2 = 1.0144883673
λ3 = 0.3151205250
λ4 = 0.0005671392
Now we can calculate the eigenvectors. The following result is obtained:
The eigenvalues are ranked in descending order as λ1, λ2, λ3, and λ4. Based on the result, we can choose the top 2 eigenvectors:
For more information on how to select the ideal number of components, you can see the tutorial: Choose Optimal Number of Components for PCA.
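The eigendecomposition and ranking described above can be sketched as follows. The data is again a hypothetical stand-in, so the eigenvalues printed here will not match the λ values listed above:

```python
import numpy as np

# Hypothetical standardized data (20 observations, 4 variables)
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
C = np.cov(Z, rowvar=False)

# eigh is appropriate for symmetric matrices; it returns the
# eigenvalues in ascending order, so we reverse to rank them.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# The eigenvalues of a correlation matrix sum to the number of
# variables, so dividing each by their sum gives the proportion
# of variance explained by each component.
print(eigenvalues)
print(eigenvalues / eigenvalues.sum())

# Keep the top 2 eigenvectors as the weight matrix W
W = eigenvectors[:, :2]
```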
Step 4: Recast the Data
Now we can reorient the data to the new axes, the ones represented by the principal components, so that the original (standardized) variables can be expressed in terms of principal component scores.
The corresponding linear transformation is shown below: the matrix of principal component scores is obtained as T = Z · W, where Z is the standardized data matrix and W contains the selected eigenvectors as columns.
The resulting transformed data is as follows:
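This recasting step amounts to a single matrix multiplication. The sketch below again uses hypothetical data, so the resulting scores will not match the tutorial's table:

```python
import numpy as np

# Hypothetical standardized data and its covariance matrix
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
C = np.cov(Z, rowvar=False)

vals, vecs = np.linalg.eigh(C)
W = vecs[:, np.argsort(vals)[::-1][:2]]  # top-2 eigenvectors as columns

# Principal component scores: each observation expressed on the new axes
scores = Z @ W
print(scores.shape)  # (20, 2)

# The scores are uncorrelated: their covariance matrix is diagonal,
# with the retained eigenvalues on the diagonal.
print(np.cov(scores, rowvar=False))
```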
This example shows how PCA allows us to reduce the dimensions of our data set while keeping most of the valuable information. For a more profound understanding of the theoretical background of the Principal Component Analysis, you can take a look at:
What is Principal Component Analysis? by Markus Ringnér, published in 2008
Principal Component Analysis by Ian Jolliffe, published in 2002
We also advise you to check our tutorial discussing the pros and cons of conducting PCA.
PCA in Practice (Example)
So far, we have explained PCA theoretically, but we haven’t shown its use in practice yet. Let’s see it in an application now!
We will use the pizza dataset for illustration. The dataset contains the id, the brand, and the nutritional content of 300 samples from 10 brands. For a quick overview, see the output below.
In the data, moist, prot, fat, ash, sodium, carb and cal refer to the amount of water, protein, fat, ash, sodium, carbohydrates and calories per 100 grams of the sample.
Based on this data, which brand is the best for you? It would be hard to evaluate all brands and their nutritional contents at a glance.
The good news is: If there are meaningful associations among the nutrients, the variability of the brands can be explained by fewer pieces of information than given in the dataset. Let’s take a look at the associations then! See the correlation matrix given in Table 11 below.
As seen in the output, the moisture is negatively correlated with the carbohydrates and calories; the protein is positively associated with the fat, ash and sodium but negatively with the carbohydrates; the fat is positively correlated with the ash, sodium and calories but negatively with the carbohydrates; and so on.
The relations observed in the first three columns already show some patterns. For instance, usually, there is a contrast between the carbohydrates and other nutrients, or the ash, protein, fat, and sodium increase or decrease in the ingredients together.
The data is promising to employ a PCA, considering the observed relational patterns. In other words, new variables (principal components), which are less in number, can be introduced to account for the variability instead of the original variables. For example, a new component can indicate that the pizza is low in ash, protein, fat, and sodium as all these nutrients tend to increase or decrease together.
Without losing too much time, let’s implement the PCA! In the present tutorial, we focus on the theoretical explanation only; you can visit our tutorials to learn how to conduct a PCA in R and a PCA in Python. So, let’s move straight to the results for the pizza dataset.
The proportion of variance column in Table 12 refers to the scaled eigenvalues, which are theoretically explained in Step 3 of the previous section. Based on the result, the first two principal components account for 59.6% + 32.7% = 92.3% of the variance in the data.
Since 92% is a considerable amount, it is sufficient to retain the first two principal components in the analysis. For more information on how to select the ideal number of components, you can visit the tutorial: Choose Optimal Number of Components for PCA.
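In scikit-learn, the scaled eigenvalues are exposed as `explained_variance_ratio_`. The pizza dataset itself is not bundled with this sketch, so synthetic data of the same shape (300 samples, 7 nutrient variables) stands in for it; the printed ratios will therefore differ from Table 12:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pizza data: two latent factors drive
# seven observed variables, plus a little noise.
rng = np.random.default_rng(42)
base = rng.normal(size=(300, 2))
X = base @ rng.normal(size=(2, 7)) + 0.1 * rng.normal(size=(300, 7))

# Standardize, then fit the PCA
Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)

# explained_variance_ratio_ holds the proportion of variance per
# component, in descending order; the ratios sum to 1.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_[:2].sum())
```

With only two latent factors generating the data, most of the variance lands on the first two components, mirroring the 92.3% observed for the pizza data.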
Now it is time to see the relationship between the retained components and the original variables. The plots showing these relations are called loadings plots, which show the weights (loadings) used in the linear transformation of the original variables to the principal components. See the loadings table and plot below.
Be aware that each row represents an eigenvector, as theoretically explained in Step 3 of the previous section. Based on the weights given in Table 13, it is fair to say that PC1 represents being rich in fat, ash and sodium and poor in carbohydrates, whereas PC2 represents being rich in calories and poor in moisture/water. Let’s now visualize these relations on a loadings plot.
The vectors in Figure 1 show the loadings per variable from Table 13. You can observe that the vector components are 0.065 and -0.628 for moist, 0.379 and -0.27 for prot, etc. The visualization enables a better understanding of what the PCs represent and of each variable’s share in those representations. The interpretation relies on the vector projections.
Concerning the projections on the PC1 axis, cal, fat, sodium, ash, prot and moist are in the same direction as PC1 in differing magnitudes, whereas carb is in the opposite direction. Considering the projections on the PC2 axis, carb, cal, fat and sodium are in the same direction as PC2 in differing magnitudes, while moist, prot and ash are in the opposite direction.
If the relatively higher magnitudes are taken into account:
- PC1 refers to the richness in fat, sodium, ash and protein, and the lack of carbohydrates.
- PC2 refers to the richness in calories and the lack of moisture.
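A loadings table like Table 13 can be inspected programmatically as well. The sketch below uses synthetic data in place of the pizza dataset, so the printed numbers will not match Table 13; only the variable names from the text and the mechanics of reading sign and magnitude are shown:

```python
import numpy as np

# Synthetic stand-in for the standardized pizza data
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 7))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the covariance (= correlation) matrix,
# keeping the two leading eigenvectors as loading columns
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(vals)[::-1]
loadings = vecs[:, order[:2]]  # 7 x 2: one row per variable

variables = ["moist", "prot", "fat", "ash", "sodium", "carb", "cal"]
for name, (l1, l2) in zip(variables, loadings):
    # Sign tells the direction relative to the PC; magnitude tells
    # how strongly the variable contributes to it.
    print(f"{name:>6}: PC1 = {l1:+.3f}, PC2 = {l2:+.3f}")
```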
Now it’s time to use this inference in deciding which brand works best for you. To do that, we need to plot the PC scores of the pizza samples and the variable loadings on a single plot, which is called a biplot. For further info on biplots, visit the tutorial: Biplot for PCA Explained. Concerning the pizza samples of the 10 brands, the following biplot is obtained.
In Figure 2, the axes show the PC scores and the points refer to the individual pizza samples, colored by brand. Additionally, the loading vectors placed at the center guide the user through the underlying relations between the original variables and the principal components.
Now we know the samples’ PC scores, the brands, and what PC scores refer to. In light of that information, you can pick the pizza of your preference.
Here are some possible options:
- If you want a fatty and crispy pizza, Brand A would be a good option.
- If you want a soft and non-fat pizza, Brand I would be a good choice.
- If you prefer a balance of nutrients, Brands B and J could be good choices. However, the pizza from Brand B would still be more fat-based than that from Brand J, which would be more carbohydrate-based.
Which brand would you prefer and why? You can share it in the comments below 🙂 🍕
Video, Further Resources & Summary
Do you need more explanations on how to perform a Principal Component Analysis? Then you should have a look at the following video from the Statistics Globe YouTube channel. In the video, the theoretical parts of this article are explained in much more detail:
Furthermore, you could have a look at some of the other tutorials on Statistics Globe:
- PCA Using Correlation & Covariance Matrix
- Advantages & Disadvantages of Principal Component Analysis
- Choose Optimal Number of Components for PCA
- Principal Component Analysis in R
- Principal Component Analysis in Python
- Biplot for PCA Explained
- Can PCA be Used for Categorical Variables?
- Statistical Methods
This post has shown what a PCA is, how to perform it step by step, and how it is used in practice. In case you have further questions, you may leave a comment below.