# What is a Principal Component Analysis – Tutorial & Example

Smaller data sets are easier and faster to analyze, visualize, and explore. The main idea of Principal Component Analysis (PCA) is therefore to reduce the number of variables in a data set while preserving as much information as possible.

In this tutorial you’ll learn what PCA means and how to perform a Principal Component Analysis step by step, with an example.


## What is a Principal Component Analysis (PCA)?

Principal Component Analysis (PCA) is a mathematical algorithm whose objective is to reduce the dimensionality of the data while retaining most of the variation in the data set.

This reduction is accomplished by identifying directions, known as principal components, along which the variation in the data is largest. Thus, by using a small number of components, each sample can be represented by a few numbers instead of by values for all of the original variables.

## PCA Step by Step with Example

There are several steps involved in performing a PCA. Let’s dive into them.

### Step 1: Standardize the Data Set

Let’s say we have a data set like this one, with 4 variables (also called features):

In this first step, the aim is to standardize the range of the continuous initial variables in our data set so that all of them contribute equally to the analysis.

This is done by applying the standardization formula below to every value of each feature:

$$\large z = \frac{Value\:-\:Mean(μ)}{Standard\: Deviation(σ)}$$

For each feature, the mean and the standard deviation would be:

Once the standardization of the data set is done, all the variables will be transformed to the same scale:
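
The standardization step can be sketched in Python with NumPy. This is a minimal illustration and not the tutorial's own computation: the array `X` is made-up toy data (5 observations, 4 features), and using the population standard deviation (`ddof=0`) is an assumption; pass `ddof=1` if you prefer the sample standard deviation.

```python
import numpy as np

# Hypothetical toy data, NOT the tutorial's data set:
# 5 observations (rows) x 4 features (columns).
X = np.array([
    [1.0, 5.0, 3.0, 1.0],
    [4.0, 2.0, 6.0, 3.0],
    [1.0, 4.0, 3.0, 2.0],
    [4.0, 4.0, 1.0, 1.0],
    [5.0, 5.0, 2.0, 3.0],
])

# Mean and standard deviation of each feature (column).
mean = X.mean(axis=0)
std = X.std(axis=0)  # ddof=0 (population std); an assumption of this sketch

# z = (value - mean) / standard deviation, applied column by column.
Z = (X - mean) / std

print("mean per feature:", mean)
print("std per feature: ", std.round(4))
print(Z.round(4))
```

After this step every column of `Z` has mean 0 and standard deviation 1, so no feature dominates the analysis merely because of its units.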

### Step 2: Calculate the Covariance Matrix for the Features in our Data Set

In this step we want to see whether there is a relationship between the variables in our data set, that is, whether some of them are so strongly correlated that they provide redundant information.

To check for such correlations, we calculate the covariance matrix for the whole data set using the following formula:

$$\large Cov(x,y) = \frac{\sum_{i=1}^{N} (x_i - \overline{x})(y_i - \overline{y})}{N}$$

The covariance matrix defines both the spread (variance) and the orientation (covariance) of our data. For our data set, the covariance matrix will be like this:

The covariance of a variable with itself is its variance (var), and that’s why we can see the variances of each initial variable in the main diagonal. Also, since the covariance is commutative (cov(x1,x2) = cov(x2,x1)), the entries of the covariance matrix are symmetric with respect to the main diagonal.

The resulting covariance matrix will be:

If a covariance turns out to be positive, the two variables are positively correlated: both increase or decrease together.

If it is negative, the two variables are inversely correlated: one increases when the other decreases.
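
As a rough sketch of this step, the covariance matrix can be computed directly from the covariance formula (dividing by N) and compared with NumPy's `np.cov`. The data in `X` is the same made-up toy example as before, not the tutorial's data set, and the variable names are illustrative.

```python
import numpy as np

# Hypothetical toy data (5 observations x 4 features), for illustration only.
X = np.array([
    [1.0, 5.0, 3.0, 1.0],
    [4.0, 2.0, 6.0, 3.0],
    [1.0, 4.0, 3.0, 2.0],
    [4.0, 4.0, 1.0, 1.0],
    [5.0, 5.0, 2.0, 3.0],
])

# Step 1: standardize (population standard deviation, an assumption).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance formula: sum of products of deviations, divided by N.
n = Z.shape[0]
centered = Z - Z.mean(axis=0)
cov = centered.T @ centered / n

# np.cov with bias=True also divides by N; rowvar=False means columns are variables.
cov_numpy = np.cov(Z, rowvar=False, bias=True)

print(cov.round(4))
print("symmetric:", np.allclose(cov, cov.T))
print("matches np.cov:", np.allclose(cov, cov_numpy))
```

The diagonal holds the variance of each standardized feature, and each off-diagonal entry appears twice, mirrored across the diagonal.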

### Step 3: Calculate the Eigenvalues and the Eigenvectors of the Covariance Matrix

This step allows us to identify the principal components. These “principal components” are new variables which are constructed as linear combinations of the initial variables.

In order to determine these principal components, we need the eigenvectors and the eigenvalues of the covariance matrix. The first thing we need to understand about these is that they always come in pairs: every eigenvector has an eigenvalue.

Each eigenvector gives the direction of one of the new axes, and its corresponding eigenvalue describes the magnitude, that is, how much of the variance in the data lies along that direction. Together they help us to determine the principal components of the data.

An eigenvector is a non-zero vector which is only scaled by a scalar factor when a linear transformation is applied to it. This scalar factor is the eigenvalue.

Ranking our eigenvectors in order of their eigenvalues, from the highest to the lowest, we will get the principal components we are looking for.

So, how do we calculate first the eigenvalues and then the eigenvectors of our covariance matrix? We start from this equation, where A is the covariance matrix, ν is an eigenvector, and λ is its eigenvalue:

$$\large Aν = λν$$

Rearranging gives (A − λI)ν = 0, where I is the identity matrix. Since ν is a non-zero vector, this can only hold if:

$$\large det(A-λI) = 0$$
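
To make the procedure concrete, here is a small worked example. The 2×2 matrix is made up for illustration and is not the tutorial's covariance matrix:

$$\large A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad det(A-λI) = (2-λ)^2 - 1 = (λ-1)(λ-3) = 0$$

$$\large λ_1 = 3, \qquad λ_2 = 1$$

Substituting each eigenvalue back into (A − λI)ν = 0 and normalizing to unit length gives the matching eigenvectors:

$$\large ν_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad ν_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$

The same procedure applies to a 4×4 covariance matrix, except that the determinant is then a polynomial of degree 4 in λ, which is usually solved numerically.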

Solving the equation det(A − λI) = 0 for our covariance matrix, we obtain 4 different λ values:

$$\large \begin{aligned} λ_1 &= 1.6698239685 \\ λ_2 &= 1.0144883673 \\ λ_3 &= 0.3151205250 \\ λ_4 &= 0.0005671392 \end{aligned}$$

And from the eigenvalues, we can calculate the eigenvectors by solving (A − λI)ν = 0 for each λ:
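
In practice the eigenvalues and eigenvectors are rarely solved by hand. A minimal sketch with NumPy's `np.linalg.eigh`, which is designed for symmetric matrices such as a covariance matrix, could look like this. The data is again the made-up toy example, so the resulting numbers differ from the λ values of the tutorial.

```python
import numpy as np

# Hypothetical toy data (5 observations x 4 features), for illustration only.
X = np.array([
    [1.0, 5.0, 3.0, 1.0],
    [4.0, 2.0, 6.0, 3.0],
    [1.0, 4.0, 3.0, 2.0],
    [4.0, 4.0, 1.0, 1.0],
    [5.0, 5.0, 2.0, 3.0],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False, bias=True)

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Rank the pairs from the highest to the lowest eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i belongs to eigenvalues[i]

# Check the defining equation A v = lambda v for every pair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(cov @ v, lam * v)

# Keep the top 2 eigenvectors as the feature matrix (4 rows x 2 columns).
feature_matrix = eigenvectors[:, :2]

print("eigenvalues:", eigenvalues.round(4))
print(feature_matrix.round(4))
```

Note that the sign of an eigenvector is arbitrary: multiplying a column by −1 gives an equally valid eigenvector, so different software may report flipped signs.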

As we can see, the eigenvalues are already ranked in descending order, so if we choose the top 2 eigenvectors, our feature matrix will be:
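
One common way to judge how much information the top 2 components keep is to divide each eigenvalue by the sum of all eigenvalues. The short sketch below applies this to the four λ values obtained in this step; the variable names are illustrative.

```python
from itertools import accumulate

# The four eigenvalues obtained in this step, highest to lowest.
eigenvalues = [1.6698239685, 1.0144883673, 0.3151205250, 0.0005671392]

# Share of the total variance captured by each principal component.
total = sum(eigenvalues)
ratios = [ev / total for ev in eigenvalues]

# Running total: variance kept when using the first k components.
cumulative = list(accumulate(ratios))

for i, (ratio, cum) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {ratio:.4f} of the variance, cumulative {cum:.4f}")
```

With these values, the first two components together account for roughly 89.5% of the total variance, which supports keeping only 2 of the 4 dimensions.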

### Step 4: Recast the Data

In this step the aim is to use the feature matrix formed from the selected eigenvectors to reorient the data to the new axes: the ones represented by the principal components.

Thus, our final data set will be the product of the feature matrix we’ve obtained with the top 2 eigenvectors and the standardized original matrix obtained in step 1:

So, our transformed data will look like this:

As we can see, the PCA has allowed us to reduce the size of our data set while keeping most of its valuable information.
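
The recasting step can be sketched as a single matrix product. This example again uses the made-up toy data rather than the tutorial's data set, and it stores observations as rows, so the product is written as `Z @ feature_matrix`; this is the transposed form of multiplying the transposed feature matrix by the transposed standardized data.

```python
import numpy as np

# Hypothetical toy data (5 observations x 4 features), for illustration only.
X = np.array([
    [1.0, 5.0, 3.0, 1.0],
    [4.0, 2.0, 6.0, 3.0],
    [1.0, 4.0, 3.0, 2.0],
    [4.0, 4.0, 1.0, 1.0],
    [5.0, 5.0, 2.0, 3.0],
])

# Steps 1 to 3: standardize, covariance matrix, ranked eigenvectors.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False, bias=True)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
top_eigenvalues = eigenvalues[order[:2]]
feature_matrix = eigenvectors[:, order[:2]]  # 4 features x 2 components

# Step 4: project the standardized data onto the 2 principal components.
transformed = Z @ feature_matrix  # 5 observations x 2 components

print(transformed.shape)
print(transformed.round(4))
```

Each observation is now described by 2 numbers instead of 4, and the variance of each new column equals the eigenvalue of the corresponding component.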

For a more profound understanding of the theoretical background of the Principal Component Analysis, you can take a look at:

What is Principal Component Analysis? by Markus Ringnér, published in 2008

Principal Component Analysis by Ian Jolliffe, published in 2002

## Summary

This post has shown what a PCA is and how to perform it step by step.