# Principal Component Analysis in Python (Example Code)

In this tutorial I’ll explain how to perform a Principal Component Analysis (PCA) using scikit-learn in the Python programming language.

Read on if you want to learn more about PCA in Python.

## Step 1: Import Libraries and Prepare the Data

First of all, we need to import the libraries we will use for the PCA. They cover data handling, numerical calculation, model building, and data visualization:

```python
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

plt.style.use('ggplot')
```

To perform this PCA, we will use a data set from the scikit-learn library (formerly scikits.learn; also known as sklearn). First, we use the load_diabetes() function from scikit-learn to load our data set, and then convert it into a pandas DataFrame:

```python
diabetes = load_diabetes()
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df.head(6)
```

As our data set has 442 rows and 10 columns, we’ve used the head() function to show only the first 6 rows.

Now we can go straight to the analysis and visualization.

## Step 2: Standardize the Data

We need to scale the features in our DataFrame before applying the PCA. PCA is sensitive to the variance of each feature, so features measured on larger scales would otherwise dominate the components. To achieve this, we can use StandardScaler() from scikit-learn, which standardizes each feature to a mean of 0 and a variance of 1.

As StandardScaler() is a class, we can create an instance of it, fit it to our data, and then use it to transform the data:

```python
scaler = StandardScaler()
scaler.fit(df)
Diabetes_scaled = scaler.transform(df)
```

As a result, we will obtain a two-dimensional NumPy array that will also have 442 rows and 10 columns. If we want to see this array as a DataFrame, we can use this code:

```python
dataframe_scaled = pd.DataFrame(data=Diabetes_scaled, columns=diabetes.feature_names)
dataframe_scaled.head(6)
```
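To make the standardization step more concrete, the following minimal sketch compares the StandardScaler() result with a manual z-score calculation and checks that each column ends up with mean 0 and variance 1. The variable names `X`, `X_scaled` and `X_manual` are only used for this illustration and are not part of the tutorial code.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler

X = load_diabetes().data  # feature matrix with 442 rows and 10 columns

# Standardize with scikit-learn, as in the tutorial
X_scaled = StandardScaler().fit_transform(X)

# Manual z-score: subtract the column mean and divide by the column
# standard deviation (population version, ddof=0, as StandardScaler uses)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Each check should confirm that both approaches agree
print(np.allclose(X_scaled, X_manual))
print(np.allclose(X_scaled.mean(axis=0), 0))
print(np.allclose(X_scaled.var(axis=0), 1))
```

If all three checks pass, the scaled array is simply the z-score of each feature.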

## Step 3: Perform and Visualize the PCA

Once our data is standardized, we are ready to create our PCA. We can do this using the PCA class from sklearn.decomposition.

First, we will choose the number of components for our PCA. Then, we will project our standardized data onto those principal components and create a new DataFrame for the PCA:

```python
pca = PCA(n_components=4)
PC = pca.fit_transform(Diabetes_scaled)
pca_diabetes = pd.DataFrame(data=PC, columns=['PC1', 'PC2', 'PC3', 'PC4'])
pca_diabetes.head(6)
```
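To illustrate what fit_transform() computes, here is a rough sketch that reproduces the principal component scores with a singular value decomposition (SVD) in NumPy. It is meant as an illustration under the assumption that the data is already standardized, not as a description of scikit-learn's internal code. The names `scores_sklearn`, `scores_svd` and `ratio_svd` are chosen for this example only. The sign of a component is arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_diabetes().data)

# Scores from scikit-learn, as in the tutorial
pca = PCA(n_components=4)
scores_sklearn = pca.fit_transform(X_scaled)

# SVD of the standardized (and therefore centered) data: X = U * S * Vt
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)

# The scores are the left singular vectors scaled by the singular values
scores_svd = U[:, :4] * S[:4]

# The explained variance ratio follows from the squared singular values
ratio_svd = S[:4] ** 2 / np.sum(S ** 2)

# Compare up to the sign of each component
print(np.allclose(np.abs(scores_sklearn), np.abs(scores_svd), atol=1e-6))
print(np.allclose(ratio_svd, pca.explained_variance_ratio_))
```

The rows of `Vt` play the role of `pca.components_`, that is, the directions onto which the data is projected.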

Now, we can visualize the data from the first two principal components:

```python
sns.set()

sns.lmplot(
    x='PC1',
    y='PC2',
    data=pca_diabetes,
    fit_reg=False,
    legend=True
)

plt.title('2D PCA Graph')
plt.show()
```

## Step 4: Visualize the Explained Variance by each Principal Component

Once we have calculated the principal components, we can check the variance explained by each of them using the `explained_variance_ratio_` attribute. We use a for loop to annotate each bar with its value:

```python
fig, ax = plt.subplots(nrows=1, ncols=1)

# Component numbers 1, 2, ..., n_components
components = np.arange(pca.n_components_) + 1

ax.bar(
    x=components,
    height=pca.explained_variance_ratio_,
    color="blue"
)

# Write the rounded ratio above each bar
for x, y in zip(components, pca.explained_variance_ratio_):
    label = round(y, 2)
    ax.annotate(
        label,
        (x, y),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center'
    )

ax.set_xticks(components)
ax.set_ylim(0, 1.1)
ax.set_title('PCA Explained Variance Ratio')
ax.set_xlabel('Principal Components')
plt.show()
```

We can see that the first two components together explain about 55% of the variance. Together with the third and fourth principal components, they explain about 77% of the variance in our data.
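The cumulative percentages can also be computed directly instead of being read from the chart. The sketch below sums the explained variance ratios with np.cumsum(). It also shows that PCA() accepts a fraction as `n_components`, in which case scikit-learn keeps as many components as needed to reach that share of the variance. The 80% threshold and the name `pca_80` are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_diabetes().data)
pca = PCA(n_components=4).fit(X_scaled)

# Running total of the explained variance ratio
cumulative = np.cumsum(pca.explained_variance_ratio_)
for i, value in enumerate(cumulative, start=1):
    print(f"First {i} component(s): {value:.2f}")

# Alternative: let scikit-learn choose the number of components needed
# to explain at least 80% of the variance (hypothetical threshold)
pca_80 = PCA(n_components=0.80, svd_solver="full").fit(X_scaled)
print(pca_80.n_components_)
```

Choosing the number of components through a variance threshold can be handy when there is no fixed target dimension.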

## Step 5: Create a New DataFrame Using Principal Components

Once the dimensions of our main DataFrame have been reduced, we can create a new DataFrame choosing the principal components we want to keep. We can keep the four principal components:

```python
pca_diabetes.head(6)
```

But we can also reduce it to two components:

```python
# Keep only the first two principal components
df_new = pca_diabetes[['PC1', 'PC2']]
df_new.head(6)
```

Once we get our desired DataFrame, we can export it.
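One possible way to export the result is the to_csv() method of pandas. The sketch below rebuilds a two-component DataFrame so that it runs on its own, writes it to a CSV file, and reads it back as a check. The file name `pca_diabetes.csv` and the use of the system's temporary folder are assumptions for this example; replace them with your own path.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild the two-component DataFrame from the previous steps
X_scaled = StandardScaler().fit_transform(load_diabetes().data)
PC = PCA(n_components=2).fit_transform(X_scaled)
df_new = pd.DataFrame(data=PC, columns=['PC1', 'PC2'])

# Hypothetical output location; adjust to your own folder and file name
out_path = Path(tempfile.gettempdir()) / "pca_diabetes.csv"

# index=False leaves the row numbers out of the file
df_new.to_csv(out_path, index=False)

# Read the file back to confirm the export worked
df_check = pd.read_csv(out_path)
print(df_check.shape)
```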

## Video, Further Resources & Summary

Do you want to learn more about how to perform a PCA using scikit-learn? Then have a look at the following YouTube video of the Statistics Globe YouTube channel.

*The YouTube video will be added soon.*

You can also check out some of the other tutorials available on Statistics Globe:

- Axis in pandas DataFrame Explained
- Change Data Type of pandas DataFrame Column in Python
- Combine Two Text Columns of pandas DataFrame in Python

In this post, you have learned how to **perform a PCA using scikit-learn in Python**. If you have any further questions, you can leave a comment below.

This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.
