Principal Component Analysis in Python (Example Code)

 

In this tutorial I’ll explain how to perform a Principal Component Analysis (PCA) using scikit-learn in the Python programming language.


Take a look if you want to learn more about PCA in the Python programming language.

 

Step 1: Import Libraries and Prepare the Data

First of all, we will need to import some libraries with which we will perform our Python PCA. These will help us with the data analysis, calculation, model building and data visualization:

import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
 
plt.style.use('ggplot')

In order to perform this Python PCA, we will use a data set from the scikit-learn library (formerly scikits.learn; also known as sklearn). First, we will use the load_diabetes() function from scikit-learn to load our data set, and then convert it into a pandas DataFrame:

diabetes = load_diabetes()
df = pd.DataFrame(data=diabetes.data, 
                  columns=diabetes.feature_names)
 
df.head(6)


As our data set contains 442 rows and 10 columns, we've used the head() function to show only the first 6 rows.

Now we can go straight to the analysis and visualization.

 

Step 2: Standardize the Data

We will need to scale the features in our DataFrame before applying the PCA, since PCA is sensitive to the scale of the input features. In order to achieve this, we can use StandardScaler() from scikit-learn, which standardizes each feature to unit scale, i.e. a mean of 0 and a variance of 1.

As StandardScaler() is a class, we create an instance of it and then use it to fit and transform our data:

scaler = StandardScaler()
 
scaler.fit(df)
 
Diabetes_scaled = scaler.transform(df)

As a result, we will obtain a two-dimensional NumPy array that will also have 442 rows and 10 columns. If we want to see this array as a DataFrame, we can use this code:

dataframe_scaled = pd.DataFrame(data=Diabetes_scaled, 
                                columns=diabetes.feature_names)
 
dataframe_scaled.head(6)
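As a quick sanity check (a minimal sketch, not part of the original tutorial), we can confirm that each scaled column now has a mean of approximately 0 and a standard deviation of 1:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler

diabetes = load_diabetes()
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)

scaler = StandardScaler()
Diabetes_scaled = scaler.fit_transform(df)

# Column means should be ~0 and standard deviations ~1 after scaling
print(Diabetes_scaled.mean(axis=0).round(6))
print(Diabetes_scaled.std(axis=0).round(6))
```

Note that fit_transform() combines the fit() and transform() calls from above into a single step.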


 

Step 3: Perform and Visualize the PCA

Once our data is standardized, we are ready to create our PCA. We can do this using the PCA algorithm from sklearn.decomposition.

First, we will choose the number of components for our PCA. Then, we will project our standardized data onto those principal components and store the result in a new DataFrame:

pca = PCA(n_components=4)
PC = pca.fit_transform(Diabetes_scaled)
pca_diabetes = pd.DataFrame(data=PC,
                            columns=['PC1', 'PC2', 'PC3', 'PC4'])
 
pca_diabetes.head(6)
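As a side note (a hedged sketch, not from the original tutorial): instead of a fixed integer, scikit-learn's PCA also accepts a float between 0 and 1 for n_components. In that case, it keeps the smallest number of components that explain at least that share of the variance:

```python
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

diabetes = load_diabetes()
Diabetes_scaled = StandardScaler().fit_transform(diabetes.data)

# Keep as many components as needed to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95)
PC_95 = pca_95.fit_transform(Diabetes_scaled)

print(pca_95.n_components_)
print(pca_95.explained_variance_ratio_.sum())
```

This is convenient when you care about a variance threshold rather than a specific number of dimensions.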


Now, we can visualize the data from the first two principal components:

sns.set()
 
sns.lmplot(
    x='PC1', 
    y='PC2', 
    data=pca_diabetes,  
    fit_reg=False
    )
 
plt.title('2D PCA Graph')
plt.show()
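To interpret such a plot, it can help to look at the component loadings, i.e. how strongly each original feature contributes to each principal component. This is a small sketch (not part of the original tutorial) using the components_ attribute:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

diabetes = load_diabetes()
Diabetes_scaled = StandardScaler().fit_transform(diabetes.data)

pca = PCA(n_components=4)
pca.fit(Diabetes_scaled)

# components_ has one row per principal component and one column per feature
loadings = pd.DataFrame(pca.components_,
                        columns=diabetes.feature_names,
                        index=['PC1', 'PC2', 'PC3', 'PC4'])
print(loadings.round(2))
```

Features with large absolute loadings on PC1 drive the horizontal spread in the scatter plot above.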


 

Step 4: Visualize the Explained Variance by each Principal Component

Once we have calculated the principal components, we can check the share of variance explained by each of them using the explained_variance_ratio_ attribute. We use a for loop to annotate each bar with the corresponding value:

fig, ax = plt.subplots(nrows=1, 
                       ncols=1)
ax.bar(
    x      = np.arange(pca.n_components_) + 1,
    height = pca.explained_variance_ratio_, 
    color  = "blue"
)
 
# Annotate each bar with its (rounded) explained variance ratio
for x, y in zip(np.arange(pca.n_components_) + 1, 
                pca.explained_variance_ratio_):
    label = round(y, 2)
    ax.annotate(
        label,
        (x, y),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center'
    )
 
ax.set_xticks(np.arange(pca.n_components_) + 1)
ax.set_ylim(0, 1.1)
ax.set_title('PCA Explained Variance Ratio')
ax.set_xlabel('Principal Components')
plt.show()


We can see that the first two components together explain about 55% of the variance. Adding the third and fourth principal components raises the cumulative explained variance to about 77%.
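The cumulative figures above can also be computed directly, a minimal sketch using np.cumsum:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

diabetes = load_diabetes()
Diabetes_scaled = StandardScaler().fit_transform(diabetes.data)

pca = PCA(n_components=4)
pca.fit(Diabetes_scaled)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative.round(2))
```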

Step 5: Create a New DataFrame Using Principal Components

Once the dimensionality of our main DataFrame has been reduced, we can create a new DataFrame containing only the principal components we want to keep. We can keep all four principal components:

pca_diabetes.head(6)


But we can also reduce it to two components:

df_new = pca_diabetes[['PC1', 'PC2']]
 
df_new.head(6)


Once we get our desired DataFrame, we can export it.
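For example, a minimal sketch of exporting the principal component scores to a CSV file (the file name pca_diabetes.csv is just an illustration):

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

diabetes = load_diabetes()
Diabetes_scaled = StandardScaler().fit_transform(diabetes.data)

PC = PCA(n_components=4).fit_transform(Diabetes_scaled)
pca_diabetes = pd.DataFrame(data=PC,
                            columns=['PC1', 'PC2', 'PC3', 'PC4'])

# Write the scores to disk without the row index
pca_diabetes.to_csv('pca_diabetes.csv', index=False)
```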

 

Video, Further Resources & Summary

Do you want to learn more about how to perform a PCA using scikit-learn? Then have a look at the following YouTube video from the Statistics Globe YouTube channel.

 

The YouTube video will be added soon.

 

You can also check some of the other tutorials available in Statistics Globe:

In this post, you learned how to perform a PCA using scikit-learn in Python. If you have any further questions, you can leave a comment below.

 

Paula Villasante Soriano, Statistician & R Programmer

This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.

 
