Visualization of PCA in Python (Examples)

 

In this tutorial, you’ll learn how to visualize your Principal Component Analysis (PCA) in Python.


Paula Villasante Soriano Statistician & R Programmer
This page was created in collaboration with Paula Villasante Soriano and Cansu Kebabci. Please have a look at Paula’s and Cansu’s author pages to get further information about their academic backgrounds and the other articles they have written for Statistics Globe.
Rana Cansu Kebabci Statistician & Data Scientist

 

Let’s take a look at the ways to visualize a PCA in the Python programming language.

 

Data Sample and Add-On Libraries

The first step in this tutorial is to import the libraries needed for the analysis. This can be accomplished by running the code below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine

Next, we will load the data used in the examples. In this tutorial, we will use the wine data set from the scikit-learn library. This dataset is composed of 178 rows and 13 feature columns, plus a classification target array that encodes the type of wine with the values 0, 1 and 2.

We will use the load_wine() function to load our dataset. Then, we will convert it into a pandas DataFrame, in which the columns are named by the feature names of the wine dataset, using the pd.DataFrame() function.

wine = load_wine()
DF_data = pd.DataFrame(wine.data,
                       columns = wine.feature_names)

As the next step, we can see what the first five rows and the first five columns of our data look like using the iloc indexer and the head() function.

DF_data.iloc[:, 0:5].head(5)
 
#    alcohol  malic_acid   ash  alcalinity_of_ash  magnesium
# 0    14.23        1.71  2.43               15.6      127.0
# 1    13.20        1.78  2.14               11.2      100.0
# 2    13.16        2.36  2.67               18.6      101.0
# 3    14.37        1.95  2.50               16.8      113.0
# 4    13.24        2.59  2.87               21.0      118.0
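To double-check the dimensions mentioned above, a short sketch like the following could be used. It reloads the data so that it runs on its own; the names n_rows, n_cols and class_counts are only illustrative and are not used elsewhere in this tutorial.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
DF_data = pd.DataFrame(wine.data, columns=wine.feature_names)

# Number of rows and feature columns in the data set
n_rows, n_cols = DF_data.shape

# Number of observations per wine class (0, 1 and 2)
class_counts = pd.Series(wine.target).value_counts().sort_index()

print(n_rows, n_cols)
print(class_counts)
```

The class counts are worth a look before plotting, since strongly unbalanced classes can dominate a scatterplot.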

 

Perform PCA

Before performing a PCA, we need to standardize our data so that all variables are on a comparable scale. You can visit our tutorial Principal Component Analysis in Python for further information on how to implement PCA in Python. For the sake of demonstration, we will retain a relatively large number of components, namely 6. You can check How to Choose the Optimal Number of Components if you want to learn more about selecting an optimal number of components in PCA. Let’s code it!

scaler = StandardScaler()
 
scaler.fit(DF_data)
 
Wine_scaled = scaler.transform(DF_data)
 
pca = PCA(n_components=6)
 
pca.fit(Wine_scaled)
 
pca_wine = pca.transform(Wine_scaled)
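As a sanity check, one could verify that the standardized columns have a mean of about 0 and a standard deviation of about 1, and that the score matrix has one column per retained component. The sketch below is self-contained and uses fit_transform() as a shortcut for the separate fit() and transform() calls; the names X_scaled, pca_check and scores are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# fit_transform() combines the fit() and transform() steps
X_scaled = StandardScaler().fit_transform(X)

# Column-wise mean and population standard deviation (ddof=0)
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

# Same number of components as in the tutorial
pca_check = PCA(n_components=6)
scores = pca_check.fit_transform(X_scaled)

print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
print(scores.shape)
```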

Let’s see how we can visualize our PCA in Python!

 

Visualization of Observations

After a PCA, the observations are expressed in principal component scores. Therefore, it is important to visualize the spread of the data along the new axes (principal components) to interpret the relations in the dataset. For further information on transforming data to a new coordinate system via PCA, see our extensive tutorial PCA Explained.
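To make the idea of component scores more concrete, here is a minimal sketch showing that, for a scikit-learn PCA without whitening, the scores are the centered data projected onto the component vectors. The names pca_demo and manual_scores are hypothetical and only serve this illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

pca_demo = PCA(n_components=6).fit(X_scaled)

# Scores as returned by scikit-learn
scores = pca_demo.transform(X_scaled)

# Scores computed by hand: center the data, then project it
# onto the principal axes stored row-wise in components_
manual_scores = (X_scaled - pca_demo.mean_) @ pca_demo.components_.T

print(np.allclose(scores, manual_scores))
```

Each column of the resulting matrix is one new axis, which is why the plots in the following sections simply pick columns 0, 1 and 2.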

This data spread, also called a point cloud, can be visualized in 2D or 3D. Let’s see in the next section how to do it in 2D space!

 

2D Scatterplot

One might be interested in visualizing the observations in terms of two principal components, which can be achieved by drawing a scatterplot. To do so, first, we need to extract the first two components as follows.

PC1 = pca_wine[:,0]
PC2 = pca_wine[:,1]

In addition, if a classification target is available, as in this case, you might want to show the classes on the plot as well. In such a case, the classification target and the class names should be extracted and stored as follows.

target = wine.target
label = wine.target_names

And now, we can create a pandas DataFrame, which contains the first two principal components, the classification target and the labels, to be used in the scatterplot. As the first step, a list that stores the classification labels is formed via a for loop. Then the zip() and list() functions are used to aggregate the data. In the final step, the pd.DataFrame() function is called to create a DataFrame.

labels=[]
 
for points in target:
    labels.append(label[points])
 
zipped = list(zip(PC1, 
                  PC2, 
                  target,
                  labels))
 
pc_df = pd.DataFrame(zipped, 
                     columns=['PC1', 
                              'PC2', 
                              'Target',
                              'Label'])

Let’s now call the head() function to see what the first rows look like!

pc_df.head(6)
 
#         PC1       PC2  Target    Label
# 0  3.316751 -1.443463       0  class_0
# 1  2.209465  0.333393       0  class_0
# 2  2.516740 -1.031151       0  class_0
# 3  3.757066 -2.756372       0  class_0
# 4  1.008908 -0.869831       0  class_0
# 5  3.050254 -2.122401       0  class_0
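The loop and zip() approach works well, but the same table could also be built in one step. The following sketch assumes that wine.target_names is a NumPy array, so it can be indexed directly with the target array; the name pc_df_alt is hypothetical and chosen to avoid overwriting pc_df.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
scores = PCA(n_components=6).fit_transform(
    StandardScaler().fit_transform(wine.data)
)

# Build the DataFrame from a dictionary of columns;
# target_names[target] maps each class code to its name
pc_df_alt = pd.DataFrame({
    "PC1": scores[:, 0],
    "PC2": scores[:, 1],
    "Target": wine.target,
    "Label": wine.target_names[wine.target],
})

print(pc_df_alt.head())
```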

Now, we can create a scatterplot using the previously defined pc_df. We will use the scatterplot() function from the seaborn package.

plt.figure(figsize=(12,7))
 
sns.scatterplot(data=pc_df, 
                x="PC1", 
                y="PC2", 
                hue="Label")
 
plt.title("Figure 1: Scatter Plot",
          fontsize=16)
plt.xlabel('First Principal Component',
           fontsize=16)
plt.ylabel('Second Principal Component',
           fontsize=16)
plt.show()

[Figure 1: Scatter Plot]

If you want to learn more about how to draw a 2D scatterplot of PCA in Python and what the function arguments stand for, see our tutorial: Scatterplot of PCA in Python.
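As a variation, the axis labels could also report how much variance each component explains, which often helps when reading the plot. The sketch below is only one possible way to do this: it uses matplotlib alone instead of seaborn, and the names pca_demo, scores and pct are assumptions for this example.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
pca_demo = PCA(n_components=6)
scores = pca_demo.fit_transform(StandardScaler().fit_transform(wine.data))

# Share of variance explained by each component, in percent
pct = pca_demo.explained_variance_ratio_ * 100

fig, ax = plt.subplots(figsize=(12, 7))

# One scatter call per class, so that each class gets its own legend entry
for cls, name in enumerate(wine.target_names):
    mask = wine.target == cls
    ax.scatter(scores[mask, 0], scores[mask, 1], label=str(name))

ax.set_xlabel("PC1 ({:.1f}%)".format(pct[0]), fontsize=16)
ax.set_ylabel("PC2 ({:.1f}%)".format(pct[1]), fontsize=16)
ax.legend(title="Label")
```

Calling plt.show() afterwards displays the figure in an interactive session.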

 

3D Scatterplot

To visualize the observations in 3D space in terms of component scores, one needs to extract the first three principal components: PC1, PC2, and PC3. PC1 and PC2 have already been defined in the previous section. Let’s run the same line of code for the third principal component this time!

PC3 = pca_wine[:,2]

Now, we can draw a 3D scatterplot using fig.add_subplot(111, projection='3d') and the ax.scatter() function of matplotlib. To color each point depending on its class, like in the 2D case, we will use the previously defined label and target arrays. Let’s now draw the scatterplot in 3D!

fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111, 
                     projection='3d')
 
for cls in np.unique(target):
    ix = np.where(target == cls)
    ax.scatter(PC1[ix], 
               PC2[ix], 
               PC3[ix],
               label=label[cls])
 
ax.set_xlabel("PC1", 
              fontsize=12)
ax.set_ylabel("PC2", 
              fontsize=12)
ax.set_zlabel("PC3", 
              fontsize=12)
 
ax.view_init(30, 125)
ax.legend()
plt.title("Figure 2: 3D Plot",
          fontsize=16)
plt.show()

[Figure 2: 3D Plot]

You can also check our tutorial Draw 3D Plot of PCA in Python to see another example of plotting a 3D scatterplot for a PCA.

 

Visualization of Explained Variance

Visualizing the explained variance per principal component is useful for deciding on the ideal number of components to retain in the analysis. Scree plots are the standard tool for this kind of visualization in PCA and factor analysis.
 

Scree Plot

To plot a scree plot, first, we will create an array containing the principal component numbers via np.arange(pca.n_components_). As the array starts from 0, we will add 1 to it so that the x-axis values start from 1. See the printed result below.

PC_values = np.arange(pca.n_components_) + 1
PC_values
 
# array([1, 2, 3, 4, 5, 6])

Now we can plot the principal components versus their explained variance by accessing the explained variance ratios via the pca.explained_variance_ratio_ attribute and using the defined array PC_values. We will use the plot() function of matplotlib for the implementation.

plt.plot(PC_values, 
         pca.explained_variance_ratio_, 
         'ro-')
plt.title('Figure 3: Scree Plot')
plt.xlabel('Principal Components')
plt.ylabel('Proportion of Variance Explained')
plt.show()

[Figure 3: Scree Plot]

Figure 3 shows a scree plot for the first six principal components of our PCA. If you want to learn more about how to interpret a scree plot and how to implement it in Python, check our tutorials: Scree Plot for PCA Explained and Scree Plot in Python.
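Besides the per-component ratios, the cumulative explained variance is often inspected as well. The following sketch fits a PCA with all components and looks up how many are needed to reach a chosen share of the variance. The 80% threshold and the names pca_full, cumulative and n_keep are assumptions for illustration, not a recommendation of this tutorial.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Keep all components so that the cumulative share ends at 1
pca_full = PCA().fit(X_scaled)

# Running total of the explained variance ratios
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Hypothetical threshold: share of variance we want to retain
threshold = 0.80

# Index of the first cumulative value >= threshold, converted to a count
n_keep = int(np.searchsorted(cumulative, threshold)) + 1

print(cumulative.round(3))
print(n_keep)
```

The cumulative values could be passed to plt.plot() together with the component numbers to draw them next to the scree plot.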

 

Visualization of Component-Variable Relation

In order to understand the relation between the principal components and the original variables, a visual that displays both elements is needed. Biplots are generally used for this purpose. They enable the user to grasp what the components represent and each variable’s share in these representations.
 

Biplot

To plot a biplot, we will first define a biplot function to combine the component scores, loading vectors and variable names. To draw all these elements, we will use the plt.scatter(), plt.arrow() and plt.text() functions. In the function, first, the component scores will be rescaled, as the loadings and the component scores are on different scales. Then the observations will be plotted via the plt.scatter() function, and the loading vectors and variable names will be added using the plt.arrow() and plt.text() functions in a for loop.

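def biplot(score, coef, labels=None):
 
    # Use the first two columns of the score matrix passed to the function
    xs = score[:, 0]
    ys = score[:, 1]
    n = coef.shape[0]
 
    # Rescale the scores so that they are comparable to the loadings
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
 
    # target is the class array defined earlier in this tutorial
    plt.scatter(xs * scalex,
                ys * scaley, 
                c = target)
 
    for i in range(n):
        plt.arrow(0, 0, 
                  coef[i,0], 
                  coef[i,1],
                  color = 'red',
                  alpha = 0.5)
 
        # Fall back to generic names if no labels are given
        name = labels[i] if labels is not None else "Var{}".format(i + 1)
 
        plt.text(coef[i,0] * 1.15, 
                 coef[i,1] * 1.15, 
                 name, 
                 color = 'darkgreen', 
                 ha = 'center', 
                 va = 'center')
 
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))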

Now we can call our defined function biplot() to plot the graph. You can see the function inputs are the component scores, loading vectors and variable names, as shown below.

plt.figure(figsize=(12,7))
plt.title("Figure 4: Biplot",
          fontsize=16)
biplot(pca_wine, 
       np.transpose(pca.components_), 
       list(wine.feature_names))
plt.show()

[Figure 4: Biplot]

In Figure 4 you can see the biplot visualizing our PCA. If you need more information to understand biplots and implement them in Python, please check our tutorials: Biplot for PCA Explained and Draw Biplot of PCA in Python.
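When the arrows in a biplot overlap, it can help to look at the underlying numbers in a table. The sketch below arranges the transposed pca.components_ matrix, that is, the loading vectors passed to the biplot, into a pandas DataFrame with one row per variable. The names loadings and top_pc1 are hypothetical and only used here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
pca_demo = PCA(n_components=6).fit(StandardScaler().fit_transform(wine.data))

# Rows: original variables, columns: principal components
loadings = pd.DataFrame(
    pca_demo.components_.T,
    index=wine.feature_names,
    columns=["PC{}".format(i + 1) for i in range(pca_demo.n_components_)],
)

# Variables with the largest absolute weight on the first component
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(3)

print(loadings.round(3))
print(top_pc1)
```

Note that scikit-learn stores unit-length component vectors, so the squared entries of each column sum to 1.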

 

Video, Further Resources & Summary

This post has shown different ways to visualize PCA in Python. In case you have further questions, you may leave a comment below.

 
