How to Draw a plotly Boxplot in Python (Example)
This tutorial provides several examples of plotly boxplots using the Python programming language.
Note: This article was created in collaboration with Kirby White. Kirby is a Statistics Globe author, innovation consultant, and data science instructor. His Ph.D. is in Industrial-Organizational Psychology. You can read more about Kirby here!
Overview
Boxplots are one of the most fundamental statistical charts. Boxplots (sometimes called box and whisker plots) are designed to help us understand the distribution and symmetry of numerical variables. For instance, we could use a boxplot to show the age distribution in a certain country. The box would show the median as well as the 25th and 75th percentiles, and some variations would also visualize the min and max range or the outliers.
Boxplots display a wealth of information, but can appear complex and difficult to understand when you first encounter them! If you’re not familiar with the structure of boxplots yet, you may have a look here.
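To make these summary statistics concrete, here is a minimal sketch that computes the values a boxplot displays, using only Python's standard library. The ages list is made-up illustration data, and the exact quartile values depend on the interpolation method, so a plotting library may report slightly different numbers.

```python
import statistics

# Hypothetical ages, used only for illustration
ages = [22, 25, 29, 31, 34, 38, 41, 45, 52]

# quantiles(n=4) returns the three cut points: Q1, median, Q3
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

summary = {
    "min": min(ages),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(ages),
}
print(summary)
```

The box of a boxplot spans q1 to q3, with a line at the median.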
Modules and Example Data
If you have not already done so, install and load these packages:
from vega_datasets import data
import pandas as pd
import plotly.express as px
We’ll use the iris dataset for this example, which is included with the vega datasets. We’ll save this in a data frame called df.
df = pd.DataFrame(data.iris())
df
#    sepalLength  sepalWidth  petalLength  petalWidth species
# 0          5.1         3.5          1.4         0.2  setosa
# 1          4.9         3.0          1.4         0.2  setosa
# 2          4.7         3.2          1.3         0.2  setosa
Basic Boxplot
Let’s create a simple boxplot to see the distribution of sepal lengths among these flowers:
fig1 = px.box(
    data_frame = df
    ,y = 'sepalLength'
)
fig1.show()
A wonderful feature of the plotly library is the hover info. Try to hover your cursor over the graphic to see what the lines and boxes show in this plot.
This graph shows us the distribution for all the sepals measured in this sample, but it would be more helpful to create a separate box for each species of iris so that we can compare the lengths across species. We can do that by mapping the species variable to the x-axis:
fig2 = px.box(
    data_frame = df
    ,y = 'sepalLength'
    ,x = 'species'
)
fig2.show()
We can see some clear differences between the boxes in our graph! It appears that the virginica species tends to have the longest sepals, but also has a lot of variation. The dot below the virginica box indicates that this particular data point is likely an outlier (i.e., extremely high or low).
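A common convention for flagging such points is Tukey's rule: anything more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile is treated as an outlier. The sketch below applies that rule to a small made-up sample. The function name find_outliers and the values are hypothetical, and plotly's own quartile calculation may differ slightly from the one used here.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values lying outside the k * IQR fences (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Made-up sepal lengths, for illustration only
sample = [5.8, 6.3, 6.5, 6.7, 6.9, 7.2, 7.7, 4.9, 6.4]
outliers = find_outliers(sample)
print(outliers)
```

Only the value 4.9 falls outside the fences in this sample, so it is the only point that would be drawn as a separate dot.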
Adding Color
To aid in our comprehension, it can be helpful to use a different color for each species:
fig3 = px.box(
    data_frame = df
    ,y = 'sepalLength'
    ,x = 'species'
    ,color = 'species'
)
fig3.show()
Grouped Boxplot
You can sometimes have multiple values to plot within each group. Plotly prefers that your data be structured in a “long” format for this, so let’s create a second data frame called df_long:
# only keeping three fields from the original data
df_long = df[['species', 'sepalWidth', 'sepalLength']].set_index('species').stack().reset_index()
df_long.columns = ['species', 'attribute', 'value']
df_long
#   species    attribute  value
# 0  setosa   sepalWidth    3.5
# 1  setosa  sepalLength    5.1
# 2  setosa   sepalWidth    3.0
# 3  setosa  sepalLength    4.9
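As an alternative to the stack() approach, pandas also offers the melt() method for reshaping wide data to long format. The following is a minimal sketch on a three-row stand-in data frame. The names df_small and df_small_long and the values are for illustration only.

```python
import pandas as pd

# A three-row stand-in for the iris data frame, using the same column names
df_small = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica"],
    "sepalWidth": [3.5, 3.2, 3.3],
    "sepalLength": [5.1, 7.0, 6.3],
})

# melt() turns the two measurement columns into attribute/value pairs
df_small_long = df_small.melt(
    id_vars="species",
    value_vars=["sepalWidth", "sepalLength"],
    var_name="attribute",
    value_name="value",
)
print(df_small_long)
```

With melt(), the rows come out grouped by attribute rather than interleaved, but the resulting columns are the same as with the stack() version.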
Let’s see how we can display the width and length of the sepals with this data:
fig4 = px.box(
    data_frame = df_long
    ,y = 'value'
    ,x = 'species'
    ,color = 'attribute'
)
fig4.show()
Adding Detail
One critique of boxplots is that they over-summarize the data and may unintentionally mask some details in the underlying data. An easy trick to avoid this is to also include a scatterplot adjacent to each box. This shows much more detail by including each record in addition to the summary provided by the boxplot. Each dot’s position along the y-axis is accurate, while any variation along the x-axis is simply to avoid overlapping the data points. On its own, this type of plot is called a jitter plot or a strip plot.
fig5 = px.box(
    data_frame = df_long
    ,y = 'value'
    ,x = 'species'
    ,color = 'attribute'
    ,points = 'all'
)
fig5.show()
Notched Boxplots
Occasionally, you may be interested in the confidence intervals around the median for each group. This is mostly used by researchers looking for statistically significant differences between groups, and should only be shown to a technical audience. Nevertheless, plotly makes it easy to include “notches” with each box:
fig6 = px.box(
    data_frame = df
    ,y = 'sepalLength'
    ,x = 'species'
    ,color = 'species'
    ,notched = True
)
fig6.show()
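To give a sense of what a notch represents, the sketch below uses a widely cited approximation for a 95% interval around the median: median ± 1.57 × IQR / √n. That plotly draws its notches with exactly this formula is an assumption here, and the helper name notch_interval and the sample values are hypothetical.

```python
import math
import statistics

def notch_interval(values):
    """Approximate 95% interval around the median: median +/- 1.57 * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width

# Made-up measurements, for illustration only
sample = [22, 25, 29, 31, 34, 38, 41, 45, 52]
lower, upper = notch_interval(sample)
print(round(lower, 2), round(upper, 2))
```

When the notches of two boxes do not overlap, this is often read as a hint that the medians of the two groups differ.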
Other Customizations
Horizontal Orientation
If you wish to change the orientation so that the boxes run horizontally, you can flip the x and y arguments:
fig7 = px.box(
    data_frame = df
    ,x = 'sepalLength'
    ,y = 'species'
    ,color = 'species'
)
fig7.show()
Custom Colors
You can also specify the exact colors to use for each box by passing a dictionary of group:color pairs to the color_discrete_map argument. You can use the name of most colors, or specify a HEX or RGB code, as shown here:
fig8 = px.box(
    data_frame = df
    ,y = 'sepalLength'
    ,x = 'species'
    ,color = 'species'
    ,color_discrete_map = {
        "setosa": "red",
        "versicolor": "#1d61cf",
        "virginica": "rgb(20, 150, 96)"
    }
)
fig8.show()
Changing the Box Order
Finally, you can specify the order in which the boxes are displayed with the category_orders argument. This is a dictionary that specifies the name of the column as the key, paired with a list of groups in the order you want them shown:
fig9 = px.box(
    data_frame = df
    ,y = 'sepalLength'
    ,x = 'species'
    ,color = 'species'
    ,color_discrete_map = {
        "setosa": "red",
        "versicolor": "#1d61cf",
        "virginica": "rgb(20, 150, 96)"
    }
    ,category_orders = {"species": ["versicolor", "virginica", "setosa"]}
)
fig9.show()
Further Resources
You may have a look at these other articles for more detailed examples and videos of popular charts in plotly:
- plotly Barplot in Python
- plotly Histogram in Python
- plotly Line Plot in Python
- plotly Scatterplot in Python
- Introduction to the plotly Package in R
- Introduction to Python