GroupBy pandas DataFrame in Python (2 Examples)

 

In this tutorial, you’ll learn how to aggregate a pandas DataFrame by a group column in Python.

Here’s how to do it…

 

Example Data & Software Libraries

To be able to use the functions of the pandas library, we first need to import pandas to Python:

import pandas as pd                                        # Import pandas library

The data below will be used as a basis for this Python programming tutorial:

data = pd.DataFrame({'x1':[6, 5, 3, 2, 5, 8, 9, 7, 2, 8],  # Create pandas DataFrame
                     'x2':range(9, 19),
                     'group1':['A', 'B', 'A', 'A', 'C', 'C', 'A', 'C', 'B', 'A'],
                     'group2':['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']})
print(data)                                                # Print pandas DataFrame

 

Table 1: Example pandas DataFrame.

 

Table 1 shows the structure of our example pandas DataFrame: It has ten rows and four columns. Two of these columns contain integers (i.e. x1 and x2), and two of these columns will be used to group our data set (i.e. group1 and group2).
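
To double-check this structure in code, the following minimal sketch rebuilds the example data and inspects its shape and its numeric columns. The variable name numeric_cols is introduced here only for illustration:

```python
import pandas as pd

# Rebuild the example DataFrame from above
data = pd.DataFrame({'x1': [6, 5, 3, 2, 5, 8, 9, 7, 2, 8],
                     'x2': range(9, 19),
                     'group1': ['A', 'B', 'A', 'A', 'C', 'C', 'A', 'C', 'B', 'A'],
                     'group2': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']})

print(data.shape)                                          # (10, 4): ten rows, four columns

# Names of the numeric columns (illustrative helper variable)
numeric_cols = data.select_dtypes(include='number').columns.tolist()
print(numeric_cols)                                        # ['x1', 'x2']
```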

 

Example 1: GroupBy pandas DataFrame Based On One Group Column

In this example, I’ll demonstrate how to calculate certain summary statistics for a pandas DataFrame by group based on one grouping column.

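For this task, we can use the groupby function. Because the column group2 contains strings, we select the numeric columns x1 and x2 before aggregating; recent pandas versions no longer drop non-numeric columns automatically. The following Python code returns the mean by group…

print(data.groupby('group1')[['x1', 'x2']].mean())         # Get mean of x1 and x2 by group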
#               x1         x2
# group1                     
# A       5.600000  13.000000
# B       3.500000  13.500000
# C       6.666667  14.333333

…the Python syntax below finds the sum by group…

print(data.groupby('group1')[['x1', 'x2']].sum())          # Get sum of x1 and x2 by group
#         x1  x2
# group1        
# A       28  65
# B        7  27
# C       20  43

…and the following syntax computes the sample variance by group (pandas uses ddof=1 by default, i.e. it divides by n - 1):

print(data.groupby('group1')[['x1', 'x2']].var())          # Get sample variance of x1 and x2 by group
#               x1         x2
# group1                     
# A       9.300000  12.500000
# B       4.500000  24.500000
# C       2.333333   2.333333
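
If several statistics are needed at once, one option is the agg method, sketched below for the column x1. The sketch also contrasts the default sample variance (ddof=1) with the population variance (ddof=0). The variable names summary, var_sample, and var_pop are illustrative assumptions and not part of the code above:

```python
import pandas as pd

# Rebuild the example DataFrame from above
data = pd.DataFrame({'x1': [6, 5, 3, 2, 5, 8, 9, 7, 2, 8],
                     'x2': range(9, 19),
                     'group1': ['A', 'B', 'A', 'A', 'C', 'C', 'A', 'C', 'B', 'A'],
                     'group2': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']})

# Several summary statistics of x1 in a single call
summary = data.groupby('group1')['x1'].agg(['mean', 'sum', 'var'])
print(summary)

# Sample variance (default, divides by n - 1) vs. population variance (divides by n)
var_sample = data.groupby('group1')['x1'].var()            # Group A: 9.3
var_pop = data.groupby('group1')['x1'].var(ddof=0)         # Group A: 7.44
print(var_sample)
print(var_pop)
```

The population variance is always smaller than the sample variance for the same group, because the same sum of squared deviations is divided by n instead of n - 1.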

 

Example 2: GroupBy pandas DataFrame Based On Multiple Group Columns

In this example, I’ll demonstrate how to apply the groupby function to two different group variables simultaneously.

To accomplish this, we have to specify a list of group indicators within the groupby function.

Below, you can find the syntax to calculate the mean by multiple groups…

print(data.groupby(['group1', 'group2']).mean())           # Get mean by multiple groups
#                      x1         x2
# group1 group2                     
# A      a       3.666667  10.666667
#        b       8.500000  16.500000
# B      a       5.000000  10.000000
#        b       2.000000  17.000000
# C      a       5.000000  13.000000
#        b       7.500000  15.000000

…the sum by two groups…

print(data.groupby(['group1', 'group2']).sum())            # Get sum by multiple groups
#                x1  x2
# group1 group2        
# A      a       11  32
#        b       17  33
# B      a        5  10
#        b        2  17
# C      a        5  13
#        b       15  30

…and the variance by multiple groups:

print(data.groupby(['group1', 'group2']).var())            # Get variance by multiple groups
#                      x1        x2
# group1 group2                    
# A      a       4.333333  2.333333
#        b       0.500000  4.500000
# B      a            NaN       NaN
#        b            NaN       NaN
# C      a            NaN       NaN
#        b       0.500000  2.000000
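
The NaN values above appear because the sample variance divides by n - 1 and is therefore undefined for groups that contain only a single row. The following sketch makes this visible by counting the rows per group, and it also shows reset_index as one possible way to turn the group labels back into regular columns. The names keys, sizes, and var_flat are hypothetical:

```python
import pandas as pd

# Rebuild the example DataFrame from above
data = pd.DataFrame({'x1': [6, 5, 3, 2, 5, 8, 9, 7, 2, 8],
                     'x2': range(9, 19),
                     'group1': ['A', 'B', 'A', 'A', 'C', 'C', 'A', 'C', 'B', 'A'],
                     'group2': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']})

keys = ['group1', 'group2']

# Number of rows in each group combination; groups of size 1 yield NaN variances
sizes = data.groupby(keys).size()
print(sizes)

# Variance by multiple groups, with the group labels as ordinary columns
var_flat = data.groupby(keys).var().reset_index()
print(var_flat)
```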

 

Video, Further Resources & Summary

If you need further info on the Python code of this tutorial, I recommend watching the following video on my YouTube channel. In the video, I demonstrate the topics of this article:

 

The YouTube video will be added soon.

 

In addition, you could have a look at the related tutorials that I have published on my website.

 

In summary: In this article, I have demonstrated how to aggregate the values of a pandas DataFrame by a group indicator in the Python programming language. In case you have further questions, please let me know in the comments section below.

 
