GroupBy pandas DataFrame in Python (2 Examples)
In this tutorial you’ll learn how to aggregate a pandas DataFrame by a group column in Python.
Here’s how to do it…
Example Data & Software Libraries
To be able to use the functions of the pandas library, we first need to import pandas to Python:
import pandas as pd # Import pandas library
The data below will be used as a basis for this Python programming tutorial:
data = pd.DataFrame({'x1':[6, 5, 3, 2, 5, 8, 9, 7, 2, 8],    # Create pandas DataFrame
                     'x2':range(9, 19),
                     'group1':['A', 'B', 'A', 'A', 'C', 'C', 'A', 'C', 'B', 'A'],
                     'group2':['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']})
print(data)                                                  # Print pandas DataFrame
Table 1 shows the structure of our example pandas DataFrame: It has ten rows and four columns. Two of these columns contain integers (i.e. x1 and x2), and two of these columns will be used to group our data set (i.e. group1 and group2).
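If you want to double-check the structure described above, you can inspect the shape and the numeric columns of the DataFrame directly. The code below is only a small sketch: it re-creates the example data so that it runs on its own, and the variable name numeric_cols is my own choice for illustration.

```python
import pandas as pd

# Re-create the example data of this tutorial
data = pd.DataFrame({'x1': [6, 5, 3, 2, 5, 8, 9, 7, 2, 8],
                     'x2': range(9, 19),
                     'group1': ['A', 'B', 'A', 'A', 'C', 'C', 'A', 'C', 'B', 'A'],
                     'group2': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']})

print(data.shape)                                      # (10, 4): ten rows, four columns

# Names of the columns that hold numbers (hypothetical helper variable)
numeric_cols = data.select_dtypes(include='number').columns.tolist()
print(numeric_cols)                                    # ['x1', 'x2']
```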
Example 1: GroupBy pandas DataFrame Based On One Group Column
In this example, I’ll demonstrate how to calculate certain summary statistics for a pandas DataFrame by group based on one grouping column.
For this task, we can use the groupby function. The following Python code returns the mean by group…
print(data.groupby('group1').mean(numeric_only=True))  # Get mean by group (numeric columns only)
#               x1         x2
# group1
# A       5.600000  13.000000
# B       3.500000  13.500000
# C       6.666667  14.333333
…the Python syntax below finds the sum by group…
print(data.groupby('group1').sum(numeric_only=True))   # Get sum by group (numeric columns only)
#         x1  x2
# group1
# A       28  65
# B        7  27
# C       20  43
…and the following syntax computes the variance by group (by default, pandas returns the sample variance, i.e. ddof=1):
print(data.groupby('group1').var(numeric_only=True))   # Get variance by group (numeric columns only)
#               x1         x2
# group1
# A       9.300000  12.500000
# B       4.500000  24.500000
# C       2.333333   2.333333
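A note on the variance: the var function of a grouped DataFrame divides by n - 1 by default (ddof=1), so the values above are sample variances. In case you need the population variance instead, you can pass ddof=0. The following sketch compares both versions. It re-creates the example data so that it runs on its own, and it selects the numeric columns x1 and x2 explicitly, which is my own choice to keep the text column group2 out of the calculation.

```python
import pandas as pd

# Re-create the example data of this tutorial
data = pd.DataFrame({'x1': [6, 5, 3, 2, 5, 8, 9, 7, 2, 8],
                     'x2': range(9, 19),
                     'group1': ['A', 'B', 'A', 'A', 'C', 'C', 'A', 'C', 'B', 'A'],
                     'group2': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']})

grouped = data.groupby('group1')[['x1', 'x2']]         # Group numeric columns by group1

sample_var = grouped.var()                             # Default ddof=1: divide by n - 1
population_var = grouped.var(ddof=0)                   # ddof=0: divide by n

print(sample_var.loc['A', 'x1'])                       # Sample variance of x1 in group A (about 9.3)
print(population_var.loc['A', 'x1'])                   # Population variance of x1 in group A (about 7.44)
```

For group A, the squared deviations of x1 sum to 37.2. Dividing by 4 gives the sample variance 9.3, and dividing by 5 gives the population variance 7.44.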
Example 2: GroupBy pandas DataFrame Based On Multiple Group Columns
In this example, I’ll demonstrate how to apply the groupby function to two different group variables simultaneously.
To accomplish this, we have to specify a list of group indicators within the groupby function.
Below, you can find the syntax to calculate the mean by multiple groups…
print(data.groupby(['group1', 'group2']).mean())       # Get mean by multiple groups
#                      x1         x2
# group1 group2
# A      a       3.666667  10.666667
#        b       8.500000  16.500000
# B      a       5.000000  10.000000
#        b       2.000000  17.000000
# C      a       5.000000  13.000000
#        b       7.500000  15.000000
…the sum by two groups…
print(data.groupby(['group1', 'group2']).sum())        # Get sum by multiple groups
#                x1  x2
# group1 group2
# A      a       11  32
#        b       17  33
# B      a        5  10
#        b        2  17
# C      a        5  13
#        b       15  30
…and the variance by multiple groups:
print(data.groupby(['group1', 'group2']).var())        # Get variance by multiple groups
#                      x1        x2
# group1 group2
# A      a       4.333333  2.333333
#        b       0.500000  4.500000
# B      a            NaN       NaN
#        b            NaN       NaN
# C      a            NaN       NaN
#        b       0.500000  2.000000
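The NaN values in the variance output have a simple reason: some combinations of group1 and group2 contain only a single row, and the sample variance (division by n - 1) is not defined for one observation. The sketch below counts the rows per group combination with the size function to make this visible. It re-creates the example data so that it runs on its own; the variable names sizes and variances are my own choices for illustration.

```python
import pandas as pd

# Re-create the example data of this tutorial
data = pd.DataFrame({'x1': [6, 5, 3, 2, 5, 8, 9, 7, 2, 8],
                     'x2': range(9, 19),
                     'group1': ['A', 'B', 'A', 'A', 'C', 'C', 'A', 'C', 'B', 'A'],
                     'group2': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']})

grouped = data.groupby(['group1', 'group2'])           # Group by two columns

sizes = grouped.size()                                 # Number of rows per group combination
variances = grouped.var()                              # Sample variance per group combination

print(sizes)                                           # B/a, B/b and C/a contain one row each

single_row_groups = sizes[sizes == 1].index.tolist()   # Combinations with a single row
print(single_row_groups)                               # [('B', 'a'), ('B', 'b'), ('C', 'a')]
print(variances.loc[single_row_groups])                # All NaN for these combinations
```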
Video, Further Resources & Summary
If you need further info on the Python code of this tutorial, I recommend watching the following video on my YouTube channel. In the video, I demonstrate the topics of this article:
The YouTube video will be added soon.
In addition, you could have a look at the related tutorials that I have published on my website.
- Max & Min by Group in Python
- Standard Deviation by Group in Python
- Calculate Mean by Group in Python
- Calculate Sum by Group in Python
- Slice pandas DataFrame by Index in Python
- Rename Columns of pandas DataFrame in Python
- Create Subset of Columns of pandas DataFrame in Python
- Rename Column of pandas DataFrame by Index in Python
- How to Use the pandas Library in Python
- Python Programming Examples
In summary: In this article, I have demonstrated how to aggregate the values of a pandas DataFrame by a group indicator in the Python programming language. In case you have further questions, please let me know in the comments section below.