How to Compute Summary Statistics by Group in R (3 Examples)
This page shows how to calculate descriptive statistics by group in R.
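The article first constructs some example data and then walks through three examples, based on the tapply function, the dplyr package, and the purrr package.
If you want to know more about these topics, keep reading!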
Construction of Example Data
First, we’ll need to create some example data:
set.seed(549298)                               # Create example data
data <- data.frame(x = rnorm(500, 1, 3),
                   group = LETTERS[1:5],
                   stringsAsFactors = TRUE)    # Keep group a factor in R >= 4.0, as in the outputs below
head(data)                                     # Print head of example data
#             x group
# 1  0.38324291     A
# 2 -0.06604541     B
# 3 -1.98454741     C
# 4  3.44815045     D
# 5  4.11107771     E
# 6  4.07278357     A
Have a look at the previous output of the RStudio console. It shows that our example data has two columns. The variable x contains 500 random values drawn from a normal distribution with mean 1 and standard deviation 3, and the variable group contains five different grouping labels (A to E).
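Because the five labels in LETTERS[1:5] are shorter than the 500 values of x, data.frame recycles them row by row. The following minimal sketch recreates the example data and counts the rows per label to check the group sizes:

```r
# Recreate the example data from above
set.seed(549298)
data <- data.frame(x = rnorm(500, 1, 3),
                   group = LETTERS[1:5])

# Count the number of rows per group label
group_sizes <- table(data$group)
group_sizes
#   A   B   C   D   E
# 100 100 100 100 100
```

Each group therefore contains exactly 100 observations.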
We could return descriptive statistics of our numeric data column x using the summary function as shown below:
summary(data$x)                                # Summary of entire data
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  -7.765  -1.045   1.115   1.117   3.151  10.216
However, this would only return the summary statistics of the whole data. In the following examples, I’ll therefore show different ways to get summary statistics for each group of our data.
Keep on reading!
Example 1: Descriptive Summary Statistics by Group Using tapply Function
In this example, I’ll show how to use the basic installation of the R programming language to return descriptive summary statistics by group. More precisely, I’m using the tapply function:
tapply(data$x, data$group, summary)            # Summary by group using tapply
# $A
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  -7.236  -1.161   1.530   1.339   3.834   8.747
#
# $B
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  -7.148  -1.002   0.944   1.037   3.004  10.216
#
# $C
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  -6.636  -1.282   1.340   1.030   2.956   8.667
#
# $D
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# -7.7652 -1.2207  0.7849  0.7280  2.3334  8.3459
#
# $E
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# -5.4817 -0.3648  1.5931  1.4498  3.3325  7.6403
The output of the previous R syntax is a list containing one list element for each group. Each of these list elements contains basic summary statistics for the corresponding group.
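If a single table is easier to work with than a list, the per-group summaries can be stacked into a matrix. The sketch below is one possible way to do this in base R; note that it swaps tapply for split and sapply, and the object name summary_matrix is only chosen for illustration:

```r
# Recreate the example data from above
set.seed(549298)
data <- data.frame(x = rnorm(500, 1, 3),
                   group = LETTERS[1:5])

# split() returns one numeric vector per group; sapply() binds the six
# summary values of each group into a matrix (statistics in rows), and
# t() flips the matrix so that each group becomes a row
summary_matrix <- t(sapply(split(data$x, data$group), summary))

summary_matrix                  # One row per group, one column per statistic
summary_matrix["D", "Median"]   # Extract a single value, here the median of group D
```

A matrix like this can be indexed by group and statistic name, which is often more convenient than digging into list elements.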
Example 2: Descriptive Summary Statistics by Group Using dplyr Package
Another alternative for the computation of descriptive summary statistics is provided by the dplyr package.
First, we have to install and load the dplyr package:
install.packages("dplyr")                      # Install dplyr package
library("dplyr")                               # Load dplyr package
Now, we can apply the group_by and summarize functions to calculate summary statistics by group:
data %>%                                       # Summary by group using dplyr
  group_by(group) %>%
  summarize(min = min(x),
            q1 = quantile(x, 0.25),
            median = median(x),
            mean = mean(x),
            q3 = quantile(x, 0.75),
            max = max(x))
# # A tibble: 5 x 7
#   group   min     q1 median  mean    q3   max
#   <fct> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
# 1 A     -7.24 -1.16   1.53  1.34   3.83  8.75
# 2 B     -7.15 -1.00   0.944 1.04   3.00 10.2
# 3 C     -6.64 -1.28   1.34  1.03   2.96  8.67
# 4 D     -7.77 -1.22   0.785 0.728  2.33  8.35
# 5 E     -5.48 -0.365  1.59  1.45   3.33  7.64
The output of the previous R code is a tibble that contains basically the same values as the list created in Example 1. Whether you prefer to use the basic installation or the dplyr package is a matter of taste.
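One practical difference is that summarize lets you choose exactly which statistics to report. As a minimal sketch (the column names n, mean, and sd and the object name group_stats are my own choice and not part of the original example), the following code returns the group size, mean, and standard deviation instead of the six summary values:

```r
library(dplyr)

# Recreate the example data from above
set.seed(549298)
data <- data.frame(x = rnorm(500, 1, 3),
                   group = LETTERS[1:5])

group_stats <- data %>%
  group_by(group) %>%
  summarize(n = n(),            # Number of observations per group
            mean = mean(x),     # Group mean
            sd = sd(x))         # Group standard deviation

group_stats                     # Tibble with one row per group
```

Any function that returns a single value per group can be added to the summarize call in the same way.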
Example 3: Descriptive Summary Statistics by Group Using purrr Package
In Example 3, I’ll illustrate another alternative for the calculation of summary statistics by group in R.
This example relies on the functions of the purrr package (another add-on package provided by the tidyverse).
We first have to install and load the purrr package:
install.packages("purrr")                      # Install & load purrr
library("purrr")
Now, we can use the following R code to produce another kind of output showing descriptive stats by group:
data %>%                                       # Summary by group using purrr
  split(.$group) %>%
  map(summary)
# $A
#        x          group
#  Min.   :-7.236   A:100
#  1st Qu.:-1.161   B:  0
#  Median : 1.530   C:  0
#  Mean   : 1.339   D:  0
#  3rd Qu.: 3.834   E:  0
#  Max.   : 8.747
#
# $B
#        x          group
#  Min.   :-7.148   A:  0
#  1st Qu.:-1.002   B:100
#  Median : 0.944   C:  0
#  Mean   : 1.037   D:  0
#  3rd Qu.: 3.004   E:  0
#  Max.   :10.216
#
# $C
#        x          group
#  Min.   :-6.636   A:  0
#  1st Qu.:-1.282   B:  0
#  Median : 1.340   C:100
#  Mean   : 1.030   D:  0
#  3rd Qu.: 2.956   E:  0
#  Max.   : 8.667
#
# $D
#        x           group
#  Min.   :-7.7652   A:  0
#  1st Qu.:-1.2207   B:  0
#  Median : 0.7849   C:  0
#  Mean   : 0.7280   D:100
#  3rd Qu.: 2.3334   E:  0
#  Max.   : 8.3459
#
# $E
#        x           group
#  Min.   :-5.4817   A:  0
#  1st Qu.:-0.3648   B:  0
#  Median : 1.5931   C:  0
#  Mean   : 1.4498   D:  0
#  3rd Qu.: 3.3325   E:100
#  Max.   : 7.6403
Again, the values are basically the same. Because summary is applied to the entire data frame of each group, the output additionally shows the label counts of the group column.
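If only the numeric column is of interest, the summary can be restricted to x within each group. The following sketch assumes the same example data and passes the split data directly to map instead of using the pipe; the object names x_summaries and group_means are only chosen for illustration:

```r
library(purrr)

# Recreate the example data from above
set.seed(549298)
data <- data.frame(x = rnorm(500, 1, 3),
                   group = LETTERS[1:5])

# Summarize only column x within each group (returns a named list)
x_summaries <- map(split(data, data$group), ~ summary(.x$x))
x_summaries$A                   # Summary of x for group A

# map_dbl() returns a named numeric vector instead of a list,
# which is handy when a single statistic per group is needed
group_means <- map_dbl(split(data$x, data$group), mean)
group_means
```

The typed variants of map, such as map_dbl, throw an error if a function does not return exactly one value of the expected type, which can help to catch mistakes early.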
Video, Further Resources & Summary
I also explain the topics of this article in a video on my YouTube channel.
In addition, I recommend having a look at the other tutorials on this homepage.
In this article, I showed how to get a summary statistics table for each group of a data frame in the R programming language. Don’t hesitate to let me know in the comments section if you have further questions or comments.
4 Comments
Thanks for the tutorial! Just a small note: in the summary by group using dplyr, the function should be ‘summarise’ (with S) instead of ‘summarize’ (with Z).
Hey Giuliana,
Thank you for the kind comment! summarise and summarize are treated the same, though. Have a look here for more details.
Regards,
Joachim
thanks again
You are very welcome Andre! 🙂