Mean by Group in R (2 Examples) | dplyr Package vs. Base R

In this tutorial you’ll learn how to compute the mean by group in the R programming language.

I’ll show two alternatives, each with reproducible R code.

Let’s dig into it!

Example Data

For the following examples, I’m going to use the Iris Flower data set. Let’s load the data into R:

data(iris)                                      # Load Iris data
head(iris)                                      # First rows of Iris

Table 1: The Iris Data Matrix.

As you can see based on Table 1, the Iris Flower data contains four numeric columns as well as the grouping factor column Species.
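
This structure can also be checked directly in R. The following is a minimal sketch using only base R; the object names col_classes and group_sizes are chosen for illustration only:

```r
data(iris)

col_classes <- sapply(iris, class)              # Class of each column
col_classes

group_sizes <- table(iris$Species)              # Number of rows per group
group_sizes
```

The first call should report four numeric columns and one factor column, and the second shows how many rows belong to each species.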

Next, I’ll show you how to calculate the average for each of these groups. Keep on reading!

Example 1: Compute Mean by Group in R with aggregate Function

The first example shows how to calculate the mean per group with the aggregate function.

We can compute the mean for each species factor level of the Iris Flower data by applying the aggregate function as follows:

aggregate(x = iris$Sepal.Length,                # Specify data column
          by = list(iris$Species),              # Specify group indicator
          FUN = mean)                           # Specify function (i.e. mean)
 
#    Group.1     x
#     setosa 5.006
# versicolor 5.936
#  virginica 6.588

The RStudio console output shows the mean by group: The setosa group has a mean of 5.006, the versicolor group has a mean of 5.936, and the virginica group has a mean of 6.588.
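
As a possible alternative, aggregate also offers a formula interface, which keeps the original column names in the output instead of Group.1 and x. A minimal sketch, where the object names mean_by_species and mean_all_columns are chosen for illustration only:

```r
data(iris)

# Formula interface: response column ~ grouping column
mean_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
mean_by_species
#      Species Sepal.Length
# 1     setosa        5.006
# 2 versicolor        5.936
# 3  virginica        6.588

# The dot stands for all remaining columns of the data frame
mean_all_columns <- aggregate(. ~ Species, data = iris, FUN = mean)
mean_all_columns
```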

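Note: By replacing the FUN argument of the aggregate function, we can also compute other metrics such as the median, the variance, or the standard deviation. The statistical mode requires a user-defined function, because the base R function mode() returns the storage type of an object rather than its most frequent value.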
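
The following sketch illustrates how the FUN argument might be swapped. Base R does not ship a function for the statistical mode, so the helper stat_mode below is a hypothetical function written only for this illustration:

```r
data(iris)

# Median and standard deviation by group: only FUN changes
aggregate(x = iris$Sepal.Length, by = list(iris$Species), FUN = median)
aggregate(x = iris$Sepal.Length, by = list(iris$Species), FUN = sd)

# Hypothetical helper: most frequent value (the smallest one in case of ties)
stat_mode <- function(v) {
  counts <- table(v)                            # Frequency of each value
  as.numeric(names(counts)[which.max(counts)])
}

mode_by_species <- aggregate(x = iris$Sepal.Length,
                             by = list(iris$Species),
                             FUN = stat_mode)
mode_by_species
```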

Example 2: Compute Mean by Group with dplyr Package

It’s definitely a matter of taste, but many people prefer to use the dplyr package to compute descriptive statistics such as the mean. This example shows how to get the mean by group using the functions of the dplyr package.

Let’s install and load the dplyr package to R:

install.packages("dplyr")                       # Install dplyr package
library("dplyr")                                # Load dplyr package

Now, we can use all the functions of the dplyr package – in our case group_by and summarise_at:

iris %>%                                        # Specify data frame
  group_by(Species) %>%                         # Specify group indicator
  summarise_at(vars(Sepal.Length),              # Specify column
               list(name = mean))               # Specify function
 
# A tibble: 3 x 2
#   Species     name
#   <fct>      <dbl>
# 1 setosa      5.01
# 2 versicolor  5.94
# 3 virginica   6.59

The output of the previous R syntax is a tibble instead of a data.frame. However, the results are the same as in Example 1.
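
In dplyr 1.0.0 and later, scoped verbs such as summarise_at are superseded by summarise in combination with across. The following is a minimal sketch of that style, assuming dplyr 1.0.0 or newer is installed; the names mean_dplyr, mean_across, mean_base, and mean_sepal_length are chosen for illustration only:

```r
library("dplyr")

mean_dplyr <- iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))   # One named summary column

# across() applies the same function to several columns at once
mean_across <- iris %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Sepal.Width), mean))

# The stored values are not rounded; they match the aggregate() result
mean_base <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
all.equal(mean_dplyr$mean_sepal_length, mean_base$Sepal.Length)
```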

Video, Further Resources & Summary

On the Statistics Globe YouTube channel, you can also find a tutorial video, where I explain the content of this tutorial in some more detail.

This tutorial illustrated how to compute the means for certain data frame subsets (i.e. groups) in the R programming language. If you want to learn more about the theoretical concept of the mean, I can recommend the mathantics YouTube channel.

Furthermore, you could also have a look at some of the related R tutorials that I have published on my website.

I hope you found the tutorial helpful. If you have any questions or comments, don’t hesitate to let me know below.

26 Comments

Ruben Angulo
July 24, 2019 5:27 pm

After I used the dplyr option, I got this warning message:

funs() is soft deprecated as of dplyr 0.8.0
please use list() instead
# Before:
funs(name = f(.))
# After:
list(name = ~ f(.))

So, maybe you could update the example code as:

iris%>%
group_by(Species)%>%
summarise_at(vars(Sepal.Length), list(name=mean))

Regards!

- Joachim
  July 25, 2019 6:12 am
  
  Hey Ruben,
  
  Thank you for the hint, I’ve just changed the code accordingly 🙂
  
  Regards,
  
  Joachim
  
Ben Jerome
March 27, 2020 3:59 pm

Hi! I’m fairly new to using R. After finding the mean values of two categorical groups, how would I plot those means in a bar chart? Thank you

- Joachim
  March 30, 2020 8:33 am
  
  Hi Ben,
  
  Good question! You may have a look at this tutorial: https://statisticsglobe.com/barplot-in-r
  
  You may use the means of your two groups as the heights of the bars, i.e. store the two means in the vector “values” as shown in Example 1 of that tutorial.
  
  Regards,
  
  Joachim
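
A minimal sketch of this idea with the Iris data, using only base R. The group means come from the aggregate call of Example 1; the object name group_means and the axis label are chosen for illustration only:

```r
data(iris)

# Group means as in Example 1 (the output columns are named Group.1 and x)
group_means <- aggregate(x = iris$Sepal.Length,
                         by = list(iris$Species),
                         FUN = mean)

barplot(height = group_means$x,                         # Bar heights = group means
        names.arg = as.character(group_means$Group.1),  # Bar labels = group names
        ylab = "Mean Sepal.Length")
```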
  
Julianna
May 4, 2020 7:35 pm

Thank you so much for the easy-to-follow instructions! I’ve been working on a project since last week for my job and this cleared everything up for me in 5 minutes! Thank you!!!

- Joachim
  May 5, 2020 5:57 am
  
  Thank you for the comment, Julianna. I’m glad to hear that it helped! 🙂
  
Alicia
June 20, 2020 1:21 pm

Thanks so much for this! It was very helpful 🙂

- Joachim
  June 29, 2020 7:18 am
  
  Thanks Alicia, I’m happy to hear that 🙂
  
Jan
July 3, 2020 2:18 pm

Thank you for these instructions – very helpful! I am wondering if the output for the dplyr method is rounded? If so, is there a way for the output to not be rounded like the aggregate function?

Thanks!

- Joachim
  July 6, 2020 9:19 am
  Hi Jan,
  
  Thank you for the comment and the kind words!
  
  Actually, this is a very good question! The dplyr package returns the data in tibble format (in contrast to the data.frame format returned by the aggregate function). By default, tibbles print numbers with three significant digits. However, the actual values stored in the tibble are NOT rounded.
  
  You can see that by converting the tibble back to data.frame format:
  
  iris %>%                                        # Specify data frame
    group_by(Species) %>%                         # Specify group indicator
    summarise_at(vars(Sepal.Length),              # Specify column
                 list(name = mean)) %>%           # Specify function
    as.data.frame()                               # Convert tibble to data.frame
  
  #      Species  name
  # 1     setosa 5.006
  # 2 versicolor 5.936
  # 3  virginica 6.588
  
  I also found this thread, which discusses this topic in some more detail.
  
  I hope that helps!
  
  Joachim
Ngoc Mai Le
October 7, 2020 5:53 pm

Please help,
# Calculate the average_pop and median_pop columns
counties_selected %>%
group_by(region, state) %>%
summarize(total_pop = sum(population)) %>% ungroup()%>%
summarize(average_pop = mean(population), median_pop = median(population)
)
I got “object population not found”

- Joachim
  October 8, 2020 5:49 am
  
  Hey Ngoc,
  
  Thanks for the comment!
  
  I need a few more details. Is counties_selected a data frame? Could you tell me the column names of counties_selected?
  
  Regards,
  
  Joachim
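
One possible cause, which cannot be confirmed without seeing the data: after the first summarize call, the result contains only the grouping columns and total_pop, so a column called population no longer exists in the second summarize call. The sketch below uses a hypothetical stand-in for counties_selected (its real columns are unknown) and refers to total_pop instead; the object name pop_summary is also chosen for illustration only:

```r
library("dplyr")

# Hypothetical stand-in for counties_selected (the real columns are unknown)
counties_selected <- data.frame(
  region     = c("South", "South", "South", "West"),
  state      = c("Texas", "Texas", "Florida", "Utah"),
  population = c(100, 200, 300, 400)
)

pop_summary <- counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%    # Output: region, state, total_pop
  ungroup() %>%
  summarize(average_pop = mean(total_pop),      # Use total_pop, not population
            median_pop = median(total_pop))
pop_summary
```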
  

user1

June 7, 2021 10:43 am

Hi Joachim, thanks for this tutorial! My question: How do I switch rows and columns, so that I have, for example, only 2 columns (sex) and x rows? Thank you!
My example: I want to group all my latent variables by sex:
data %>%
group_by(sex) %>%
summarise_at(vars(example1, example2, example3, examplex), list(mean))
If I do so, I get very long rows.

One hint: I only use “list(mean)” and it names every column correctly.
Thank you in advance!

Joachim

June 8, 2021 5:44 am

Hey,

Thank you for the nice comment!

Could you illustrate what your input data looks like? What is contained in the variables example1, example2, and so on?

Regards

Joachim

user1

June 8, 2021 6:19 am

Hi Joachim,

the variables are for example: motivation, work engagement, authentic leadership
they were all measured with 5 or 7-point likert scales and I just want to list the means.

Is there any syntax for switching the rows and columns?

One more question: Where does the syntax to round the means have to be placed?
I would like the means to be rounded to two decimal places.

data %>%
group_by(sex) %>%
summarise_at(vars(motivation, work engagement, authentic leadership),
list(mean), round(2))

this doesn’t work…

Joachim

June 8, 2021 9:48 am

Hi again,

I have created some example data to illustrate how you might do that.

Example data:

set.seed(534976)
data <- data.frame(sex = sample(c("Male", "Female"), 10, replace = TRUE),
                   motivation = sample(1:5, 10, replace = TRUE),
                   work_engagement = sample(1:5, 10, replace = TRUE),
                   authentic_leadership = sample(1:5, 10, replace = TRUE))
data
#       sex motivation work_engagement authentic_leadership
# 1  Female          4               3                    5
# 2  Female          4               5                    1
# 3    Male          3               4                    3
# 4    Male          2               4                    1
# 5    Male          3               4                    3
# 6  Female          3               3                    2
# 7    Male          5               2                    2
# 8  Female          1               3                    5
# 9  Female          5               4                    3
# 10 Female          4               3                    1

Calculate mean values by group:

library("dplyr")
 
data_mean <- data %>%
  group_by(sex) %>%
  summarise_at(vars(motivation, work_engagement, authentic_leadership),
               list(mean)) %>% 
  as.data.frame()

Convert data from wide to long format:

library("tidyr")
 
data_mean_long <- as.data.frame(pivot_longer(data = data_mean,
                                             cols = c("motivation",
                                                      "work_engagement",
                                                      "authentic_leadership")))
data_mean_long$value <- round(data_mean_long$value, 2)
data_mean_long
#      sex                 name value
# 1 Female           motivation  3.50
# 2 Female      work_engagement  3.50
# 3 Female authentic_leadership  2.83
# 4   Male           motivation  3.25
# 5   Male      work_engagement  3.50
# 6   Male authentic_leadership  2.25

I hope that helps!

Joachim
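
As a possible alternative, the rounding can be placed inside the pipeline, and the base R function t() can switch rows and columns. The following is a minimal sketch, assuming dplyr 1.0.0 or newer; the example data and the object names data_example, data_mean_rounded, and data_mean_t are hypothetical and chosen for illustration only:

```r
library("dplyr")

# Hypothetical example data
set.seed(1)
data_example <- data.frame(sex = rep(c("Male", "Female"), each = 5),
                           motivation = sample(1:5, 10, replace = TRUE),
                           work_engagement = sample(1:5, 10, replace = TRUE))

data_mean_rounded <- data_example %>%
  group_by(sex) %>%
  summarise(across(c(motivation, work_engagement),
                   ~ round(mean(.x), 2))) %>%   # Round each mean to 2 decimals
  as.data.frame()

# t() switches rows and columns: one column per sex, one row per variable
data_mean_t <- t(data_mean_rounded[ , -1])
colnames(data_mean_t) <- data_mean_rounded$sex
data_mean_t
```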

cynthia sharon anya
September 15, 2021 9:12 am

Hello
I am trying to find the means by group, but I get an error each time:
mydata%>%
group_by(mydata$Subspecies)%>%
summarise_at(vars(mydata$Gall.sizes),list(name=mean))
str(mydata)

- Joachim
  September 16, 2021 6:16 am
  
  Hey Cynthia,
  
  What does the error message say?
  
  Regards
  
  Joachim
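
Without the error message the cause cannot be confirmed, but one thing worth checking is how the columns are referenced: inside dplyr functions, columns are normally given as bare names rather than with mydata$. The sketch below uses a hypothetical stand-in for mydata (the real data are not shown); the object name gall_means is chosen for illustration only:

```r
library("dplyr")

# Hypothetical stand-in for mydata (the real data are not shown)
mydata <- data.frame(Subspecies = rep(c("A", "B"), each = 3),
                     Gall.sizes = c(1, 2, 3, 4, 5, 6))

gall_means <- mydata %>%
  group_by(Subspecies) %>%                      # Bare column name, no mydata$
  summarise_at(vars(Gall.sizes),                # Bare column name, no mydata$
               list(name = mean))
gall_means
```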
  

Remya

March 15, 2022 2:00 pm

Hello,
my columns are called frame_number, name, x,y. Is there a possibility to group x by “name” but also the “frame number” at the same time? So that I would get the mean of all x values grouped by name but also just for a specific frame number?
Thanks in advance!

Joachim

March 15, 2022 2:40 pm

Hey Remya,

You may group your data frame by multiple columns as explained here.

Regards,
Joachim

Remya

March 15, 2022 3:19 pm

Hello Joachim,
I’ll explain it in German, because that is easier for me. I would like the mean of the “x” values per category of the column “name” (there I have blue, red, yellow, etc.) and per “frame_number”. So I have several x values for each frame and each color, and I would like to compute the mean, so that in the end I have only one x value per frame and color (in the column “name”).

Thank you very much for the answer!

Joachim

March 16, 2022 7:14 am

Hi Remya,

no problem, we can gladly write in German!

I have created an example that will hopefully help you with your question:

# Create example data
set.seed(345678)
data <- data.frame(name = sample(c("blau", "rot", "gelb"), 30, replace = TRUE),
                   frame_number = LETTERS[1:3],
                   x = round(rnorm(30, 5, 5)))
data
#    name frame_number  x
# 1  blau            A -1
# 2  gelb            B -6
# 3  gelb            C  6
# 4  gelb            A  4
# 5   rot            B  2
# 6   rot            C  4
# 7  blau            A  1
# 8  gelb            B 12
# 9   rot            C  2
# 10  rot            A  1
# 11 blau            B -1
# 12 blau            C  8
# 13 gelb            A  5
# 14  rot            B -2
# 15  rot            C -1
# 16 blau            A  0
# 17 gelb            B  1
# 18  rot            C  7
# 19 gelb            A  9
# 20  rot            B  0
# 21  rot            C -2
# 22 gelb            A  6
# 23 gelb            B  5
# 24  rot            C  4
# 25 blau            A  6
# 26 gelb            B  7
# 27 blau            C  7
# 28  rot            A -8
# 29  rot            B  0
# 30 blau            C  9

# Mean by groups based on multiple grouping columns
data_aggr <- aggregate(x ~ name + frame_number, data, mean)
data_aggr
#   name frame_number         x
# 1 blau            A  1.500000
# 2 gelb            A  6.000000
# 3  rot            A -3.500000
# 4 blau            B -1.000000
# 5 gelb            B  3.800000
# 6  rot            B  0.000000
# 7 blau            C  8.000000
# 8 gelb            C  6.000000
# 9  rot            C  2.333333

Best regards,
Joachim
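
The same grouping by two columns might also be written with dplyr. The following is a minimal sketch, assuming dplyr 1.0.0 or newer; the example data and the object names data_frames, data_aggr_dplyr, and data_aggr_base are hypothetical and chosen for illustration only:

```r
library("dplyr")

# Hypothetical example data with the column names from the question
set.seed(1)
data_frames <- data.frame(name = sample(c("blau", "rot", "gelb"), 30, replace = TRUE),
                          frame_number = rep(LETTERS[1:3], times = 10),
                          x = round(rnorm(30, 5, 5)))

data_aggr_dplyr <- data_frames %>%
  group_by(name, frame_number) %>%              # Two grouping columns
  summarise(x = mean(x), .groups = "drop")      # One mean per combination

# Restrict the result to a single frame, e.g. frame "A"
data_aggr_dplyr %>%
  filter(frame_number == "A")

# Base R result for comparison
data_aggr_base <- aggregate(x ~ name + frame_number, data_frames, mean)
```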

Lang
May 9, 2022 9:42 am

Hi Joachim,

how would you do this if you need to create a new variable that calculates the following for each row:

sepal.length for a specific row / mean per category

So for the first row, it would be like this: 5.1 / 5.006 (mean from setosa) = 1.018777. And for the 2nd row like this: 4.9 / 5.006 = 0.9788254. And so on for every observation, of course.

I can’t seem to figure it out. Thanks in advance

- Joachim
  May 9, 2022 12:59 pm
  
  Hey Lang,
  
  Thanks for the interesting question. It has inspired me to create a new tutorial on this topic. You can find it here.
  
  I hope that helps!
  
  Joachim
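
One possible base R approach to this calculation, not necessarily the one used in the linked tutorial, is the ave function, which returns the group mean for every row. A minimal sketch, where the object name iris_ratio and the column names group_mean and ratio are chosen for illustration only:

```r
data(iris)

iris_ratio <- iris

# ave() returns the mean of each row's group (its default FUN is mean)
iris_ratio$group_mean <- ave(iris_ratio$Sepal.Length, iris_ratio$Species)

# Each value divided by the mean of its own group
iris_ratio$ratio <- iris_ratio$Sepal.Length / iris_ratio$group_mean

head(iris_ratio, 2)
```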
  
Faraz
November 18, 2022 9:42 pm

Hi Dear,
I am new to R. I am trying to calculate the mean output of the dynaCop model in R. The model is set to produce output per day, but yield_green is the yield of coffee beans, which come once a year. Therefore, this variable has zeros all year except when beans are harvested, so the mean value appears as 3.7 or 3.9, although the maximum value is 2028 (beans). I want to convert this variable into years and calculate the exact mean value. Because there are many other output variables in the model, the R environment refers to it as “Large list.”
Please Help me

- Joachim
  November 21, 2022 12:30 pm
  
  Hey Faraz,
  
  Could you please share your code and illustrate the structure of your data in some more detail?
  
  Regards,
  Joachim
  