mean by group in R, average by group in R

Mean by a group in R

Here is how to calculate the mean by a group in R. If you are an Excel, user you might prefer to say average by a group in R. This post contains multiple scenarios that will ensure that you know some of the pitfalls and tricks.

 

I will use the airquality dataset, which is available in R. This dataset contains columns like temperature with necessary values to calculate the mean by month. Some columns contain NA values, which might be a pitfall in mean calculations, and it is necessary to specify if you want to ignore them.

data("airquality")


head(airquality)

#  Ozone Solar.R Wind Temp Month Day
#1    41     190  7.4   67     5   1
#2    36     118  8.0   72     5   2
#3    12     149 12.6   74     5   3
#4    18     313 11.5   62     5   4
#5    NA      NA 14.3   56     5   5
#6    28      NA 14.9   66     5   6

 

Mean by a group in R

To do necessary calculations by a group, I will use the group_by function from dplyr. It helps to do other calculations similarly, for example, minimum or maximum by group or percentage by a group.

require(dplyr)


airquality %>% 
  group_by(Month) %>% 
  summarise("mean Temp" = mean(Temp))

## A tibble: 5 x 2
#  Month `mean Temp`
#         
#1     5        65.5
#2     6        79.1
#3     7        83.9
#4     8        84.0
#5     9        76.9

Be careful if the column that you are using to calculate the mean by the group in R contains NA values. Use the na.rm parameter in the mean function if you want to avoid them.

airquality %>% 
  group_by(Month) %>% 
  summarise("mean Solar.R" = mean(Solar.R))

## A tibble: 5 x 2
#  Month `mean Solar.R`
#            
#1     5            NA 
#2     6           190.
#3     7           216.
#4     8            NA 
#5     9           167.


airquality %>% 
  group_by(Month) %>% 
  summarise("mean Solar.R" = mean(Solar.R, na.rm = TRUE))

## A tibble: 5 x 2
#  Month `mean Solar.R`
#            
#1     5           181.
#2     6           190.
#3     7           216.
#4     8           172.
#5     9           167.

 

Get mean by a group in a separate column in the unsummarised data set

If you want to calculate the mean by a group in R but don’t want to get the results in a summarised dataset, here is how to do that.

airquality %>% 
  group_by(Month) %>% 
  mutate("mean Temp" = mean(Temp)) %>% 
  head()


## A tibble: 6 x 7
## Groups:   Month [1]
#  Ozone Solar.R  Wind  Temp Month   Day `mean Temp`
#                
#1    41     190   7.4    67     5     1        65.5
#2    36     118   8      72     5     2        65.5
#3    12     149  12.6    74     5     3        65.5
#4    18     313  11.5    62     5     4        65.5
#5    NA      NA  14.3    56     5     5        65.5
#6    28      NA  14.9    66     5     6        65.5

 

Mean by the group that doesn’t exist yet

Here is an interesting way to use the group_by function. If you don’t have content you want to use for grouping, you can create one inside group_by.

For example, in this data frame is a column with a month as s numeric value. You might want to convert the month number to the month name and use that instead.

airquality %>% 
  group_by("MonthName" = month.name[airquality$Month]) %>% 
  summarise("mean Temp" = mean(Temp))

## A tibble: 5 x 2
#  MonthName `mean Temp`
#             
#1 August           84.0
#2 July             83.9
#3 June             79.1
#4 May              65.5
#5 September        76.9

It is important to sort month names as they are during the year. In that case, you have to sort text as numbers in R.

airquality %>%
  group_by("MonthName" = 
             factor(airquality$Month, labels = unique(month.name[airquality$Month]))) %>%
  summarise("mean Temp" = mean(Temp))

## A tibble: 5 x 2
#  MonthName `mean Temp`
#             
#1 May              65.5
#2 June             79.1
#3 July             83.9
#4 August           84.0
#5 September        76.9

Besides this unusual group_by use case, there are other dplyr tips and tricks that you might be interested in.

 

Data in the airqaulity dataset is from 1973. Here is how to create a combination with year and month in R and use that mean by group calculation.

First of all, with the help of lubridate package, I will join the necessary components into the date and, after that, change the format.

require(lubridate)

airquality %>% 
  group_by("Year and Month" = format(make_date(1973, Month, 1), '%B-%Y')) %>% 
  summarise("mean Temp" = mean(Temp)) 

## A tibble: 5 x 2
#  `Year and Month` `mean Temp`
#                    
#1 August-1973             84.0
#2 July-1973               83.9
#3 June-1973               79.1
#4 May-1973                65.5
#5 September-1973          76.9

Here is a good cheat sheet that contains date formatting options in R that might be useful in other situations.

In this blog, you can find other helpful calculations by a group. For example: