calculate across columns in R, mutate across columns, across dplyr

How to do something across columns in R using dplyr

If you know how to efficiently do calculations across columns in the R data frame it can save you a lot of time. One of the best ways to do that with data frames is by using the across function from dplyr.

 

Function across from dplyr is a versatile solution to do calculations across columns in R. That is the reason why it is mentioned in one of my favorite dplyr tips and tricks.

 

In this post are examples of how to:

  1. use dplyr across function across multiple columns,
  2. calculate mean or median across multiple columns,
  3. use across everything with exceptions,
  4. do the calculation if the column name contains a special feature.

 

Calculate across columns in R using mutate from dplyr

Here is the first data frame that I will use.

UPE <- as.data.frame(USPersonalExpenditure)

head(UPE)

#                      1940   1945  1950 1955  1960
#Food and Tobacco    22.200 44.500 59.60 73.2 86.80
#Household Operation 10.500 15.500 29.00 36.5 46.20
#Medical and Health   3.530  5.760  9.71 14.0 21.10
#Personal Care        1.040  1.980  2.45  3.4  5.40
#Private Education    0.341  0.974  1.80  2.6  3.64

Here is how to round values in all columns across the data frame. If you want to learn more about rounding numbers in R, look at this post.

require(dplyr)

UPE %>% mutate(across(everything(), round, digits = 0))

#                    1940 1945 1950 1955 1960
#Food and Tobacco      22   44   60   73   87
#Household Operation   10   16   29   36   46
#Medical and Health     4    6   10   14   21
#Personal Care          1    2    2    3    5
#Private Education      0    1    2    3    4

To do the same, but write the round function differently, try this approach.

UPE %>% mutate(across(everything(), ~round(., digits = 0)))

#                    1940 1945 1950 1955 1960
#Food and Tobacco      22   44   60   73   87
#Household Operation   10   16   29   36   46
#Medical and Health     4    6   10   14   21
#Personal Care          1    2    2    3    5
#Private Education      0    1    2    3    4

If you want to run calculations across everything with exceptions you can specify that like this.

UPE %>% mutate(across(-`1960`, ~round(., digits = 0)))

#                    1940 1945 1950 1955  1960
#Food and Tobacco      22   44   60   73 86.80
#Household Operation   10   16   29   36 46.20
#Medical and Health     4    6   10   14 21.10
#Personal Care          1    2    2    3  5.40
#Private Education      0    1    2    3  3.64

 

Replace NA across columns in R

Here is a data frame that contains NA values.

head(airquality)

#  Ozone Solar.R Wind Temp Month Day
#1    41     190  7.4   67     5   1
#2    36     118  8.0   72     5   2
#3    12     149 12.6   74     5   3
#4    18     313 11.5   62     5   4
#5    NA      NA 14.3   56     5   5
#6    28      NA 14.9   66     5   6

Here is how to replace them across necessary columns.

airquality %>% 
  mutate(across(c(Ozone, Solar.R), tidyr::replace_na, 0)) %>% 
  head()

#  Ozone Solar.R Wind Temp Month Day
#1    41     190  7.4   67     5   1
#2    36     118  8.0   72     5   2
#3    12     149 12.6   74     5   3
#4    18     313 11.5   62     5   4
#5     0       0 14.3   56     5   5
#6    28       0 14.9   66     5   6

Similarly, you can do other transformations across all or certain columns in the data frame, for example, change the data type. Here is an example of how to do that.

Here is another interesting example of how to run function ifelse across data frame columns by using base R or the across from dplyr.

 

Use mutate across columns and create new columns

If you want to add columns with transformations to the R data frame here is how to do that and modify the names of the newly created columns. It is a combination of functions mutate and across. In this example, it is used for the first two columns of the data frame.

UPE %>% mutate(across(1:2, round, digits = 0, .names = "{.col}_rounded"))

#                      1940   1945  1950 1955  1960 1940_rounded 1945_rounded
#Food and Tobacco    22.200 44.500 59.60 73.2 86.80           22           44
#Household Operation 10.500 15.500 29.00 36.5 46.20           10           16
#Medical and Health   3.530  5.760  9.71 14.0 21.10            4            6
#Personal Care        1.040  1.980  2.45  3.4  5.40            1            2
#Private Education    0.341  0.974  1.80  2.6  3.64            0            1

The cool thing about adding columns with function mutate is that you can move them to a certain position at the same time.

UPE %>%
  mutate(across(1:2, round, digits = 0, .names = "{.col}_rounded"),
         .before = `1940`)

#                    1940_rounded 1945_rounded   1940   1945  1950 1955  1960
#Food and Tobacco              22           44 22.200 44.500 59.60 73.2 86.80
#Household Operation           10           16 10.500 15.500 29.00 36.5 46.20
#Medical and Health             4            6  3.530  5.760  9.71 14.0 21.10
#Personal Care                  1            2  1.040  1.980  2.45  3.4  5.40
#Private Education              0            1  0.341  0.974  1.80  2.6  3.64

 

Calculate across columns in R and create a summary

Here is how to create a summary for all numeric data frame columns and calculate the mean by each group.

head(iris)

#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1          5.1         3.5          1.4         0.2  setosa
#2          4.9         3.0          1.4         0.2  setosa
#3          4.7         3.2          1.3         0.2  setosa
#4          4.6         3.1          1.5         0.2  setosa
#5          5.0         3.6          1.4         0.2  setosa
#6          5.4         3.9          1.7         0.4  setosa


iris %>% 
  group_by(Species) %>% 
  summarise(
    across(where(is.numeric), mean, na.rm = TRUE, .names = "{.col}_mean")
    ) %>% 
  print.data.frame()

#     Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
#1     setosa             5.006            3.428             1.462            0.246
#2 versicolor             5.936            2.770             4.260            1.326
#3  virginica             6.588            2.974             5.552            2.026

As you can see, if you want to do calculations across multiple columns that have common features then you can easily do that in combination with function across. Here is another example. Mean calculation for all the columns that end with a certain phrase in the name.

iris %>% 
  group_by(Species) %>% 
  summarise(
    across(ends_with('Width'), mean, na.rm = TRUE, .names = "{.col}_mean")
    ) %>% 
  print.data.frame()


#     Species Sepal.Width_mean Petal.Width_mean
#1     setosa            3.428            0.246
#2 versicolor            2.770            1.326
#3  virginica            2.974            2.026

Here is an example of how to do weighted mean calculation across columns in a similar situation.

There is also possible to do multiple calculations across columns. For example, mean and median at the same time.

iris %>%
  group_by(Species) %>%
  summarise(
    across(where(is.numeric), list(mean = mean, median = median), na.rm = TRUE)
    ) %>%
  select(1:3) %>%
  print.data.frame()

#     Species Sepal.Length_mean Sepal.Length_median
#1     setosa             5.006                 5.0
#2 versicolor             5.936                 5.9
#3  virginica             6.588                 6.5

If you want to run multiple functions across columns and create custom names, here is how to do that.

iris %>%
  group_by(Species) %>%
  summarise(
    across(where(is.numeric)
           , list(mean = mean, median = median)
           , na.rm = TRUE
           , .names = "{.fn}.{.col}")) %>%
  select(1:3) %>%
  print.data.frame()


#     Species mean.Sepal.Length median.Sepal.Length
#1     setosa             5.006                 5.0
#2 versicolor             5.936                 5.9
#3  virginica             6.588                 6.5

 

Here are a couple of other examples with column-wise calculations in R.

If you want to use base R capabilities, then in some situations is easy to apply them. For example, here is how to replace values in a range of columns in R with examples in base R and with across from dplyr.

Here is pure base R with functions from apply family to run calculations across data frame columns.


Posted

in

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *