compare data frames in R, compare column names in R, compare data types in R

Compare data frames in R, column names, data types, and content

Before appending or joining, it might be necessary to compare data frames in R. For example, differences in column names and data types. Sometimes even necessary if you are dealing with an error caused by unmatching columns or data types. Besides that, you can also compare other content.

 

Here are a few data frames that I will compare.

aq1 <- head(airquality)

#  Ozone Solar.R Wind Temp Month Day
#1    41     190  7.4   67     5   1
#2    36     118  8.0   72     5   2
#3    12     149 12.6   74     5   3
#4    18     313 11.5   62     5   4
#5    NA      NA 14.3   56     5   5
#6    28      NA 14.9   66     5   6


aq1_no_col <- head(airquality)[-2]

#  Ozone Wind Temp Month Day
#1    41  7.4   67     5   1
#2    36  8.0   72     5   2
#3    12 12.6   74     5   3
#4    18 11.5   62     5   4
#5    NA 14.3   56     5   5
#6    28 14.9   66     5   6


aq2 <- tail(airquality)
aq2$Ozone <- as.character(aq2$Ozone)

#    Ozone Solar.R Wind Temp Month Day
#148    14      20 16.6   63     9  25
#149    30     193  6.9   70     9  26
#150    NA     145 13.2   77     9  27
#151    14     191 14.3   75     9  28
#152    18     131  8.0   76     9  29
#153    20     223 11.5   68     9  30

 

Compare data frame column names and data types in R

There might be columns that don’t match between data frames by name or data type in R. One of the easiest ways to execute these comparisons is by using functions from the janitor package. For example, the function compare_df_cols can compare multiple data frame columns in a way that functions rbind and bind_rows are evaluating them to combine successfully.

janitor::compare_df_cols(aq1, aq1_no_col)

#  column_name     aq1 aq1_no_col
#1         Day integer    integer
#2       Month integer    integer
#3       Ozone integer    integer
#4     Solar.R integer         NA
#5        Temp integer    integer
#6        Wind numeric    numeric

As you can see, the second data frame doesn’t have column Solar.R. You can extract that by using additional arguments.

janitor::compare_df_cols(aq1, aq1_no_col, return = "mismatch", bind_method = "rbind")

#  column_name     aq1 aq1_no_col
#1     Solar.R integer         NA

In the scenario, when there are differences in column data types, it is not necessary to specify the bind method argument. In both functions data type is important.

janitor::compare_df_cols(aq1, aq2, return = "mismatch")

#  column_name     aq1       aq2
#1       Ozone integer character

By knowing which columns are different by data type, you can make the necessary changes or try to do them automatically.

Comparison is not limited to only two data frames and you can do that with three or more.

janitor::compare_df_cols(aq1,
                         aq1_no_col,
                         aq2,
                         return = "mismatch",
                         bind_method = "rbind")

#column_name     aq1 aq1_no_col       aq2
#1       Ozone integer    integer character
#2     Solar.R integer         NA   integer

 

Compare data frames in R by content

Function all_equal from dplyr can execute the comparison between data frame column names and data types similar to the janitor.

dplyr::all_equal(aq1, aq1_no_col)

#- different number of columns: 6 vs 5
dplyr::all_equal(aq1, aq2)

#[1] "- Different types for column `Ozone`: integer vs character"

Meanwhile, it can tell about differences in data and is related to base function all.equal. It appears there is a priority to differences between column names and types, and only after that comes the content.

aq3 <- tail(airquality)

dplyr::all_equal(aq1, aq3)

#"- Rows in x but not in y: 1, 2, 3, 4, 5, 6
#- Rows in y but not in x: 1, 2, 3, 4, 5, 6"

Another option from dplyr is the function setdiff that can return differences in R data frames.

dplyr::setdiff(aq1, airquality[2:4,])

#  Ozone Solar.R Wind Temp Month Day
#1    41     190  7.4   67     5   1
#2    NA      NA 14.3   56     5   5
#3    28      NA 14.9   66     5   6

If you want to do a comprehensive analysis, try the function comparedf from the package arsenal.

require(arsenal)

summary(comparedf(aq1, aq2))

You can create a summary that contains multiple analyses.

#[1] "frame.summary.table"      "comparison.summary.table" "vars.ns.table"           
#[4] "vars.nc.table"            "obs.table"                "diffs.byvar.table"       
#[7] "diffs.table"              "attrs.table"              "control" 

If you need only one of them, here is how to get that.

summary(comparedf(aq1, aq2))$vars.nc.table

#  var.x pos.x class.x var.y pos.y   class.y
#1 Ozone     1 integer Ozone     1 character

 

Compare data frame column names with vector

Sometimes it is necessary to select column names by using the vector. Mismatching may lead to this “undefined columns selected” error.

cols_to_keep <- c("Month", "Day", "Temp", "Whatever")

aq1[cols_to_keep]

#Error in `[.data.frame`(aq1, cols_to_keep) : undefined columns selected

Previously seen methods don’t help in this situation. One of the ways is to use dplyr to select necessary columns. The error message can specify which column is missing.

require(dplyr)

aq1 %>% select(all_of(cols_to_keep))

#Error: Can't subset columns that don't exist.
#x Column `Whatever` doesn't exist.

Otherwise, you can extract nonmatching values with the content of the vector.

subset(cols_to_keep, 
       is.na(
         match(cols_to_keep, names(aq1))
       )
)

#[1] "Whatever"