Before appending or joining, it might be necessary to compare data frames in R. For example, differences in column names and data types. Sometimes even necessary if you are dealing with an error caused by unmatching columns or data types. Besides that, you can also compare other content.
Here are a few data frames that I will compare.
aq1 <- head(airquality) # Ozone Solar.R Wind Temp Month Day #1 41 190 7.4 67 5 1 #2 36 118 8.0 72 5 2 #3 12 149 12.6 74 5 3 #4 18 313 11.5 62 5 4 #5 NA NA 14.3 56 5 5 #6 28 NA 14.9 66 5 6 aq1_no_col <- head(airquality)[-2] # Ozone Wind Temp Month Day #1 41 7.4 67 5 1 #2 36 8.0 72 5 2 #3 12 12.6 74 5 3 #4 18 11.5 62 5 4 #5 NA 14.3 56 5 5 #6 28 14.9 66 5 6 aq2 <- tail(airquality) aq2$Ozone <- as.character(aq2$Ozone) # Ozone Solar.R Wind Temp Month Day #148 14 20 16.6 63 9 25 #149 30 193 6.9 70 9 26 #150 NA 145 13.2 77 9 27 #151 14 191 14.3 75 9 28 #152 18 131 8.0 76 9 29 #153 20 223 11.5 68 9 30
Compare data frame column names and data types in R
There might be columns that don’t match between data frames by name or data type in R. One of the easiest ways to execute these comparisons is by using functions from the janitor package. For example, the function compare_df_cols can compare multiple data frame columns in a way that functions rbind and bind_rows are evaluating them to combine successfully.
janitor::compare_df_cols(aq1, aq1_no_col) # column_name aq1 aq1_no_col #1 Day integer integer #2 Month integer integer #3 Ozone integer integer #4 Solar.R integer NA #5 Temp integer integer #6 Wind numeric numeric
As you can see, the second data frame doesn’t have column Solar.R. You can extract that by using additional arguments.
janitor::compare_df_cols(aq1, aq1_no_col, return = "mismatch", bind_method = "rbind") # column_name aq1 aq1_no_col #1 Solar.R integer NA
In the scenario, when there are differences in column data types, it is not necessary to specify the bind method argument. In both functions data type is important.
janitor::compare_df_cols(aq1, aq2, return = "mismatch") # column_name aq1 aq2 #1 Ozone integer character
By knowing which columns are different by data type, you can make the necessary changes or try to do them automatically.
Comparison is not limited to only two data frames and you can do that with three or more.
janitor::compare_df_cols(aq1, aq1_no_col, aq2, return = "mismatch", bind_method = "rbind") #column_name aq1 aq1_no_col aq2 #1 Ozone integer integer character #2 Solar.R integer NA integer
Compare data frames in R by content
Function all_equal from dplyr can execute the comparison between data frame column names and data types similar to the janitor.
dplyr::all_equal(aq1, aq1_no_col) #- different number of columns: 6 vs 5
dplyr::all_equal(aq1, aq2) #[1] "- Different types for column `Ozone`: integer vs character"
Meanwhile, it can tell about differences in data and is related to base function all.equal. It appears there is a priority to differences between column names and types, and only after that comes the content.
aq3 <- tail(airquality) dplyr::all_equal(aq1, aq3) #"- Rows in x but not in y: 1, 2, 3, 4, 5, 6 #- Rows in y but not in x: 1, 2, 3, 4, 5, 6"
Another option from dplyr is the function setdiff that can return differences in R data frames.
dplyr::setdiff(aq1, airquality[2:4,]) # Ozone Solar.R Wind Temp Month Day #1 41 190 7.4 67 5 1 #2 NA NA 14.3 56 5 5 #3 28 NA 14.9 66 5 6
If you want to do a comprehensive analysis, try the function comparedf from the package arsenal.
require(arsenal) summary(comparedf(aq1, aq2))
You can create a summary that contains multiple analyses.
#[1] "frame.summary.table" "comparison.summary.table" "vars.ns.table" #[4] "vars.nc.table" "obs.table" "diffs.byvar.table" #[7] "diffs.table" "attrs.table" "control"
If you need only one of them, here is how to get that.
summary(comparedf(aq1, aq2))$vars.nc.table # var.x pos.x class.x var.y pos.y class.y #1 Ozone 1 integer Ozone 1 character
Compare data frame column names with vector
Sometimes it is necessary to select column names by using the vector. Mismatching may lead to this “undefined columns selected” error.
cols_to_keep <- c("Month", "Day", "Temp", "Whatever") aq1[cols_to_keep] #Error in `[.data.frame`(aq1, cols_to_keep) : undefined columns selected
Previously seen methods don’t help in this situation. One of the ways is to use dplyr to select necessary columns. The error message can specify which column is missing.
require(dplyr) aq1 %>% select(all_of(cols_to_keep)) #Error: Can't subset columns that don't exist. #x Column `Whatever` doesn't exist.
Otherwise, you can extract nonmatching values with the content of the vector.
subset(cols_to_keep, is.na( match(cols_to_keep, names(aq1)) ) ) #[1] "Whatever"
Leave a Reply