How to create a boxplot in R and extract outliers

Outlier detection is a very broad topic, and the boxplot is one part of it. Here is how to create a boxplot in R and extract its outliers. There are a few things to consider when creating a boxplot in R or anywhere else. Does the boxplot show all the necessary information? Sometimes it matters how many data points you have. Do you need to build the boxplot by subcategories?


Here is my data. If you want to know more about how to build an R data frame from scratch, check this out.

DATA <- data.frame(
  DATE  = c("2018-10-01", "2018-10-02", "2018-10-03", "2018-10-04", "2018-10-05", "2018-10-06", "2018-10-07", "2018-10-08", "2018-10-09", "2018-10-10", "2018-10-11", "2018-10-12", "2018-10-13", "2018-10-14", "2018-10-15", "2018-10-16", "2018-10-17", "2018-10-18", "2018-10-19", "2018-10-20", "2018-10-21", "2018-10-22", "2018-10-23", "2018-10-24", "2018-10-25", "2018-10-26", "2018-10-27", "2018-10-28", "2018-10-29", "2018-10-30", "2018-10-31", "2018-11-01", "2018-11-02", "2018-11-03", "2018-11-04"),
  WEEKDAY = as.numeric(c("1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7")),
  VALUE = as.numeric(c("1878", "1754", "1778", "1792", "1638", "820", "950", "1740", "1822", "1680", "1990", "1880", "1504", "808", "1822", "1746", "3748", "2098", "1720", "906", "832", "2092", "2198", "2074", "1892", "1824", "1048", "938", "1790", "2006", "4010", "1808", "1866", "1106", "874"))
)

As the preview suggests, values on working days and weekends can differ considerably, so it is important to look for outliers within each of those categories as well. To learn more about the method by which outliers are detected with boxplot, go here.
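The rule behind boxplot outliers can be sketched in a few lines. This is a minimal sketch, assuming the default coefficient of 1.5 that boxplot() and boxplot.stats() use; the names five, lower_fence, upper_fence and manual_out are illustrative and not part of the original post.

```r
# VALUE column from the data frame above, as a plain numeric vector
VALUE <- c(1878, 1754, 1778, 1792, 1638, 820, 950,
           1740, 1822, 1680, 1990, 1880, 1504, 808,
           1822, 1746, 3748, 2098, 1720, 906, 832,
           2092, 2198, 2074, 1892, 1824, 1048, 938,
           1790, 2006, 4010, 1808, 1866, 1106, 874)

# fivenum() returns min, lower hinge, median, upper hinge, max
five <- fivenum(VALUE)
iqr  <- five[4] - five[2]            # distance between the hinges

# Points beyond 1.5 * IQR from the hinges are treated as outliers
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr
manual_out  <- VALUE[VALUE < lower_fence | VALUE > upper_fence]

# boxplot.stats() is the helper that boxplot() relies on for the same rule
builtin_out <- boxplot.stats(VALUE)$out

manual_out
identical(manual_out, builtin_out)
```

Applied to all values at once, this flags only 3748 and 4010. Splitting the data into categories changes the hinges and fences, which is one reason to check subcategories separately.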

I also added a column to this data frame that divides the days into two categories: working day and weekend.

DATA$DAYTYPE <- ifelse(DATA$WEEKDAY >5, "WEEKEND", "WORKING DAY")
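Since the number of data points per category matters when reading a boxplot, a quick check of counts and medians can help. The following is a small sketch that rebuilds the same data in a more compact form; the names n_by_type and median_by_type are illustrative.

```r
# Same data as in the post, built in a more compact form
DATA <- data.frame(
  DATE    = as.character(seq(as.Date("2018-10-01"), by = "day", length.out = 35)),
  WEEKDAY = rep(1:7, times = 5),
  VALUE   = c(1878, 1754, 1778, 1792, 1638, 820, 950,
              1740, 1822, 1680, 1990, 1880, 1504, 808,
              1822, 1746, 3748, 2098, 1720, 906, 832,
              2092, 2198, 2074, 1892, 1824, 1048, 938,
              1790, 2006, 4010, 1808, 1866, 1106, 874)
)
DATA$DAYTYPE <- ifelse(DATA$WEEKDAY > 5, "WEEKEND", "WORKING DAY")

# How many observations fall into each category
n_by_type <- table(DATA$DAYTYPE)

# Median VALUE per category
median_by_type <- tapply(DATA$VALUE, DATA$DAYTYPE, median)

n_by_type
median_by_type
```

With 10 weekend days against 25 working days, and medians that are far apart, a single boxplot over all values would mix two quite different distributions.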

Boxplot by subcategories

This is easily done with the base function boxplot().

boxplot(DATA$VALUE ~ DATA$DAYTYPE)

The $out element of the value returned by boxplot() lists the outliers found in each subcategory.

boxplot(DATA$VALUE ~ DATA$DAYTYPE)$out
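The $out vector alone does not say which subcategory each outlier belongs to. The value returned by boxplot() also contains $group and $names, which can be combined as in this sketch. The data is rebuilt in compact form so the example runs on its own; the names bp and outliers_by_group are illustrative.

```r
# Same data as in the post, built in a more compact form
DATA <- data.frame(
  DATE    = as.character(seq(as.Date("2018-10-01"), by = "day", length.out = 35)),
  WEEKDAY = rep(1:7, times = 5),
  VALUE   = c(1878, 1754, 1778, 1792, 1638, 820, 950,
              1740, 1822, 1680, 1990, 1880, 1504, 808,
              1822, 1746, 3748, 2098, 1720, 906, 832,
              2092, 2198, 2074, 1892, 1824, 1048, 938,
              1790, 2006, 4010, 1808, 1866, 1106, 874)
)
DATA$DAYTYPE <- ifelse(DATA$WEEKDAY > 5, "WEEKEND", "WORKING DAY")

# plot = FALSE returns the statistics without drawing anything
bp <- boxplot(VALUE ~ DAYTYPE, data = DATA, plot = FALSE)

# bp$group holds the index of the box each outlier belongs to,
# bp$names holds the box labels in the same order
outliers_by_group <- data.frame(
  DAYTYPE = bp$names[bp$group],
  VALUE   = bp$out
)
outliers_by_group
```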

How to extract R data frame rows with boxplot outliers

To get all rows of the data frame that contain outliers detected by the boxplot, you can use the subset function.

subset(DATA, DATA$VALUE %in% boxplot(DATA$VALUE ~ DATA$DAYTYPE)$out)
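One caveat with matching on VALUE alone: if a value that is an outlier in one category also appears as a regular value in another category, both rows would be returned. That does not happen with this data, but a category-aware variant can be sketched as follows. The matching keys out_key and row_key are my own helper names, not something from the original post.

```r
# Same data as in the post, built in a more compact form
DATA <- data.frame(
  DATE    = as.character(seq(as.Date("2018-10-01"), by = "day", length.out = 35)),
  WEEKDAY = rep(1:7, times = 5),
  VALUE   = c(1878, 1754, 1778, 1792, 1638, 820, 950,
              1740, 1822, 1680, 1990, 1880, 1504, 808,
              1822, 1746, 3748, 2098, 1720, 906, 832,
              2092, 2198, 2074, 1892, 1824, 1048, 938,
              1790, 2006, 4010, 1808, 1866, 1106, 874)
)
DATA$DAYTYPE <- ifelse(DATA$WEEKDAY > 5, "WEEKEND", "WORKING DAY")

# Compute the boxplot statistics once, without drawing the plot
bp <- boxplot(VALUE ~ DAYTYPE, data = DATA, plot = FALSE)

# Build "category value" keys so a row only matches an outlier
# that was detected within its own category
out_key <- paste(bp$names[bp$group], bp$out)
row_key <- paste(DATA$DAYTYPE, DATA$VALUE)

outlier_rows <- DATA[row_key %in% out_key, ]
outlier_rows
```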

To visualize the boxplot with all data points and highlight the outliers in another color, I added two columns to my data frame: OUTLIER and INLIER. Note the plot argument in the boxplot function: by default boxplot draws the plot, so I switch that off here because only the returned outlier values are needed.

The initial code looks like this.

# columns for visualization
DATA$OUTLIER <- ifelse(DATA$VALUE %in% boxplot(DATA$VALUE ~ DATA$DAYTYPE, plot = FALSE)$out, DATA$VALUE, NA)
DATA$INLIER <- ifelse(is.na(DATA$OUTLIER), DATA$VALUE, NA)

 

R boxplot with data points and outliers in a different color

Here is ggplot2-based code to do that.

I also used the ggrepel package and its geom_text_repel function to handle the data labels. It positions them so that they do not overlap and are easy to read.

The position argument of geom_jitter, set with position_jitter and a fixed seed, was very important: using the same seed for the outlier points and for their labels applies the same jitter to both, so each label stays next to its point.

Using stat_boxplot with geom = "errorbar", I added horizontal caps to the whiskers of the ggplot boxplot so that the upper and lower limits are easier to see.

library(ggplot2)
library(ggrepel)
library(dplyr) # provides the %>% pipe

DATA %>%
  ggplot(aes(x = DAYTYPE, y = VALUE, label = DATE)) +
  theme_minimal() +
  theme(axis.text.x = element_text(colour = "gray44"), axis.title = element_text(colour = "gray44")) + # change color of the axis labels and titles
  stat_boxplot(geom = "errorbar", width = 0.5) + # add horizontal caps to the boxplot whiskers
  geom_boxplot(aes(fill = DAYTYPE), alpha = 0.5, outlier.shape = NA, show.legend = FALSE) + # hide default outlier points, they are drawn separately below
  geom_jitter(aes(x = DAYTYPE, y = INLIER), color = "gray44", na.rm = TRUE, size = 1.8, stroke = 0, alpha = 0.5) + # add inlier points with jitter
  geom_jitter(aes(x = DAYTYPE, y = OUTLIER), color = "red", na.rm = TRUE, size = 1.8, stroke = 0, alpha = 0.5, position = position_jitter(seed = 1)) + # add outlier points in a different color
  geom_text_repel(aes(x = DAYTYPE, y = OUTLIER), size = 4, position = position_jitter(seed = 1), na.rm = TRUE, color = "gray44") # label outliers with their date, same seed as the outlier points
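One detail worth checking when mixing base R and ggplot2: base boxplot() derives its fences from the hinges returned by fivenum(), while ggplot2's boxplot statistic is, as far as I know, based on quantile(). The two can give slightly different quartiles on small samples, so the red points could in principle disagree with the drawn box. The sketch below compares both rules per category using base R only; flag_hinge and flag_quantile are hypothetical helper names.

```r
# Same data as in the post, built in a more compact form
DATA <- data.frame(
  DATE    = as.character(seq(as.Date("2018-10-01"), by = "day", length.out = 35)),
  WEEKDAY = rep(1:7, times = 5),
  VALUE   = c(1878, 1754, 1778, 1792, 1638, 820, 950,
              1740, 1822, 1680, 1990, 1880, 1504, 808,
              1822, 1746, 3748, 2098, 1720, 906, 832,
              2092, 2198, 2074, 1892, 1824, 1048, 938,
              1790, 2006, 4010, 1808, 1866, 1106, 874)
)
DATA$DAYTYPE <- ifelse(DATA$WEEKDAY > 5, "WEEKEND", "WORKING DAY")

# Rule used by base boxplot(): fences built from the hinges
flag_hinge <- function(v) v %in% boxplot.stats(v)$out

# Same 1.5 * IQR rule, but with quartiles from quantile() (default type 7)
flag_quantile <- function(v) {
  q     <- quantile(v, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}

# ave() applies the function within each DAYTYPE and keeps row order;
# it returns 0/1 here because VALUE is numeric, hence the comparison with 1
by_hinge    <- ave(DATA$VALUE, DATA$DAYTYPE, FUN = flag_hinge) == 1
by_quantile <- ave(DATA$VALUE, DATA$DAYTYPE, FUN = flag_quantile) == 1

identical(by_hinge, by_quantile)
DATA[by_hinge, c("DATE", "DAYTYPE", "VALUE")]
```

For this data set both rules flag the same three rows, so the OUTLIER column computed with base boxplot() is consistent with the boxes that ggplot2 draws.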



Here are other great examples and my inspiration – boxplot with jitter.

Here is the full code used in this post.

DATA <- data.frame(
  DATE  = c("2018-10-01", "2018-10-02", "2018-10-03", "2018-10-04", "2018-10-05", "2018-10-06", "2018-10-07", "2018-10-08", "2018-10-09", "2018-10-10", "2018-10-11", "2018-10-12", "2018-10-13", "2018-10-14", "2018-10-15", "2018-10-16", "2018-10-17", "2018-10-18", "2018-10-19", "2018-10-20", "2018-10-21", "2018-10-22", "2018-10-23", "2018-10-24", "2018-10-25", "2018-10-26", "2018-10-27", "2018-10-28", "2018-10-29", "2018-10-30", "2018-10-31", "2018-11-01", "2018-11-02", "2018-11-03", "2018-11-04"),
  WEEKDAY = as.numeric(c("1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7")),
  VALUE = as.numeric(c("1878", "1754", "1778", "1792", "1638", "820", "950", "1740", "1822", "1680", "1990", "1880", "1504", "808", "1822", "1746", "3748", "2098", "1720", "906", "832", "2092", "2198", "2074", "1892", "1824", "1048", "938", "1790", "2006", "4010", "1808", "1866", "1106", "874"))
)

DATA$DAYTYPE <- ifelse(DATA$WEEKDAY >5, "WEEKEND", "WORKING DAY")

# base boxplot and outliers

boxplot(DATA$VALUE ~ DATA$DAYTYPE)
boxplot(DATA$VALUE ~ DATA$DAYTYPE)$out
subset(DATA, DATA$VALUE %in% boxplot(DATA$VALUE ~ DATA$DAYTYPE)$out)

# columns for visualization

DATA$OUTLIER <- ifelse(DATA$VALUE %in% boxplot(DATA$VALUE ~ DATA$DAYTYPE, plot = FALSE)$out, DATA$VALUE, NA)
DATA$INLIER <- ifelse(is.na(DATA$OUTLIER), DATA$VALUE, NA)


# final boxplot

library(ggplot2)
library(ggrepel)
library(dplyr) # provides the %>% pipe

DATA %>%
  ggplot(aes(x = DAYTYPE, y = VALUE, label = DATE)) +
  theme_minimal() +
  theme(axis.text.x = element_text(colour = "gray44"), axis.title = element_text(colour = "gray44")) + # change color of the axis labels and titles
  stat_boxplot(geom = "errorbar", width = 0.5) + # add horizontal caps to the boxplot whiskers
  geom_boxplot(aes(fill = DAYTYPE), alpha = 0.5, outlier.shape = NA, show.legend = FALSE) + # hide default outlier points, they are drawn separately below
  geom_jitter(aes(x = DAYTYPE, y = INLIER), color = "gray44", na.rm = TRUE, size = 1.8, stroke = 0, alpha = 0.5) + # add inlier points with jitter
  geom_jitter(aes(x = DAYTYPE, y = OUTLIER), color = "red", na.rm = TRUE, size = 1.8, stroke = 0, alpha = 0.5, position = position_jitter(seed = 1)) + # add outlier points in a different color
  geom_text_repel(aes(x = DAYTYPE, y = OUTLIER), size = 4, position = position_jitter(seed = 1), na.rm = TRUE, color = "gray44") # label outliers with their date, same seed as the outlier points

 

