Return matching patterns in R

Suppose you are searching a data frame column of text for several patterns at once. The question is which of those patterns were actually found in each text. A single text can contain several different patterns, and the same pattern can occur more than once.

Which pattern caused a record to pass the filter? Which pattern occurs most often?

 

Matching patterns in R

There are many situations where you want to know which patterns matched. I will use the sentences dataset from the stringr package as a data frame column.

library(dplyr)
library(stringr)

head(sentences)

#[1] "The birch canoe slid on the smooth planks."  "Glue the sheet to the dark blue background."
#[3] "It's easy to tell the depth of a well."      "These days a chicken leg is a rare dish."   
#[5] "Rice is often served in round bowls."        "The juice of lemons makes fine punch."

sdf <- data.frame("sentence" = sentences)

Here is a vector of words that I will use to find matches.

prm <- c("water", "juice", "milk")

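Using str_detect from stringr together with filter from dplyr, I can keep the rows that contain any of the words. Collapsing the words with "|" builds a single regular expression that matches any of them.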

head(sdf %>% filter(str_detect(sentence, paste(prm, collapse = "|"))), 2)

#                                   sentence
#1     The juice of lemons makes fine punch.
#2 The dune rose from the edge of the water.
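To make the filtering step easier to follow, here is a minimal sketch of what the collapsed pattern looks like and what str_detect returns for it. The vectors words and txt are made-up toy inputs, not part of the sentences dataset.

```r
library(stringr)

# Toy inputs, assumed for illustration only
words <- c("water", "juice", "milk")
txt <- c("Cold milk and juice", "Dry toast", "Hot water")

# Collapse the words into a single regex alternation
pattern <- paste(words, collapse = "|")
pattern
# "water|juice|milk"

# One logical value per string: TRUE if any of the words occurs in it
hits <- str_detect(txt, pattern)
hits
# TRUE FALSE TRUE
```

filter keeps exactly the rows where this logical vector is TRUE, which is why the result says nothing about which word was found.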

The problem is that I can't see which patterns matched. That is what str_extract_all returns. If I filter the rows that contain any of the words from the vector, I get a column that lists every matching string, once per occurrence.

sdf$match <- str_extract_all(sdf$sentence, paste(prm, collapse = "|")) 

sdf %>% filter(str_detect(sentence, paste(prm, collapse = "|")))

#                                             sentence        match
#1               The juice of lemons makes fine punch.        juice
#2           The dune rose from the edge of the water.        water
#3          The fin was sharp and cut the clear water.        water
#4                   A rag will soak up spilled water.        water
#5            Float the soap on top of the bath water.        water
#6                 The large house had hot water taps.        water
#7                 To make pure ice, you freeze water.        water
#8                     Grape juice and water mix well. juice, water
#9         A quart of milk is water for the most part.  milk, water
#10 The water in this well is a source of good health.        water
#11   A cruise in warm waters in a sleek yacht is fun.        water

Rows without a match hold an empty character vector, so character(0) may show up in the match column, for example in the RStudio data viewer. It usually does not interfere with later steps.
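If the empty entries get in the way, one possible approach is to count the matches per row with lengths, or to collapse each list element into a plain string. This is only a sketch on two toy strings; the names m, n_matches and match_chr are assumptions for illustration.

```r
library(stringr)

# Toy inputs, assumed for illustration only
txt <- c("Grape juice and water mix well.",
         "Rice is often served in round bowls.")
pattern <- "water|juice|milk"

# str_extract_all returns a list with one character vector per input string
m <- str_extract_all(txt, pattern)

# Number of matches per string; 0 marks the character(0) entries
n_matches <- lengths(m)
n_matches
# 2 0

# Collapse each element to one string; empty match sets become NA
match_chr <- vapply(
  m,
  function(x) if (length(x) == 0) NA_character_ else paste(x, collapse = ", "),
  character(1)
)
match_chr
# "juice, water" NA
```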

 

Summarize all matching patterns in R

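Now that you have all the matching patterns, you can summarize them. You can count them with the count function from dplyr and then create a column for each pattern with pivot_wider from tidyr.

library(tidyr)

x <- sdf %>%
  # one list element of matches per sentence
  mutate(match = str_extract_all(sentence, paste(prm, collapse = "|"))) %>%
  # one row per match; keep_empty = TRUE keeps sentences without a match (match is NA)
  unnest(match, keep_empty = TRUE, names_repair = "unique") %>%
  count(sentence, match, name = "cnt") %>%
  # one column per pattern
  pivot_wider(names_from = match, values_from = cnt) %>%
  # drop the second column, which here is the NA column of unmatched sentences
  select(-2)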

x %>% filter(sentence == "Grape juice and water mix well.") %>% as.data.frame()

#                         sentence water milk juice
#1 Grape juice and water mix well.     1   NA     1
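To answer which pattern occurs most often overall, the per-sentence matches can also be flattened and tabulated. The sketch below uses base R's table on four sentences picked from the output above; all_matches and freq are names assumed for illustration.

```r
library(stringr)

# A few sentences taken from the filtered output above
txt <- c("Grape juice and water mix well.",
         "A quart of milk is water for the most part.",
         "The juice of lemons makes fine punch.",
         "A rag will soak up spilled water.")
pattern <- "water|juice|milk"

# Flatten the list of matches into one character vector
all_matches <- unlist(str_extract_all(txt, pattern))

# Count each pattern and sort from most to least frequent
freq <- sort(table(all_matches), decreasing = TRUE)
freq
# water: 3, juice: 2, milk: 1
```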

 

Exact match with str_detect, grepl, or other functions

So far, the words I used to detect a match and filter records caused no problems. But what if a pattern has to match a whole word? In the example below, searching for “wine” also returns “twine”.

library(dplyr)
library(stringr)

sdf <- data.frame("sentence" = sentences)

prm <- c("wine", "juice") 

sdf %>% filter(str_detect(sentence, paste(prm, collapse = "|")))

#                                         sentence
#1           The juice of lemons makes fine punch.
#2       Port is a strong wine with s smoky taste.
#3                 Grape juice and water mix well.
#4      The bunch of grapes was pressed into wine.
#5 Sever the twine with a quick snip of the knife.

You can add regex word boundaries (\\b) around a pattern so that str_detect only matches it as a whole word.

prm <- c("\\bwine\\b", "juice") 

sdf %>% filter(str_detect(sentence, paste(prm, collapse = "|")))

#                                    sentence
#1      The juice of lemons makes fine punch.
#2  Port is a strong wine with s smoky taste.
#3            Grape juice and water mix well.
#4 The bunch of grapes was pressed into wine.
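The same idea can be applied to every word at once, and it also works with base R's grepl. The sketch below wraps the whole alternation in one pair of boundaries. It assumes the words contain no regex metacharacters, and the variable names are made up for illustration.

```r
# Words to match as whole words (assumed to contain no regex metacharacters)
words <- c("wine", "juice")

# Three sentences taken from the output above
txt <- c("Sever the twine with a quick snip of the knife.",
         "The bunch of grapes was pressed into wine.",
         "Grape juice and water mix well.")

# Wrap the alternation in word boundaries
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
pattern
# "\\b(wine|juice)\\b"

# Whole-word match: "twine" is no longer detected
exact <- grepl(pattern, txt, perl = TRUE)
exact
# FALSE TRUE TRUE

# Plain substring match for comparison: "twine" is detected
loose <- grepl(paste(words, collapse = "|"), txt)
loose
# TRUE TRUE TRUE
```

The same pattern string can be passed to str_detect in the filter call shown earlier.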

 

You might be interested in detecting multiple patterns or extracting text in R.




