Here is how to extract phone numbers, IDs, or any other pattern of digits from text in R. With the help of regular expressions (regex), this seemingly complicated task becomes straightforward.
Extract pattern of numbers from text in R
First of all, look critically at the available text. Clean it up as much as possible to reduce pattern variations, for example by removing or replacing unwanted characters such as unnecessary white space.
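As a minimal sketch of that cleanup step, the snippet below collapses stray white space with str_squish from the stringr package. The messy input string is a made-up example, not data from this article.

```r
library(stringr)

# Hypothetical messy input: leading/trailing spaces, repeated spaces, and a tab
raw <- "  Text   that contains\tid 346788  or 482991. "

# str_squish() trims both ends and collapses every run of white space to one space
clean <- str_squish(raw)
clean
#[1] "Text that contains id 346788 or 482991."
```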
Here is a data frame that contains multiple number patterns, and some of them are ids.
tn <- data.frame("text" = "Text that contains id 346788 or 482991 and something like 123456 that is not.")
There are functions like str_extract and str_extract_all from the stringr package that can use regular expressions as a pattern parameter.
Let’s say that an ID contains 6 digits, and each of them might be between 0 and 9. Any digit between 0 and 9 is written as [0-9], and {6} means that there should be exactly 6 consecutive digits from that range.
stringr::str_extract_all(tn$text, "[0-9]{6}")
#[[1]]
#[1] "346788" "482991" "123456"
If you use the function str_extract instead, you will get only the first match in each text.
stringr::str_extract(tn$text, "[0-9]{6}")
#[1] "346788"
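If the text also contains numbers with more than 6 digits, the plain pattern will match a part of them. Adding the white space symbol (\\s) at the beginning and the end of the pattern prevents that. Note that the matched white space becomes part of the result, and that numbers at the very beginning or end of the string, or next to punctuation, are not matched.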
stringr::str_extract_all(tn$text, "\\s[0-9]{6}\\s")
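As a possible alternative to the white space approach, here is a sketch that uses lookarounds to require that no digit sits directly before or after the 6 digits. The text in x is a made-up example with an ID at the start of the string, an 8-digit number, and an ID followed by a period.

```r
library(stringr)

# Hypothetical text, not from the article
x <- "346788 is an id and 12345678 is not, unlike 482991."

# The plain pattern also cuts 6 digits out of the 8-digit number
plain <- str_extract_all(x, "[0-9]{6}")[[1]]
plain
#[1] "346788" "123456" "482991"

# The white space pattern misses both ids: one starts the string,
# the other is followed by a period
spaced <- str_extract_all(x, "\\s[0-9]{6}\\s")[[1]]
spaced
#character(0)

# (?<![0-9]) = not preceded by a digit, (?![0-9]) = not followed by a digit
ids <- str_extract_all(x, "(?<![0-9])[0-9]{6}(?![0-9])")[[1]]
ids
#[1] "346788" "482991"
```

The lookarounds only check the neighbouring characters, so nothing extra ends up in the result.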
Let’s say that the ID pattern contains digits within different ranges. For example, the first digit should be 3 or 4, and the other 5 digits between 0 and 9. With regex, you can restrict each position to the digits you need.
stringr::str_extract_all(tn$text, "[3-4][0-9]{5}")
#[[1]]
#[1] "346788" "482991"
Create a data frame that contains a new line for each detected pattern of numbers
This might be necessary to do further comparisons or calculations in R. Here is a data frame with a new column of all matching patterns of numbers.
tn$numbers <- stringr::str_extract_all(tn$text, "[3-4][0-9]{5}")
tn
#                                                                            text        numbers
#1 Text that contains id 346788 or 482991 and something like 123456 that is not. 346788, 482991
By using the unnest function from the tidyr package, you can split the list column into separate rows, one for each match.
library(tidyr)
tn <- tn %>%
  unnest(numbers, keep_empty = TRUE, names_repair = "unique") %>%
  as.data.frame()
tn
#                                                                            text numbers
#1 Text that contains id 346788 or 482991 and something like 123456 that is not.  346788
#2 Text that contains id 346788 or 482991 and something like 123456 that is not.  482991
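As a rough illustration of the comparisons and calculations mentioned above, the sketch below works on the long-format result. The data frame is rebuilt by hand so that the snippet runs on its own, and known_ids is a hypothetical lookup vector.

```r
# Long-format result, rebuilt here so the snippet is self-contained
tn_long <- data.frame(
  text = "Text that contains id 346788 or 482991 and something like 123456 that is not.",
  numbers = c("346788", "482991")
)

# Hypothetical list of ids to compare against
known_ids <- c("482991", "555555")

# Extracted values are character strings: compare them as text ...
tn_long$known <- tn_long$numbers %in% known_ids

# ... or convert them before doing arithmetic
tn_long$id_num <- as.integer(tn_long$numbers)

tn_long[, c("numbers", "known", "id_num")]
```

Keeping IDs as character is usually safer when they can start with zero, because the conversion to a number would drop the leading zeros.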
Extract phone numbers from text in R
There are specialized functions for that, like ex_phone from the qdapRegex package, but phone number formats vary widely. A valid phone number in one region might not be valid in another. The best solution is one specially adapted to the specific situation. If you know the main principles of how to use regex to extract the necessary combinations, you will succeed.
Let’s start with the definition that you are looking for any phone number that contains 8 digits, and each of them is between 0 and 9. In this example, one of the phone numbers contains a country code, which is not wanted in the results.
tn <- data.frame("text" = "Text that contains phone numbers like +37121200022 and 20123456 or something like 88888888 that is not.")
stringr::str_extract_all(tn$text, "[0-9]{8}")
#[[1]]
#[1] "37121200" "20123456" "88888888"
To deal with that, you can specify pattern requirements. For example, the first digit of the number is exactly 2, and the other 7 digits are between 0 and 9.
stringr::str_extract_all(tn$text, "2[0-9]{7}")
#[[1]]
#[1] "21200022" "20123456"
If you want to match a certain count of digits or more, use the plus symbol, which means one or more of the preceding element. In this case, 7 digits followed by one or more digits matches any run of 8 digits or more.
stringr::str_extract_all(tn$text, "[0-9]{7}[0-9]+")
#[[1]]
#[1] "37121200022" "20123456" "88888888"
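The same "8 digits or more" rule can also be written with an open-ended quantifier, {8,}. A short sketch that checks both forms against the example text (copied into x so the snippet runs on its own):

```r
library(stringr)

x <- "Text that contains phone numbers like +37121200022 and 20123456 or something like 88888888 that is not."

# 7 digits followed by one or more digits
with_plus <- str_extract_all(x, "[0-9]{7}[0-9]+")[[1]]

# {8,} means 8 or more repetitions of the preceding element
with_range <- str_extract_all(x, "[0-9]{8,}")[[1]]

identical(with_plus, with_range)
#[1] TRUE
```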
If you have multiple variations of phone numbers, it is possible to look for multiple patterns at once by joining them with the | (or) symbol.
p <- paste(c("2[0-9]{7}", "3712[0-9]{7}"), collapse = "|")
stringr::str_extract_all(tn$text, p)
#[[1]]
#[1] "37121200022" "20123456"
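To store the match in a separate data frame column, one option is str_extract, which returns one value per row. This is a sketch only: the data frame calls and its contents are hypothetical, and it assumes at most one phone number per row.

```r
library(stringr)

# Hypothetical data frame with one text per row
calls <- data.frame(text = c("Call +37121200022 today",
                             "Office: 20123456",
                             "No number here"))

# Same combined pattern as above
p <- paste(c("2[0-9]{7}", "3712[0-9]{7}"), collapse = "|")

# str_extract() returns the first match per row, or NA when nothing matches
calls$phone <- str_extract(calls$text, p)
calls$phone
#[1] "37121200022" "20123456"    NA
```

When a row can hold several numbers, str_extract_all followed by unnest, as shown earlier for IDs, is the better fit.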
If the phone numbers follow a certain pattern of digits, but there might be different symbols between the groups, you can describe the allowed symbols in square brackets. In this case, a dot or a dash is allowed between the groups of digits.
tn <- data.frame("text" = "Text that contains phone numbers like 371.212.000 and 111-201-234 or something like 88888888 that is not.")
stringr::str_extract_all(tn$text, "\\d{3}[.-]\\d{3}[.-]\\d{3}")
#[[1]]
#[1] "371.212.000" "111-201-234"
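Once the numbers are extracted, it may help to bring them to one format before comparing them. A minimal sketch, assuming that dropping the separators is what you need:

```r
library(stringr)

x <- "Text that contains phone numbers like 371.212.000 and 111-201-234 or something like 88888888 that is not."

phones <- str_extract_all(x, "\\d{3}[.-]\\d{3}[.-]\\d{3}")[[1]]

# Remove every dot or dash so that only the digits remain
digits_only <- str_replace_all(phones, "[.-]", "")
digits_only
#[1] "371212000" "111201234"
```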
Extract decimal numbers from text in R
It is not an easy task because the count of digits on either side of the decimal separator varies. You can put the decimal separator directly into the regular expression, but a dot has to be escaped (\\. in an R string) because an unescaped dot matches any character. By using the plus symbol in regex, you can get one digit or more where it is needed.
tn <- data.frame("text" = "Text contains numbers like 10, 78, 121 and decimal numbers like 10.8, 78.99, and 121.0177.")
stringr::str_extract_all(tn$text, "[0-9]+\\.[0-9]+")
#[[1]]
#[1] "10.8" "78.99" "121.0177"
Alternatively, you can use \\d, which is shorthand for any digit. For example, \\d\\d matches every pair of consecutive digits.
stringr::str_extract_all(tn$text, "\\d\\d")
#[[1]]
#[1] "10" "78" "12" "10" "78" "99" "12" "01" "77"
Written with \\d, the earlier decimal number pattern looks like this.
stringr::str_extract_all(tn$text, "\\d+\\.\\d+")
#[[1]]
#[1] "10.8" "78.99" "121.0177"
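If whole numbers and decimal numbers are both needed, one possible variation is to make the decimal part optional with a group and a question mark. The sketch below also converts the result to numeric values. The pattern is an assumption about what counts as a number here and does not handle signs or thousands separators.

```r
library(stringr)

x <- "Text contains numbers like 10, 78, 121 and decimal numbers like 10.8, 78.99, and 121.0177."

# (\\.\\d+)? makes the dot and the digits after it optional
found <- str_extract_all(x, "\\d+(\\.\\d+)?")[[1]]
found
#[1] "10"       "78"       "121"      "10.8"     "78.99"    "121.0177"

# Extracted values are character strings, so convert them before calculating
values <- as.numeric(found)
```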