Remove or replace unwanted characters in R


Messy text often contains characters that you don’t want to export to CSV or text files, such as tabs, line breaks, or carriage returns. They can prevent those files from loading properly. Here is how to remove or replace such characters in R.

 

Nonprintable characters are hard to detect if you don’t know what to expect. In my experience, they usually show up only when something breaks, but you can find them once you know what to search for.
Here is my text, which contains a tab and a line break.

This	is my
text.

 

The problem is that you usually can’t type nonprintable characters directly into R code. One solution is to refer to them by their Unicode number. For example, to check whether my text contains a newline character (U+000A), I can use its Unicode escape in the grepl function like this.

text <- "This is my
text."

grepl("\u000A", text)
#[1] TRUE
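
The same idea extends to a whole character vector when you don't know in advance which nonprintable character to look for. Here is a minimal sketch that uses the POSIX class [[:cntrl:]] to flag any element containing a control character. The vector x is a made-up example, not data from this post.

```r
# Hypothetical example vector; the second and third elements contain control characters.
x <- c("clean text", "tab\there", "line\nbreak")

# [[:cntrl:]] matches control characters such as tab, newline, and carriage return.
has_control <- grepl("[[:cntrl:]]", x)
has_control
#[1] FALSE  TRUE  TRUE
```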

To find a Unicode number, you can use https://unicode-table.com/ or another resource.
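
You can also look up the numbers without leaving R. The sketch below, which assumes a sample string of my own, uses the base function utf8ToInt to list the code points of the control characters in a text.

```r
text <- "This\tis my\ntext."

# utf8ToInt() returns the Unicode code point of every character in a single string.
code_points <- utf8ToInt(text)

# Keep only the control characters (code points below 32) and format them as U+XXXX.
control_points <- unique(code_points[code_points < 32])
sprintf("U+%04X", control_points)
#[1] "U+0009" "U+000A"
```

The four hex digits after U+ are the ones to put after \u in the pattern.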

If I want to remove or replace a newline character (line break) in R, I can use the gsub function. With fixed = TRUE, the pattern is matched as a literal string rather than as a regular expression.

gsub('\u000A', " ", text, fixed = TRUE)

 

Here are examples for other characters that commonly cause trouble.

Replace tab.

gsub('\u0009', " ", text, fixed = TRUE)

Replace double quotation mark.

gsub('\u0022', " ", text, fixed = TRUE)

Replace carriage return.

gsub('\u000D', " ", text, fixed = TRUE)
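
Text that comes from Windows often ends lines with a carriage return followed by a newline. Replacing the two characters separately would leave two spaces, so here is a small sketch, with a sample string of my own, that treats the pair as one unit. It uses a regular expression, so fixed = TRUE is left out.

```r
text_crlf <- "first line\r\nsecond line"

# Match \r\n as one unit, and also any lone \r or \n.
gsub("\r\n|\r|\n", " ", text_crlf)
#[1] "first line second line"
```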

Remove or replace multiple characters at once in R

If you want to replace multiple characters at once, I recommend str_replace_all from the stringr package. You pass it a named vector in which each name is a pattern and each value is its replacement. Let’s say I want to replace the newline character and the tab simultaneously.

text <- "This	is my
text."

stringr::str_replace_all(text, c("\u000A" = " ", "\u0009" = " "))
#[1] "This is my text."
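
If the text lives in a data frame that you are about to export, the same replacement can be applied to every character column in base R. This is only a sketch: the data frame df and the helper clean_text are hypothetical names that I made up for illustration.

```r
# Hypothetical data frame with a messy character column.
df <- data.frame(
  id = 1:2,
  comment = c("first\tnote", "other\nnote"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each run of tabs, carriage returns,
# or newlines with a single space.
clean_text <- function(x) gsub("[\t\r\n]+", " ", x)

# Apply the helper to character columns only; other columns stay untouched.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], clean_text)

df$comment
#[1] "first note" "other note"
```

After that, the data frame can be written out with write.csv or a similar function.
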

Remove trailing whitespace in R
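
For whitespace at the end of a string, one option is the base R function trimws. Here is a minimal sketch with a sample string of my own.

```r
x <- "  John Doe was here.   "

# which = "right" removes trailing whitespace only.
trimws(x, which = "right")
#[1] "  John Doe was here."

# The default, which = "both", removes leading and trailing whitespace.
trimws(x)
#[1] "John Doe was here."
```
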
Remove whitespace from the string before punctuation in R

Sometimes extra whitespace appears before a punctuation mark or between words, but you can easily remove it using rm_white from the qdapRegex package. Here is an example with consecutive whitespace between words and whitespace before the final period.

qdapRegex::rm_white("John   Doe was here .")
#[1] "John Doe was here."
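
If you would rather not install another package, a similar result can be approximated in base R. This sketch is my own rough imitation for this example, not the implementation of rm_white, and the set of punctuation marks in it is an assumption.

```r
x <- "John   Doe was here ."

# Collapse runs of whitespace into a single space.
x <- gsub("\\s+", " ", x)
# Drop whitespace that sits directly before common punctuation marks.
x <- gsub("\\s+([,;:.?!])", "\\1", x)
# Trim leading and trailing whitespace.
x <- trimws(x)
x
#[1] "John Doe was here."
```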

 

Thank you for reading this and check out this other post on how to modify or create RStudio code snippets.




