Part 9 String manipulation with stringr

The stringr package provides tools for string manipulation.
All functions in stringr start with str_ and take a vector of strings as the first argument.

We will show a few useful functions here. For a complete list of stringr functions, have a look at the cheat sheet, which also provides guidance on how to work with regular expressions.



Let’s take a simple character vector and a small tibble as examples:

library(tidyverse) # loads stringr, tibble and dplyr (skip if already loaded)

examplestring <- c("genomics", "proteomics", "proteome", "transcriptomics", "metagenomics", "metabolomics")

exampletibble <- tibble(day=c("day0", "day1", "day2"),
                    temperature=c("25C", "27C", "24Celsius"))

  • str_detect: detects the presence or absence of a pattern in a string.
str_detect(examplestring, 
            pattern="genom")
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE

You can use regular expressions: as a simple example, here we want to detect which elements of examplestring start with genom. The ^ symbol anchors the pattern to the start of the string.

str_detect(examplestring, 
            pattern="^genom")
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

You can reverse the search and get TRUE for the elements where the pattern is NOT found by setting negate=TRUE:

str_detect(examplestring, 
            pattern="genom",
           negate=TRUE)
## [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE
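
As a further illustration of regular expressions with str_detect, the sketch below uses the end-of-string anchor $ and an alternation group. The variable names are made up for this example, and str_subset (a stringr function that returns the matching elements rather than a logical vector) is an addition that is not used elsewhere in this part.

```r
library(stringr)

examplestring <- c("genomics", "proteomics", "proteome",
                   "transcriptomics", "metagenomics", "metabolomics")

# "$" anchors the pattern to the end of the string
ends_with_omics <- str_detect(examplestring, pattern = "omics$")

# "^(meta|prote)" matches strings that start with either "meta" or "prote"
starts_meta_or_prote <- str_detect(examplestring, pattern = "^(meta|prote)")

# str_subset() keeps the matching elements instead of returning TRUE/FALSE
matching_elements <- str_subset(examplestring, pattern = "^(meta|prote)")
matching_elements
```

Only "proteome" fails the first test, because it does not end with "omics".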

  • str_length: returns the length (number of characters) of each element of a vector.
str_length(examplestring)
## [1]  8 10  8 15 12 12

  • str_replace: looks for a pattern in a string and replaces it. Only the first match in each string is replaced; str_replace_all replaces every match.

We can replace “omics” with “ome”

str_replace(examplestring, 
            pattern="omics", 
            "ome")
## [1] "genome"        "proteome"      "proteome"      "transcriptome" "metagenome"   
## [6] "metabolome"
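
The difference between replacing the first match and replacing all matches is easiest to see on a string that contains the pattern more than once. This is a minimal sketch; the variable names and the choice of replacing "o" with "0" are purely illustrative.

```r
library(stringr)

x <- "metabolomics"

# str_replace() only touches the first "o"
first_only <- str_replace(x, pattern = "o", replacement = "0")

# str_replace_all() replaces every "o"
all_matches <- str_replace_all(x, pattern = "o", replacement = "0")

first_only
all_matches
```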

str_replace can be used to remove selected patterns from strings:

str_replace(examplestring, 
            pattern="omics", 
            "")
## [1] "gen"        "prote"      "proteome"   "transcript" "metagen"    "metabol"
# str_remove is a wrapper for the same thing (no need for the 3rd argument)
str_remove(examplestring, 
            pattern="omics")
## [1] "gen"        "prote"      "proteome"   "transcript" "metagen"    "metabol"

Same with a tibble’s column:

str_remove(exampletibble$day, 
            pattern="day")
## [1] "0" "1" "2"

You can use it inside another tidyverse function:

mutate(exampletibble, day=str_remove(day, pattern="day"))
## # A tibble: 3 x 2
##   day   temperature
##   <chr> <chr>      
## 1 0     25C        
## 2 1     27C        
## 3 2     24Celsius

  • str_count: counts the number of occurrences of a pattern in each string.

Count how many times “omics” is found in each element:

str_count(examplestring, 
            pattern="omics")
## [1] 1 1 0 1 1 1

Count how many vowels (here including “y”) are found in each element:

str_count(examplestring, 
            pattern="[aeiouy]")
## [1] 3 4 4 4 5 5

  • str_sub: extracts or replaces substrings of a character vector, based on character positions.
str_sub(examplestring, 
            start=1, # position of the first character
            end=10) # position of the last character
## [1] "genomics"   "proteomics" "proteome"   "transcript" "metagenomi" "metabolomi"
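
str_sub also accepts negative positions, which count from the end of the string, and it has a replacement form for overwriting part of a string. The sketch below shows both on a short, made-up vector x.

```r
library(stringr)

x <- c("genomics", "proteome")

# Negative positions count backwards from the last character
last_three <- str_sub(x, start = -3, end = -1)

# Replacement form: overwrite the first character with its upper-case version
str_sub(x, start = 1, end = 1) <- str_to_upper(str_sub(x, start = 1, end = 1))

last_three
x
```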

Let’s keep the first 2 characters of the temperature column of our exampletibble:

str_sub(exampletibble$temperature, 
        start=1, 
        end=2)
## [1] "25" "27" "24"

Within mutate:

mutate(exampletibble, temperature=str_sub(temperature, start=1, end=2))
## # A tibble: 3 x 2
##   day   temperature
##   <chr> <chr>      
## 1 day0  25         
## 2 day1  27         
## 3 day2  24
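
Extracting by position works here because every temperature has exactly two digits. When the number of digits can vary, one possible alternative is to extract the digits with a regular expression using str_extract. The vector below is hypothetical: the values "8C" and "101C" are not part of exampletibble and are only added to show the difference.

```r
library(stringr)

# Hypothetical temperatures with 1, 2 and 3 digits
temperature <- c("25C", "27C", "24Celsius", "8C", "101C")

# Position-based extraction breaks when the number of digits changes
by_position <- str_sub(temperature, start = 1, end = 2)

# "\\d+" matches the first run of one or more digits in each string
by_pattern <- str_extract(temperature, pattern = "\\d+")

# Convert to numbers so the values can be used in calculations
temp_num <- as.numeric(by_pattern)

by_position
temp_num
```

The same str_extract call can be used inside mutate, exactly like str_sub above.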

HANDS-ON

We will play with the following character vector:

countries <- c("Germany", "Uganda", "Canada", "Australia", "Switzerland", "Thailand", "Bolivia", "Russia", "Italy", "Senegal", "South Korea", "Mexico", "Argentina", "England")
  • What is the average length of the country names?
  • How many country names end with an “a”?
  • Replace spaces with underscores in countries.
  • Which country name contains the letter “a” the most times?
Answer
# What is the average length of the country names?
mean(str_length(countries))

# How many country names end with an "a"?
  # Get the logical vector
str_detect(countries, "a$")
  # Retrieve only the "TRUE" and count
length(which(str_detect(countries, "a$")))
length(countries[str_detect(countries, "a$")])

# Replace spaces with underscores in `countries`.
  # str_replace_all replaces every space; str_replace would only replace the first one in each name
str_replace_all(countries, " ", "_")

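# Which country name contains the letter "a" the most times?
  # count how many "a" per country name (case-sensitive: "A" is not counted)
str_count(countries, "a")
  # extract the country name with the highest count:
  # which.max() returns the position of the maximum, whereas max() returns its value
countries[which.max(str_count(countries, "a"))]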
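
For the last question, the country has to be selected by the position of the largest count, not by the count itself. The sketch below illustrates this and also shows one way to handle ties and upper-case letters. The variable names are illustrative, and counting "A" together with "a" is an extension of the exercise, not part of the original question.

```r
library(stringr)

countries <- c("Germany", "Uganda", "Canada", "Australia", "Switzerland",
               "Thailand", "Bolivia", "Russia", "Italy", "Senegal",
               "South Korea", "Mexico", "Argentina", "England")

# Case-sensitive count of lower-case "a"
a_counts <- str_count(countries, "a")

# which.max() gives the position of the first maximum
most_a <- countries[which.max(a_counts)]

# Case-insensitive count: convert to lower case first
a_counts_ci <- str_count(str_to_lower(countries), "a")

# Comparing against the maximum keeps all tied names, not just the first
most_a_ci <- countries[a_counts_ci == max(a_counts_ci)]

most_a
most_a_ci
```

With the case-insensitive count, "Canada" and "Australia" are tied, which which.max alone would not reveal.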