Part 9 String manipulation with `stringr`

The stringr package provides tools for string manipulation.
All functions in stringr start with str_ and take a vector of strings as the first argument.

We will show a few useful functions here (for a complete list of stringr functions, have a look at the cheat sheet). The cheat sheet also provides guidance on how to work with regular expressions.

Let’s take a simple character vector and a small tibble as examples:

library(tidyverse) # loads stringr, tibble and dplyr

examplestring <- c("genomics", "proteomics", "proteome", "transcriptomics", "metagenomics", "metabolomics")

exampletibble <- tibble(day=c("day0", "day1", "day2"),
                    temperature=c("25C", "27C", "24Celsius"))

str_detect: detects the presence or absence of a pattern in a string.

str_detect(examplestring, 
            pattern="genom")

## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE

You can use regular expressions: as a simple example, here we want to detect which elements of examplestring start with genom (the ^ anchor matches the beginning of a string).

str_detect(examplestring, 
            pattern="^genom")

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE
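The mirror image of ^ is the $ anchor, which matches the end of a string. A minimal sketch, redefining examplestring so that the block runs on its own; the variable name ends_with_omics is only illustrative:

```r
library(stringr)

examplestring <- c("genomics", "proteomics", "proteome",
                   "transcriptomics", "metagenomics", "metabolomics")

# TRUE for every element whose last five characters are "omics"
ends_with_omics <- str_detect(examplestring, pattern = "omics$")
ends_with_omics
```

Only "proteome" should come back as FALSE, since it is the only element that does not end with "omics".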

You can reverse the search and return TRUE for the elements where the pattern is NOT found with negate=TRUE:

str_detect(examplestring, 
            pattern="genom",
           negate=TRUE)

## [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE
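Because str_detect returns a logical vector, it can be used directly to subset the original vector. A minimal sketch, assuming the same examplestring as above (redefined here so the block is self-contained); with_genom and without_genom are illustrative names:

```r
library(stringr)

examplestring <- c("genomics", "proteomics", "proteome",
                   "transcriptomics", "metagenomics", "metabolomics")

# Keep only the elements that contain "genom"
with_genom <- examplestring[str_detect(examplestring, pattern = "genom")]

# Keep only the elements that do NOT contain "genom"
without_genom <- examplestring[str_detect(examplestring,
                                          pattern = "genom",
                                          negate = TRUE)]

with_genom
without_genom
```

The first result should contain "genomics" and "metagenomics", and the second the four remaining elements.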

str_length: returns the length (number of characters) of each element of a vector.

str_length(examplestring)

## [1]  8 10  8 15 12 12

str_replace: looks for a pattern in a string and replaces its first occurrence.

We can replace “omics” with “ome”:

str_replace(examplestring, 
            pattern="omics", 
            "ome")

## [1] "genome"        "proteome"      "proteome"      "transcriptome" "metagenome"   
## [6] "metabolome"
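Note that str_replace only touches the first match in each string; stringr also provides str_replace_all to replace every match. A small sketch of the difference, using a single string and illustrative variable names:

```r
library(stringr)

# "metabolomics" contains two "o"
first_only  <- str_replace("metabolomics", pattern = "o", replacement = "0")
all_matches <- str_replace_all("metabolomics", pattern = "o", replacement = "0")

first_only   # only the first "o" is replaced
all_matches  # every "o" is replaced
```

This gives "metab0lomics" for str_replace and "metab0l0mics" for str_replace_all.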

str_replace can be used to remove selected patterns from strings:

str_replace(examplestring, 
            pattern="omics", 
            "")

## [1] "gen"        "prote"      "proteome"   "transcript" "metagen"    "metabol"

# str_remove is a wrapper for the same thing (no need for the 3rd argument)
str_remove(examplestring, 
            pattern="omics")

## [1] "gen"        "prote"      "proteome"   "transcript" "metagen"    "metabol"

Same with a tibble’s column:

str_remove(exampletibble$day, 
            pattern="day")

## [1] "0" "1" "2"

You can use it inside another tidyverse function:

mutate(exampletibble, day=str_remove(day, pattern="day"))

## # A tibble: 3 x 2
##   day   temperature
##   <chr> <chr>      
## 1 0     25C        
## 2 1     27C        
## 3 2     24Celsius

str_count: counts the number of occurrences of a pattern.

Count how many times “omics” is found in each element:

str_count(examplestring, 
            pattern="omics")

## [1] 1 1 0 1 1 1

Count how many vowels are found in each element:

str_count(examplestring, 
            pattern="[aeiouy]")

## [1] 3 4 4 4 5 5

str_sub: extracts or replaces substrings of a character vector, based on start and end positions. Strings shorter than the requested range are returned unchanged.

str_sub(examplestring, 
            start=1, # position of the first character
            end=10) # position of the last character

## [1] "genomics"   "proteomics" "proteome"   "transcript" "metagenomi" "metabolomi"
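The example above only extracts. As a hedged sketch of the two other common uses of str_sub: negative positions count from the end of the string, and the assignment form replaces the selected substring. The vector x and the name last_five are made up for this illustration:

```r
library(stringr)

x <- c("genomics", "proteomics")

# Negative positions count from the end of the string:
# here, the last 5 characters of each element
last_five <- str_sub(x, start = -5, end = -1)

# The assignment form replaces the selected substring:
# here, the first character of each element is converted to upper case
str_sub(x, start = 1, end = 1) <- str_to_upper(str_sub(x, start = 1, end = 1))

last_five
x
```

last_five should be "omics" for both elements, and x should now read "Genomics" and "Proteomics".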

Let’s keep the first 2 characters of the temperature column of our exampletibble:

str_sub(exampletibble$temperature, 
        start=1, 
        end=2)

## [1] "25" "27" "24"

Within mutate:

mutate(exampletibble, temperature=str_sub(temperature, start=1, end=2))

## # A tibble: 3 x 2
##   day   temperature
##   <chr> <chr>      
## 1 day0  25         
## 2 day1  27         
## 3 day2  24
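Keeping the first 2 characters works here only because every temperature has exactly two digits. A possible alternative, sketched under the assumption that the unit is always a trailing run of letters, is to remove the unit with str_remove and convert the rest to a number. The value "8C" is a made-up addition to show the difference:

```r
library(stringr)

# "8C" is a hypothetical one-digit value, not part of exampletibble
temperature <- c("25C", "27C", "24Celsius", "8C")

# Positional approach: the one-digit value keeps its unit ("8C")
by_position <- str_sub(temperature, start = 1, end = 2)

# Pattern approach: drop the trailing letters, then convert to numeric
temperature_num <- as.numeric(str_remove(temperature, pattern = "[A-Za-z]+$"))

by_position
temperature_num
```

The pattern-based version returns 25, 27, 24 and 8 as numbers, which can then be used in calculations.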

HANDS-ON

We will play with the following character vector:

countries <- c("Germany", "Uganda", "Canada", "Australia", "Switzerland", "Thailand", "Bolivia", "Russia", "Italy", "Senegal", "South Korea", "Mexico", "Argentina", "England")

What is the average length of the country names?
How many country names end with an “a”?
Replace empty spaces with underscores in countries.
Which country name contains the letter “a” most often?

Answer

# What is the average length of the country names?
mean(str_length(countries))

# How many country names end with an "a"?
  # Get the logical vector
str_detect(countries, "a$")
  # Count the TRUE values (each TRUE counts as 1 when summed)
sum(str_detect(countries, "a$"))
  # Equivalent alternatives
length(which(str_detect(countries, "a$")))
length(countries[str_detect(countries, "a$")])

# Replace empty spaces with underscores in `countries`.
  # str_replace_all replaces every space, not only the first one in each name
str_replace_all(countries, " ", "_")

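# Which country name contains the letter "a" most often?
  # count how many "a" per country name (the pattern is case-sensitive, so "A" is not counted)
str_count(countries, "a")
  # extract the country name with the highest count:
  # which.max() returns the position of the maximum, whereas max() returns the count itself
countries[which.max(str_count(countries, "a"))]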
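As a possible extension of the last answer, the count can be made case-insensitive by wrapping the pattern in stringr's regex() helper, and ties can be kept by comparing every count with the maximum. This is only a sketch; a_counts and most_a are illustrative names:

```r
library(stringr)

countries <- c("Germany", "Uganda", "Canada", "Australia", "Switzerland",
               "Thailand", "Bolivia", "Russia", "Italy", "Senegal",
               "South Korea", "Mexico", "Argentina", "England")

# Case-insensitive count, so the initial "A" of "Australia" and "Argentina" is included
a_counts <- str_count(countries, pattern = regex("a", ignore_case = TRUE))

# Keep every name that reaches the maximum, so that ties are not hidden
most_a <- countries[a_counts == max(a_counts)]
most_a
```

With upper-case letters included, "Canada" and "Australia" are tied with three occurrences each.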
