Part 9 String manipulation with stringr
The stringr package provides tools for string manipulation.
All functions in stringr start with str_ and take a vector of strings as the first argument.
We show a few useful functions here. For a complete list of stringr functions, have a look at the stringr cheat sheet, which also provides guidance on how to work with regular expressions.
Let’s take a simple character vector and a small tibble as examples:
library(tidyverse) # loads stringr, tibble and dplyr
examplestring <- c("genomics", "proteomics", "proteome", "transcriptomics", "metagenomics", "metabolomics")
exampletibble <- tibble(day=c("day0", "day1", "day2"),
                        temperature=c("25C", "27C", "24Celsius"))
str_detect: detects the presence or absence of a pattern in a string.
str_detect(examplestring,
           pattern="genom")
## [1] TRUE FALSE FALSE FALSE TRUE FALSE
You can use regular expressions: as a simple example, here we want to detect which elements of examplestring start with genom (the ^ anchors the pattern to the start of the string).
str_detect(examplestring,
           pattern="^genom")
## [1] TRUE FALSE FALSE FALSE FALSE FALSE
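The same idea extends to other regular expression building blocks. The following is a minimal sketch, reusing examplestring, that anchors a pattern to the end of the string with $ and combines two alternatives with |. The variable names ends_with_omics and prote_or_meta are only illustrative.

```r
library(stringr)

examplestring <- c("genomics", "proteomics", "proteome",
                   "transcriptomics", "metagenomics", "metabolomics")

# "$" anchors the pattern to the end of the string
ends_with_omics <- str_detect(examplestring, pattern = "omics$")
ends_with_omics

# "|" means "or": elements that start with "prote" or with "meta"
prote_or_meta <- str_detect(examplestring, pattern = "^(prote|meta)")
prote_or_meta
```

Only "proteome" fails the first test, and only "genomics" and "transcriptomics" fail the second.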
You can reverse the search and output elements where the pattern is NOT found with negate=TRUE
str_detect(examplestring,
           pattern="genom",
           negate=TRUE)
## [1] FALSE TRUE TRUE TRUE FALSE TRUE
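The logical vector returned by str_detect is often used to keep only the matching elements. As a small sketch (the names with_genom, with_genom2 and without_genom are illustrative), you can either index the vector yourself or use str_subset, which does the same in one step and also accepts negate:

```r
library(stringr)

examplestring <- c("genomics", "proteomics", "proteome",
                   "transcriptomics", "metagenomics", "metabolomics")

# Logical indexing with the result of str_detect
with_genom <- examplestring[str_detect(examplestring, pattern = "genom")]

# str_subset returns the matching elements directly
with_genom2 <- str_subset(examplestring, pattern = "genom")

# negate = TRUE keeps the elements that do NOT match
without_genom <- str_subset(examplestring, pattern = "genom", negate = TRUE)

with_genom
without_genom
```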
str_length: returns the length (number of characters) of each element of a vector.
str_length(examplestring)
## [1] 8 10 8 15 12 12
str_replace: looks for a pattern in each string and replaces the first match.
We can replace “omics” with “ome”:
str_replace(examplestring,
            pattern="omics",
            replacement="ome")
## [1] "genome" "proteome" "proteome" "transcriptome" "metagenome"
## [6] "metabolome"
str_replace can be used to remove selected patterns from strings:
str_replace(examplestring,
            pattern="omics",
            "")
## [1] "gen" "prote" "proteome" "transcript" "metagen" "metabol"
# str_remove is a wrapper for the same thing (no need for the 3rd argument)
str_remove(examplestring,
           pattern="omics")
## [1] "gen" "prote" "proteome" "transcript" "metagen" "metabol"
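Note that str_replace and str_remove only act on the first match in each element. When a pattern can occur several times, str_replace_all and str_remove_all handle every match. A short sketch with a made-up vector x (not part of the examples above):

```r
library(stringr)

# Hypothetical vector in which the pattern occurs more than once per element
x <- c("banana", "genomics and proteomics")

# Only the first match in each element is replaced
first_only <- str_replace(x, pattern = "a", replacement = "A")

# Every match is replaced
all_matches <- str_replace_all(x, pattern = "a", replacement = "A")

# Same distinction for removal
removed_all <- str_remove_all(x, pattern = "omics")

first_only
all_matches
removed_all
```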
Same with a tibble’s column:
str_remove(exampletibble$day,
           pattern="day")
## [1] "0" "1" "2"
You can use it inside another tidyverse function:
mutate(exampletibble, day=str_remove(day, pattern="day"))
## # A tibble: 3 x 2
##   day   temperature
##   <chr> <chr>
## 1 0     25C
## 2 1     27C
## 3 2     24Celsius
str_count: counts the number of occurrences of a pattern in each element.
Count how many times “omics” is found in each element:
str_count(examplestring,
          pattern="omics")
## [1] 1 1 0 1 1 1
Count how many vowels are found in each element:
str_count(examplestring,
          pattern="[aeiouy]")
## [1] 3 4 4 4 5 5
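Patterns are case-sensitive regular expressions by default, which matters when counting. The sketch below uses two made-up vectors (capitalised and dotted) to show the stringr helpers regex(), to ignore case, and fixed(), to match a character literally instead of as a regular expression:

```r
library(stringr)

# Hypothetical vector with upper- and lower-case "o"
capitalised <- c("Genomics", "Omics", "Proteomics")

# Default: only lower-case "o" is counted
count_lower <- str_count(capitalised, pattern = "o")

# regex(ignore_case = TRUE) counts "o" and "O"
count_any_case <- str_count(capitalised, pattern = regex("o", ignore_case = TRUE))

# In a regular expression "." matches any character;
# fixed(".") matches a literal dot only
dotted <- "a.b.c"
count_regex_dot <- str_count(dotted, pattern = ".")
count_literal_dot <- str_count(dotted, pattern = fixed("."))

count_lower
count_any_case
count_regex_dot
count_literal_dot
```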
str_sub: extracts or replaces substrings in a character vector, based on character positions.
str_sub(examplestring,
        start=1, # position of the first character
        end=10) # position of the last character
## [1] "genomics" "proteomics" "proteome" "transcript" "metagenomi" "metabolomi"
Let’s keep the first 2 characters of the temperature column of our exampletibble:
str_sub(exampletibble$temperature,
        start=1,
        end=2)
## [1] "25" "27" "24"
Within mutate:
mutate(exampletibble, temperature=str_sub(temperature, start=1, end=2))
## # A tibble: 3 x 2
##   day   temperature
##   <chr> <chr>
## 1 day0  25
## 2 day1  27
## 3 day2  24
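str_sub also accepts negative positions, which count from the end of the string, and it has an assignment form for replacing a substring in place. A minimal sketch, using plain vectors named temperature and labels (copies of the tibble columns, introduced here only for illustration):

```r
library(stringr)

temperature <- c("25C", "27C", "24Celsius")
labels <- c("day0", "day1", "day2")

# Negative positions count from the end: -1 is the last character
last_char <- str_sub(temperature, start = -1)

# Assignment form: replace characters 1 to 3 of each element with "D"
str_sub(labels, start = 1, end = 3) <- "D"

last_char
labels
```

Position-based extraction assumes that the part of interest always sits at the same place in the string. Here that holds because every temperature has exactly two digits.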
HANDS-ON
We will play with the following character vector:
countries <- c("Germany", "Uganda", "Canada", "Australia", "Switzerland", "Thailand", "Bolivia", "Russia", "Italy", "Senegal", "South Korea", "Mexico", "Argentina", "England")
- What is the average length of the country names?
- How many country names end with an “a”?
- Replace empty spaces with underscores in countries.
- Which country name contains the letter “a” most often?
Answer
# What is the average length of the country names?
mean(str_length(countries))
# How many country names end with an "a"?
# Get the logical vector
str_detect(countries, "a$")
# Retrieve only the "TRUE" and count
length(which(str_detect(countries, "a$")))
length(countries[str_detect(countries, "a$")])
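# Replace empty spaces with underscores in `countries`.
# str_replace changes only the first space in each element, which is enough here;
# str_replace_all would change every space.
str_replace(countries, " ", "_")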
# Which country name contains the letter "a" most often?
# count how many "a" per country name
str_count(countries, "a")
# extract the country name with the most "a":
# which.max gives the position of the highest count
countries[which.max(str_count(countries, "a"))]
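As a possible variation on these answers, the sketch below counts matches with sum() (TRUE counts as 1) and compares each count against the maximum, so that every country is returned if several share the highest count. The names n_ending_a, a_counts and most_a are illustrative.

```r
library(stringr)

countries <- c("Germany", "Uganda", "Canada", "Australia", "Switzerland",
               "Thailand", "Bolivia", "Russia", "Italy", "Senegal",
               "South Korea", "Mexico", "Argentina", "England")

# Summing a logical vector counts the TRUE values
n_ending_a <- sum(str_detect(countries, "a$"))

# Number of lower-case "a" per country name
a_counts <- str_count(countries, "a")

# Keep every country whose count equals the maximum (handles ties)
most_a <- countries[a_counts == max(a_counts)]

n_ending_a
most_a
```

The pattern "a" is case-sensitive, so the capital "A" in "Australia" and "Argentina" is not counted.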