Part 9 String manipulation with stringr
The stringr package provides tools for string manipulation.
All functions in stringr start with str_ and take a vector of strings as the first argument.
We will show a few useful functions here (for a complete list of stringr functions, have a look at the Cheat sheet, which also provides guidance on how to work with regular expressions).
Let’s take a simple character vector and a small tibble as examples:
# stringr, tibble and dplyr are all part of the tidyverse
library(tidyverse)
examplestring <- c("genomics", "proteomics", "proteome", "transcriptomics", "metagenomics", "metabolomics")
exampletibble <- tibble(day=c("day0", "day1", "day2"),
                        temperature=c("25C", "27C", "24Celsius"))
str_detect: detects the presence or absence of a pattern in a string.
str_detect(examplestring,
pattern="genom")
## [1] TRUE FALSE FALSE FALSE TRUE FALSE
You can use regular expressions: as a simple example, here we want to detect which elements of examplestring start with genom. The ^ anchor matches the start of the string.
str_detect(examplestring,
pattern="^genom")
## [1] TRUE FALSE FALSE FALSE FALSE FALSE
You can reverse the search and output the elements where the pattern is NOT found with negate=TRUE:
str_detect(examplestring,
pattern="genom",
negate=TRUE)
## [1] FALSE TRUE TRUE TRUE FALSE TRUE
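The logical vector returned by str_detect is most often used to select elements. The sketch below also uses the `$` anchor, which matches the end of a string, and str_subset, which combines detection and subsetting in one call. The example vector is redefined so that the snippet runs on its own, and the result names (ends_with_omics, starts_with_meta, no_genom) are only illustrative.

```r
library(stringr)

examplestring <- c("genomics", "proteomics", "proteome",
                   "transcriptomics", "metagenomics", "metabolomics")

# "$" anchors the pattern at the end of the string
ends_with_omics <- str_detect(examplestring, pattern = "omics$")

# Use the logical vector to keep only the matching elements
starts_with_meta <- examplestring[str_detect(examplestring, pattern = "^meta")]

# str_subset() detects and subsets in one step; negate = TRUE keeps the non-matches
no_genom <- str_subset(examplestring, pattern = "genom", negate = TRUE)
```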
str_length: outputs the length (number of characters) of each element of a vector.
str_length(examplestring)
## [1] 8 10 8 15 12 12
str_replace: looks for a pattern in a string and replaces its first occurrence.
We can replace “omics” with “ome”:
str_replace(examplestring,
pattern="omics",
"ome")
## [1] "genome" "proteome" "proteome" "transcriptome" "metagenome"
## [6] "metabolome"
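Note that str_replace only changes the first match in each string. When every match should be changed, stringr offers str_replace_all. A minimal sketch of the difference, using a single word chosen for illustration:

```r
library(stringr)

x <- "metabolomics"

# Only the first "o" is replaced
first_only <- str_replace(x, pattern = "o", replacement = "0")
# first_only is "metab0lomics"

# Every "o" is replaced
every_match <- str_replace_all(x, pattern = "o", replacement = "0")
# every_match is "metab0l0mics"
```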
str_replace can also be used to remove selected patterns from strings, by replacing them with an empty string:
str_replace(examplestring,
pattern="omics",
"")
## [1] "gen" "prote" "proteome" "transcript" "metagen" "metabol"
# str_remove is a wrapper for the same thing (no need for the 3rd argument)
str_remove(examplestring,
pattern="omics")
## [1] "gen" "prote" "proteome" "transcript" "metagen" "metabol"
Same with a tibble’s column:
str_remove(exampletibble$day,
pattern="day")
## [1] "0" "1" "2"
You can use it inside another tidyverse function:
mutate(exampletibble, day=str_remove(day, pattern="day"))
## # A tibble: 3 x 2
## day temperature
## <chr> <chr>
## 1 0 25C
## 2 1 27C
## 3 2 24Celsius
str_count: counts the number of occurrences of a pattern.
Count how many times “omics” is found in each element:
str_count(examplestring,
pattern="omics")
## [1] 1 1 0 1 1 1
Count how many vowels are found in each element:
str_count(examplestring,
pattern="[aeiouy]")
## [1] 3 4 4 4 5 5
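Patterns are case sensitive by default, so "a" and "A" are counted separately. One way to ignore case is to wrap the pattern in stringr's regex() helper with ignore_case = TRUE. The word used here is just an illustrative example, not part of the course data.

```r
library(stringr)

snake <- "Anaconda"

# Case-sensitive: only the two lowercase "a" are counted
lower_only <- str_count(snake, pattern = "a")

# Case-insensitive: the leading "A" is counted as well
any_case <- str_count(snake, pattern = regex("a", ignore_case = TRUE))
```

Here lower_only is 2 and any_case is 3.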
str_sub: extracts and replaces substrings from a character vector.
str_sub(examplestring,
start=1, # position of the first character
end=10) # position of the last character
## [1] "genomics" "proteomics" "proteome" "transcript" "metagenomi" "metabolomi"
Let’s keep the first 2 characters of the temperature column of our exampletibble:
str_sub(exampletibble$temperature,
start=1,
end=2)
## [1] "25" "27" "24"
Within mutate:
mutate(exampletibble, temperature=str_sub(temperature, start=1, end=2))
## # A tibble: 3 x 2
## day temperature
## <chr> <chr>
## 1 day0 25
## 2 day1 27
## 3 day2 24
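The examples above only extract substrings. As a sketch of the other features of str_sub: negative positions count from the end of the string, and the assignment form replaces the selected substring in place. The last lines convert the extracted temperatures to numbers, assuming the value always occupies the first two characters. The variable names are illustrative.

```r
library(stringr)

x <- c("genomics", "proteome")

# Negative positions count backwards from the last character
last_three <- str_sub(x, start = -3, end = -1)
# last_three is "ics" "ome"

# Assignment form: replace the first character with its upper-case version
str_sub(x, start = 1, end = 1) <- str_to_upper(str_sub(x, start = 1, end = 1))
# x is now "Genomics" "Proteome"

# Extracted substrings are still character; convert them to use them as numbers
temperature <- c("25C", "27C", "24Celsius")
temp_num <- as.numeric(str_sub(temperature, start = 1, end = 2))
```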
HANDS-ON
We will play with the following character vector:
countries <- c("Germany", "Uganda", "Canada", "Australia", "Switzerland", "Thailand", "Bolivia", "Russia", "Italy", "Senegal", "South Korea", "Mexico", "Argentina", "England")
- What is the average length of the country names?
- How many country names end with an “a”?
- Replace empty spaces with underscores in countries.
- In which country name does the letter “a” appear most often?
Answer
# What is the average length of the country names?
mean(str_length(countries))
# How many country names end with an "a"?
# Get the logical vector
str_detect(countries, "a$")
# Retrieve only the "TRUE" and count
length(which(str_detect(countries, "a$")))
length(countries[str_detect(countries, "a$")])
# Replace empty spaces with underscores in `countries`.
# str_replace only changes the first space of each name; str_replace_all changes them all
str_replace_all(countries, " ", "_")
# In which country name does the letter "a" appear most often?
# count how many "a" per country name
str_count(countries, "a")
# extract the country name(s) with the highest count
countries[str_count(countries, "a") == max(str_count(countries, "a"))]
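A more compact variant of two of these answers, offered as one possible alternative: sum() counts the TRUE values of a logical vector directly, and which.max() returns the position of the first maximum. The countries vector is repeated so that the snippet runs on its own, and the result names are illustrative.

```r
library(stringr)

countries <- c("Germany", "Uganda", "Canada", "Australia", "Switzerland",
               "Thailand", "Bolivia", "Russia", "Italy", "Senegal",
               "South Korea", "Mexico", "Argentina", "England")

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of matches
n_ending_in_a <- sum(str_detect(countries, "a$"))

# which.max() gives the index of the first element with the highest count
most_a <- countries[which.max(str_count(countries, "a"))]
```

Unlike the comparison with max(), which.max() returns a single name even when several names tie for the highest count.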