Reviewing some R basics
Open the RStudio software.
Basics
Objects
Everything that stores any kind of data in R is an object.
- Assignment operators
- <- or =
- Essentially the same but, to avoid confusions:
+ Use <- for assignments
- Keep = for functions arguments
- Assigning a value to the object B:
B <- 10
- Reassigning: modifying the content of an object:
B + 10
B unchanged !!
B <- B + 10
B changed !!
- You can see the objects you created in the upper right panel in RStudio: the environment.
Data types and data structures
Each object has a data type:
- Numeric (number - integer or double)
- Character (text)
- Logical (TRUE / FALSE)
Number:
a <- 10
mode(a)
typeof(a)
str(a)
Text:
b <- "word"
mode(b)
typeof(b)
str(b)
The main data structures in R are:
- Vector
- Factor
- Matrix
- Data frame
Create a vector:
a <- 1:5
Create a second vector, and check with elements of that second vector are also present in a with %in%:
b <- 3:8
b[b %in% a]
Check the length of (=number of elements in) a vector:
length(b)
Create a data frame:
# stringsAsFactors: ensures that characters are treated as characters and not as factors
d <- data.frame(Name=c("Maria", "Juan", "Alba"),
Age=c(23, 25, 31),
Vegetarian=c(TRUE, TRUE, FALSE),
stringsAsFactors = FALSE)
Check dimensions of a dataframe:
# Number of rows
nrow(d)
# Number of columns
ncol(d)
# Dimensions (first element is the number of rows, second element is the number of columns)
dim(d)
Select rows of the data frame if the Age column is superior to 24:
d[d$Age > 24,]
Select rows of the data frame if the Age column is superior to 24 AND if Vegetarian is TRUE :
d[d$Age > 24 & d$Vegetarian == TRUE,]
Paths and directories
- Get the path of the current directory (know where you are working at the moment) with getwd (get working directory):
getwd()
- Change working directory with setwd (set working directory)
Go to a directory giving the absolute path:setwd("~/rnaseq_course")
Go to a directory giving the relative path:
setwd("differential_expression")
You are now in: “~/rnaseq_course/differential_expression”
Move one directory “up” the tree:setwd("..")
You are now in: “~/rnaseq_course”
Missing values
NA (Not Available) is a recognized element in R.
- Finding missing values in a vector
# Create vector
x <- c(4, 2, 7, NA)
# Find missing values in vector:
is.na(x)
# Remove missing values
na.omit(x)
x[ !is.na(x) ]
- Some functions can deal with NAs, either by default, or with specific arguments:
x <- c(4, 2, 7, NA)
# default arguments
mean(x)
# set na.rm=TRUE
mean(x, na.rm=TRUE)
- In a matrix or a data frame, keep only rows where there are no NA values:
# Create matrix with some NA values
mydata <- matrix(c(1:10, NA, 12:2, NA, 15:20, NA), ncol=3)
# Keep only rows without NAs
mydata[complete.cases(mydata), ]
# or
na.omit(mydata)
Check this R blogger post on missing/null values
Read in, write out
On vectors
- Read a file as a vector with the scan function
# Read in file
scan(file="file.txt")
# Save in object
k <- scan(file="file.txt")
By default, scans “double” (numeric) elements: it fails if the input contains characters.
If non-numeric, you need to specify the type of data contained in the file:
# specify the type of data to scan
scan(file="file.txt",
what="character")
scan(file="~/file.txt",
what="character")
Regarding paths of files:
If the file is not in the current directory, you can provide a full or relative path. For example, if located in the home directory, read it as:
scan(file="~/file.txt",
what="character")
- Write the content of a vector in a file:
# create a vector
mygenes <- c("SMAD4", "DKK1", "ASXL3", "ERG", "CKLF", "TIAM1", "VHL", "BTD", "EMP1", "MALL", "PAX3")
# write in a file
write(x=mygenes,
file="gene_list.txt")
Regarding paths of files:
When you write a file, you can also specify a full or relative path:
# Write to home directory
write(x=mygenes,
file="~/gene_list.txt")
# Write to one directory up
write(x=mygenes,
file="../gene_list.txt")
On data frames or matrices
- Read in a file into a data frame with the read.table function:
a <- read.table(file="file.txt")
You can convert it as a matrix, if needed, with:
a <- as.matrix(read.table(file="file.txt"))
- Write a data frame or matrix to a file:
write.table(x=a,
file="file.txt")
Useful arguments:
- Note that “\t” stands for tab-delimitation
Install packages
R base
A set a standard packages which are supplied with R by default.
Example: package base (write, table, rownames functions), package utils (read.table, str functions), package stats (var, na.omit, median functions).
R contrib
All other packages:
- CRAN: Comprehensive R Archive Network + 15356* packages available + find packages in https://cran.r-project.org/web/packages/
- Bioconductor: + 1823* packages available + find packages in https://bioconductor.org/packages
*As of February 2020
Install a CRAN package using install.packages:
install.packages('BiocManager', repos = 'http://cran.us.r-project.org', dependencies = TRUE)
Install a Bioconductor package using BiocManager::install:
library('BiocManager')
BiocManager::install('GOstats')
Exercise: warming up !
- Ex1.
- Create a numeric vector y which contains the numbers from 2 to 11, both included.
- How many elements are in y? I.e what is the length of vector y ?
- Show the 3rd and the 6th elements of y.
- Show all elements of y that have a value inferior to 7.
- Ex2.
- Create the vector x of 1000 random numbers from the normal distribution (with rnorm).
- What are the mean, median, minimum and maximum values of x?
- Ex3.
- Create vector y2 as: y2 <- c(1, 11, 5, 62, NA, 18, 2, 8, NA)
- What is the sum of all elements in y2 ?
- Which elements of y2 are also present in y?
- Remove NA values from y2.
- Ex4.
- Create the following data frame:
43 | 181 | M |
34 | 172 | F |
22 | 189 | M |
27 | 167 | F |
with row names: John, Jessica, Steve, Rachel and column names: Age, Height, Sex.
- Check the structure of df with str().
- Calculate the average age and height in df.
- Change the row names of df so the data becomes anonymous (use for example Patient1, Patient2, etc.)
- Write df to the file mydf.txt with write.table(). Explore parameters sep, row.names, col.names, quote.
CORRECTION
# Ex1.
#Create a numeric vector y which contains the numbers from 2 to 11, both included.
y <- 2:11
# same as y <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
#How many elements are in y? I.e what is the length of vector y ?
length(y)
#Show the 3rd and the 6th elements of y.
y[c(3, 6)]
#Show all elements of y that have a value inferior to 7.
y[y < 7]
# Ex2.
#Create the vector x of 1000 random numbers from the normal distribution (with rnorm).
x <- rnorm(1000)
#What are the mean, median, minimum and maximum values of x?
mean(x); median(x); min(x); max(x)
# more straightforward:
summary(x)
# Ex3.
#Create vector y2 as: y2 <- c(1, 11, 5, 62, NA, 18, 2, 8, NA)
y2 <- c(1, 11, 5, 62, NA, 18, 2, 8, NA)
#What is the sum of all elements in y2 ?
sum(y2, na.rm = TRUE)
# same as sum(na.omit(y2))
#Which elements of y2 are also present in y?
y2[y2 %in% y]
#Remove NA values from y2.
y2 <- na.omit(y2)
# Ex4.
#Create the following data frame (I will call it df):
#with row names: John, Jessica, Steve, Rachel
#and column names: Age, Height, Sex.
df <- data.frame(Age=c(43, 34, 22, 27),
Height=c(181, 172, 189, 167),
Sex=c("M", "F", "M", "F"),
row.names = c("John", "Jessica", "Steve", "Rachel"))
#Check the structure of df with str().
str(df)
#Calculate the average age and height in df.
mean(df$Age) # same as mean(df[,"Age"])
mean(df$Height) # same as mean(df[,"Height"])
# or
colMeans(df[,c("Age", "Height")])
#Change the row names of df so the data becomes anonymous
# (use for example Patient1, Patient2, etc.)
rownames(df) <- c("Patient1", "Patient2", "Patient3", "Patient4")
#Write df to the file mydf.txt with write.table().
# Explore parameters sep, row.names, col.names, quote.
write.table(df,
"mydf.txt",
sep="\t",
row.names = TRUE,
col.names = NA,
quote = FALSE)