Reviewing some R basics
Open the RStudio software.
Everything that stores any kind of data in R is an object.
- Assignment operators:
  - <- or =
  - Essentially the same, but to avoid confusion:
    - Use <- for assignments
    - Keep = for function arguments
- Assigning a value to the object B:
B <- 10
- Reassigning: modifying the content of an object:
B + 10
B is unchanged: the result is printed but not stored.
B <- B + 10
B has changed: it now holds 20.
- You can see the objects you created in the upper right panel in RStudio: the environment.
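Outside the RStudio panel, the same information is available from the console. A minimal sketch, reusing the object B from the example above and the base function ls():

```r
# Assign, then reassign, the object B
B <- 10
B <- B + 10

# Print the current content of B
B

# List the names of the objects defined in the current environment
ls()
```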
Data types and data structures
Each object has a data type:
- Numeric (number - integer or double)
- Character (text)
- Logical (TRUE / FALSE)
a <- 10
mode(a)
typeof(a)
str(a)

b <- "word"
mode(b)
typeof(b)
str(b)
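The integer/double distinction and the logical type mentioned in the list above can be checked the same way. A short sketch; the object names n, i and v are only illustrative:

```r
# A plain number is stored as a double by default
n <- 10
typeof(n)      # "double"

# The L suffix requests an integer
i <- 10L
typeof(i)      # "integer"

# mode() reports both as "numeric"
mode(n)
mode(i)

# Logical values
v <- TRUE
typeof(v)      # "logical"
```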
The main data structures in R include:
- Vector
- Matrix
- Data frame
Create a vector:
a <- 1:5
Create a second vector, and check which elements of that second vector are also present in a with %in%:
b <- 3:8
b[b %in% a]
Check the length of (=number of elements in) a vector:
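The code for this step does not appear here. A minimal sketch, assuming the base function length() is what is meant and reusing the vectors a and b from above:

```r
a <- 1:5
b <- 3:8

# Number of elements in each vector
length(a)   # 5
length(b)   # 6

# Number of elements of b that are also present in a
length(b[b %in% a])   # 3
```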
Create a data frame:
# stringsAsFactors: ensures that characters are treated as characters and not as factors
d <- data.frame(Name=c("Maria", "Juan", "Alba"),
                Age=c(23, 25, 31),
                Vegetarian=c(TRUE, TRUE, FALSE),
                stringsAsFactors = FALSE)
Check the dimensions of a data frame:
# Number of rows
nrow(d)
# Number of columns
ncol(d)
# Dimensions (first element is the number of rows, second element is the number of columns)
dim(d)
Select the rows of the data frame where the Age column is greater than 24:
d[d$Age > 24,]
Select the rows of the data frame where the Age column is greater than 24 AND Vegetarian is TRUE:
d[d$Age > 24 & d$Vegetarian == TRUE,]
Paths and directories
- Get the path of the current directory (i.e. where you are working at the moment) with getwd (get working directory):
- Change working directory with setwd (set working directory)
Go to a directory giving the absolute path:
Go to a directory giving the relative path:
You are now in: “~/rnaseq_course/differential_expression”
Move one directory “up” the tree:
You are now in: “~/rnaseq_course”
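The navigation steps above can be sketched as follows. To keep the example runnable on any machine, it assumes a hypothetical copy of the rnaseq_course directory created inside R's temporary directory rather than in the home directory; the names start_dir and course_dir are illustrative only.

```r
# Remember where we started so we can come back at the end
start_dir <- getwd()

# Build a hypothetical course directory inside R's temporary directory
course_dir <- file.path(tempdir(), "rnaseq_course")
dir.create(file.path(course_dir, "differential_expression"), recursive = TRUE)

# Go to a directory giving the absolute path
setwd(course_dir)
getwd()

# Go to a directory giving the relative path (relative to the current one)
setwd("differential_expression")
getwd()

# Move one directory "up" the tree
setwd("..")
getwd()

# Return to the original working directory
setwd(start_dir)
```

With the real course layout, the absolute path would be the one quoted in the text, starting with ~/rnaseq_course.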
NA (Not Available) is a recognized element in R.
- Finding missing values in a vector
# Create vector
x <- c(4, 2, 7, NA)
# Find missing values in vector:
is.na(x)
# Remove missing values
na.omit(x)
x[ !is.na(x) ]
- Some functions can deal with NAs, either by default, or with specific arguments:
x <- c(4, 2, 7, NA)
# default arguments
mean(x)
# set na.rm=TRUE
mean(x, na.rm=TRUE)
- In a matrix or a data frame, keep only rows where there are no NA values:
# Create matrix with some NA values
mydata <- matrix(c(1:10, NA, 12:2, NA, 15:20, NA), ncol=3)
# Keep only rows without NAs
mydata[complete.cases(mydata), ]
# or
na.omit(mydata)
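Before removing anything, it can help to count how many values are missing. A small sketch using the same x and mydata as above; combining sum() and colSums() with is.na() is one possible approach, not the only one:

```r
x <- c(4, 2, 7, NA)

# Count the missing values in a vector
sum(is.na(x))   # 1

# Same matrix as above
mydata <- matrix(c(1:10, NA, 12:2, NA, 15:20, NA), ncol=3)

# Count the missing values per column
colSums(is.na(mydata))   # 0 1 2

# Number of rows without any NA
sum(complete.cases(mydata))   # 7
```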
Check this R blogger post on missing/null values
Read in, write out
- Read a file as a vector with the scan function
# Read in file
scan(file="file.txt")
# Save in object
k <- scan(file="file.txt")
By default, scan reads “double” (numeric) elements: it fails if the input contains characters.
If non-numeric, you need to specify the type of data contained in the file:
# specify the type of data to scan
scan(file="file.txt", what="character")
Regarding paths of files:
If the file is not in the current directory, you can provide a full or relative path. For example, if located in the home directory, read it as:
scan(file="~/file.txt", what="character")
- Write the content of a vector in a file:
# create a vector
mygenes <- c("SMAD4", "DKK1", "ASXL3", "ERG", "CKLF", "TIAM1", "VHL", "BTD", "EMP1", "MALL", "PAX3")
# write in a file
write(x=mygenes, file="gene_list.txt")
Regarding paths of files:
When you write a file, you can also specify a full or relative path:
# Write to home directory
write(x=mygenes, file="~/gene_list.txt")
# Write to one directory up
write(x=mygenes, file="../gene_list.txt")
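To see write and scan working together, here is a small round-trip sketch. It assumes a hypothetical file placed in R's temporary directory (so nothing is written to your home directory) and a shortened gene list; the names gene_file, genes_back and scan_error are illustrative.

```r
# Hypothetical file location inside R's temporary directory
gene_file <- file.path(tempdir(), "gene_list.txt")

mygenes <- c("SMAD4", "DKK1", "ASXL3", "ERG", "CKLF")

# Write the vector to the file (one element per line for character vectors)
write(x=mygenes, file=gene_file)

# Read it back, specifying that the file contains characters
genes_back <- scan(file=gene_file, what="character")

# Check that the vector read from the file matches the one written
identical(mygenes, genes_back)

# With the default (numeric) type, scan fails on this file;
# tryCatch captures the error message instead of stopping
scan_error <- tryCatch(scan(file=gene_file),
                       error = function(e) conditionMessage(e))
scan_error
```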
On data frames or matrices
- Read in a file into a data frame with the read.table function:
a <- read.table(file="file.txt")
You can convert it to a matrix, if needed, with:
a <- as.matrix(read.table(file="file.txt"))
- Write a data frame or matrix to a file:
- Note that “\t” stands for the tab character, so a file written with sep="\t" is tab-delimited
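The code for this step does not appear here. A minimal sketch, assuming write.table from the utils package is the intended function; the output file name and the choice of arguments (no row names, no quotes) are illustrative assumptions:

```r
# Small data frame, same as the one created earlier
d <- data.frame(Name=c("Maria", "Juan", "Alba"),
                Age=c(23, 25, 31),
                Vegetarian=c(TRUE, TRUE, FALSE),
                stringsAsFactors = FALSE)

# Hypothetical output file inside R's temporary directory
out_file <- file.path(tempdir(), "d_table.txt")

# Write a tab-delimited file, without row names and without quotes
write.table(d, file=out_file, sep="\t",
            row.names=FALSE, col.names=TRUE, quote=FALSE)

# Read it back; header=TRUE tells read.table that the first line holds the column names
d2 <- read.table(file=out_file, sep="\t", header=TRUE, stringsAsFactors=FALSE)
d2
```

Reading the file back is a quick way to confirm that the separator and header settings match between writing and reading.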
A set of standard packages is supplied with R by default.
Example: package base (write, table, rownames functions), package utils (read.table, str functions), package stats (var, na.omit, median functions).
All other packages:
- CRAN: Comprehensive R Archive Network
  + 15356* packages available
  + find packages at https://cran.r-project.org/web/packages/
- Bioconductor:
  + 1823* packages available
  + find packages at https://bioconductor.org/packages
*As of February 2020
Install a CRAN package using install.packages:
install.packages('BiocManager', repos = 'http://cran.us.r-project.org', dependencies = TRUE)
Install a Bioconductor package using BiocManager::install:
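The code for this step does not appear here. A minimal sketch, wrapped in a hypothetical helper function (install_bioc is not part of any package) so that nothing is downloaded until you call it; the package name DESeq2 in the commented call is only an example:

```r
# Hypothetical helper: installs one Bioconductor package by name
install_bioc <- function(pkg) {
  # Install BiocManager from CRAN first if it is not already available
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Install the requested Bioconductor package
  BiocManager::install(pkg)
}

# Example call (needs an internet connection), left commented out here:
# install_bioc("DESeq2")
```

In an interactive session, calling BiocManager::install with the package name directly does the same thing once BiocManager is installed.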
Exercise: warming up !
- Create a numeric vector y which contains the numbers from 2 to 11, both included.
- How many elements are in y? I.e. what is the length of vector y?
- Show the 3rd and the 6th elements of y.
- Show all elements of y that have a value less than 7.
- Create the vector x of 1000 random numbers from the normal distribution (with rnorm).
- What are the mean, median, minimum and maximum values of x?
- Create vector y2 as: y2 <- c(1, 11, 5, 62, NA, 18, 2, 8, NA)
- What is the sum of all elements in y2 ?
- Which elements of y2 are also present in y?
- Remove NA values from y2.
- Create the following data frame (call it df):
          Age  Height  Sex
John       43     181    M
Jessica    34     172    F
Steve      22     189    M
Rachel     27     167    F
with row names: John, Jessica, Steve, Rachel and column names: Age, Height, Sex.
- Check the structure of df with str().
- Calculate the average age and height in df.
- Change the row names of df so the data becomes anonymous (use for example Patient1, Patient2, etc.)
- Write df to the file mydf.txt with write.table(). Explore parameters sep, row.names, col.names, quote.
# Ex1.
# Create a numeric vector y which contains the numbers from 2 to 11, both included.
y <- 2:11
# same as
y <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
# How many elements are in y? I.e. what is the length of vector y?
length(y)
# Show the 3rd and the 6th elements of y.
y[c(3, 6)]
# Show all elements of y that have a value less than 7.
y[y < 7]

# Ex2.
# Create the vector x of 1000 random numbers from the normal distribution (with rnorm).
x <- rnorm(1000)
# What are the mean, median, minimum and maximum values of x?
mean(x); median(x); min(x); max(x)
# more straightforward:
summary(x)

# Ex3.
# Create vector y2 as: y2 <- c(1, 11, 5, 62, NA, 18, 2, 8, NA)
y2 <- c(1, 11, 5, 62, NA, 18, 2, 8, NA)
# What is the sum of all elements in y2?
sum(y2, na.rm = TRUE)
# same as
sum(na.omit(y2))
# Which elements of y2 are also present in y?
y2[y2 %in% y]
# Remove NA values from y2.
y2 <- na.omit(y2)

# Ex4.
# Create the following data frame (I will call it df):
# with row names: John, Jessica, Steve, Rachel
# and column names: Age, Height, Sex.
df <- data.frame(Age=c(43, 34, 22, 27),
                 Height=c(181, 172, 189, 167),
                 Sex=c("M", "F", "M", "F"),
                 row.names = c("John", "Jessica", "Steve", "Rachel"))
# Check the structure of df with str().
str(df)
# Calculate the average age and height in df.
mean(df$Age) # same as mean(df[,"Age"])
mean(df$Height) # same as mean(df[,"Height"])
# or
colMeans(df[,c("Age", "Height")])
# Change the row names of df so the data becomes anonymous
# (use for example Patient1, Patient2, etc.)
rownames(df) <- c("Patient1", "Patient2", "Patient3", "Patient4")
# Write df to the file mydf.txt with write.table().
# Explore parameters sep, row.names, col.names, quote.
write.table(df, "mydf.txt",
            sep="\t",
            row.names = TRUE,
            col.names = NA,
            quote = FALSE)