Reviewing some R basics

Open the RStudio software.

Basics

Objects

Everything that stores any kind of data in R is an object.

Assignment operators
- <- or =
- Essentially the same but, to avoid confusion:
  - Use <- for assignments
  - Keep = for function arguments
Assigning a value to the object B:
```
B <- 10
```
Reassigning: modifying the content of an object:
```
# This only prints the result: B is unchanged!
B + 10

# This stores the result back in B: B is changed!
B <- B + 10
```

You can see the objects you created in the upper right panel in RStudio: the environment.
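
As a small sketch of the convention above (<- for assignments, = for function arguments) and of what the environment panel lists, consider the following. The object names my_numbers and avg are made up for this illustration.

```r
# Use <- to assign, and = to name function arguments
my_numbers <- c(2, 4, 6)
avg <- mean(x = my_numbers)

# List the objects that currently exist (what the environment panel shows)
ls()

# Remove an object that is no longer needed
rm(my_numbers)

# Check whether an object still exists
exists("my_numbers")
```

ls() and rm() do from the console what the environment panel lets you do with the mouse.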

Data types and data structures

Each object has a data type:

Numeric (number - integer or double)
Character (text)
Logical (TRUE / FALSE)

Number:

a <- 10
mode(a)
typeof(a)
str(a)

Text:

b <- "word"
mode(b)
typeof(b)
str(b)
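
The logical type and the integer / double distinction are not shown above. A minimal sketch, where the object name is_veg is only an illustration:

```r
# A number typed without a suffix is stored as a double
typeof(10)       # "double"

# The L suffix creates an integer
typeof(10L)      # "integer"

# mode() reports both as "numeric"
mode(10L)        # "numeric"

# Logical values are TRUE or FALSE
is_veg <- TRUE
typeof(is_veg)   # "logical"

# Test a type, or convert from one type to another
is.numeric(10L)         # TRUE
as.numeric("3.5") + 1   # 4.5
```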

The main data structures in R are:

Vector
Factor
Matrix
Data frame

Create a vector:

a <- 1:5

Create a second vector, and check which elements of that second vector are also present in a with %in%:

b <- 3:8

b[b %in% a]
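
To see what %in% does step by step, the sketch below first shows the logical vector it returns, then a few ways of using it. It reuses the same a and b as above.

```r
a <- 1:5
b <- 3:8

# %in% returns one TRUE / FALSE per element of b
b %in% a          # TRUE TRUE TRUE FALSE FALSE FALSE

# Keep the elements of b that are in a
b[b %in% a]       # 3 4 5

# Keep the elements of b that are NOT in a
b[!(b %in% a)]    # 6 7 8

# Positions (indices) in b of the elements found in a
which(b %in% a)   # 1 2 3
```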

Check the length (i.e. the number of elements) of a vector:

length(b)
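
Factors and matrices are listed among the main data structures. As a brief illustration (the object names sex and m are chosen only for this sketch):

```r
# Factor: a categorical variable with a fixed set of levels
sex <- factor(c("M", "F", "M", "F"))
levels(sex)   # "F" "M"
table(sex)    # number of elements per level

# Matrix: two-dimensional, all elements of the same type
# (filled column by column by default)
m <- matrix(1:6, nrow = 2)
dim(m)        # 2 3
m[2, 3]       # element in row 2, column 3: 6
```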

Create a data frame:

# stringsAsFactors = FALSE: keeps character columns as characters instead of converting them to factors
d <- data.frame(Name=c("Maria", "Juan", "Alba"), 
        Age=c(23, 25, 31),
        Vegetarian=c(TRUE, TRUE, FALSE),
        stringsAsFactors = FALSE)

Check the dimensions of a data frame:

# Number of rows
nrow(d)

# Number of columns
ncol(d)

# Dimensions (first element is the number of rows, second element is the number of columns)
dim(d)

Select the rows of the data frame where the Age column is greater than 24:

d[d$Age > 24,]

Select the rows of the data frame where the Age column is greater than 24 AND Vegetarian is TRUE:

d[d$Age > 24 & d$Vegetarian == TRUE,]
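
A few variations on row selection, as a sketch on the same data frame d: picking a single column, using subset() as an alternative notation, and combining conditions with OR (|).

```r
d <- data.frame(Name = c("Maria", "Juan", "Alba"),
                Age = c(23, 25, 31),
                Vegetarian = c(TRUE, TRUE, FALSE),
                stringsAsFactors = FALSE)

# Rows where Age > 24, keeping only the Name column
d[d$Age > 24, "Name"]              # "Juan" "Alba"

# Same filter as above (Age > 24 AND Vegetarian), written with subset()
subset(d, Age > 24 & Vegetarian)   # only the row for Juan

# Rows matching at least one of the two conditions (OR)
d[d$Age > 24 | d$Vegetarian, ]     # all three rows
```

Since Vegetarian is already a logical column, it can be used directly as a condition without comparing it to TRUE.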

Paths and directories

Get the path of the current directory (know where you are working at the moment) with getwd (get working directory):
```
getwd()
```
Change the working directory with setwd (set working directory).
Go to a directory by giving its absolute path:
```
setwd("~/rnaseq_course")
```
Go to a directory by giving its relative path:
```
setwd("differential_expression")
```
You are now in: “~/rnaseq_course/differential_expression”
Move one directory “up” the tree:
```
setwd("..")
```
You are now in: “~/rnaseq_course”
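
The sketch below shows one possible way to move between directories safely: it remembers the starting directory, builds a path with file.path, and comes back at the end. The folder name rnaseq_course_demo is hypothetical and is created inside R's temporary directory so that nothing in your own folders is touched.

```r
# Remember where we started so we can come back
start_dir <- getwd()

# Build a path from its parts; tempdir() stands in for a project folder
project_dir <- file.path(tempdir(), "rnaseq_course_demo")
dir.create(project_dir, showWarnings = FALSE)

# Move into the new directory, then check where we are
setwd(project_dir)
basename(getwd())   # "rnaseq_course_demo"

# Check that a directory exists before trying to move into it
dir.exists("differential_expression")   # FALSE: not created in this demo

# Return to the starting directory
setwd(start_dir)
```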

Missing values

NA (Not Available) is a recognized element in R.

Finding missing values in a vector

# Create vector
x <- c(4, 2, 7, NA)

# Find missing values in vector:
is.na(x)

# Remove missing values
na.omit(x)
x[ !is.na(x) ]

Some functions can deal with NAs, either by default, or with specific arguments:

x <- c(4, 2, 7, NA)

# default arguments
mean(x)

# set na.rm=TRUE
mean(x, na.rm=TRUE)
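
Before removing or ignoring NAs, it is often useful to know whether there are any, how many, and where. A minimal sketch on the same vector x:

```r
x <- c(4, 2, 7, NA)

# Is there at least one NA?
anyNA(x)                # TRUE

# How many NAs?
sum(is.na(x))           # 1

# At which positions?
which(is.na(x))         # 4

# Other summary functions also accept na.rm
sum(x, na.rm = TRUE)    # 13
max(x, na.rm = TRUE)    # 7
```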

In a matrix or a data frame, keep only rows where there are no NA values:

# Create matrix with some NA values
mydata <- matrix(c(1:10, NA, 12:2, NA, 15:20, NA), ncol=3)

# Keep only rows without NAs
mydata[complete.cases(mydata), ]
# or
na.omit(mydata)
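
The same approach works on data frames. The sketch below uses a small made-up data frame (df_na, with invented values) to count NAs per column and to filter rows either on all columns or on one column only.

```r
# Hypothetical data frame with some missing values
df_na <- data.frame(gene = c("SMAD4", "DKK1", "VHL"),
                    count = c(10, NA, 7),
                    length = c(NA, 1200, 900),
                    stringsAsFactors = FALSE)

# Number of NAs in each column
colSums(is.na(df_na))              # gene 0, count 1, length 1

# Keep only rows without any NA
df_na[complete.cases(df_na), ]     # only the VHL row

# Keep rows where one specific column is not NA
df_na[!is.na(df_na$count), ]       # the SMAD4 and VHL rows
```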

Check this R blogger post on missing/null values

Read in, write out

On vectors

Read a file as a vector with the scan function

# Read in file
scan(file="file.txt")
# Save in an object
k <- scan(file="file.txt")

By default, scan expects “double” (numeric) elements: it fails if the input contains characters.
If the content is not numeric, you need to specify the type of data contained in the file:

# specify the type of data to scan
scan(file="file.txt", 
        what="character")

Regarding paths of files:
If the file is not in the current directory, you can provide a full or relative path. For example, if the file is located in your home directory, read it as:

scan(file="~/file.txt", 
        what="character")

Write the content of a vector in a file:

# create a vector
mygenes <- c("SMAD4", "DKK1", "ASXL3", "ERG", "CKLF", "TIAM1", "VHL", "BTD", "EMP1", "MALL", "PAX3")
# write in a file
write(x=mygenes, 
        file="gene_list.txt")

Regarding paths of files:
When you write a file, you can also specify a full or relative path:

# Write to home directory
write(x=mygenes,
        file="~/gene_list.txt")
# Write to one directory up
write(x=mygenes,
        file="../gene_list.txt")
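
To check that writing and reading behave as expected, you can do a round trip: write a vector to a file, read it back with scan, and compare. In this sketch the file name gene_list_demo.txt is hypothetical and the file is placed in R's temporary directory.

```r
# A short vector of gene names
mygenes <- c("SMAD4", "DKK1", "ASXL3")

# Write to a file in the temporary directory
gene_file <- file.path(tempdir(), "gene_list_demo.txt")
write(x = mygenes, file = gene_file)

# Read it back as character data
genes_back <- scan(file = gene_file, what = "character")

# The vector read from the file matches the original
identical(genes_back, mygenes)   # TRUE
```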

On data frames or matrices

Read in a file into a data frame with the read.table function:

a <- read.table(file="file.txt")

You can convert it to a matrix, if needed, with:

a <- as.matrix(read.table(file="file.txt"))

Write a data frame or matrix to a file:

write.table(x=a,
        file="file.txt")

Useful arguments for write.table include sep, row.names, col.names and quote (explored in Ex4 below).

Note that “\t” stands for a tab delimiter.
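
As an illustration of these arguments, here is a sketch of a tab-delimited round trip with write.table and read.table. The file name people_demo.txt is hypothetical, and the chosen argument values are one reasonable combination, not the only one.

```r
d <- data.frame(Name = c("Maria", "Juan", "Alba"),
                Age = c(23, 25, 31),
                stringsAsFactors = FALSE)
table_file <- file.path(tempdir(), "people_demo.txt")

write.table(x = d,
            file = table_file,
            sep = "\t",          # tab-separated columns
            quote = FALSE,       # no quotes around text
            row.names = FALSE,   # do not write row names
            col.names = TRUE)    # write column names as the first line

d_back <- read.table(file = table_file,
                     header = TRUE,   # the first line holds the column names
                     sep = "\t",
                     stringsAsFactors = FALSE)
d_back
```

When reading, header and sep need to match how the file was written, otherwise the column names end up as a data row or the columns are not split correctly.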

Install packages

R base

A set of standard packages that are supplied with R by default.
Example: package base (write, table, rownames functions), package utils (read.table, str functions), package stats (var, na.omit, median functions).
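
If you want to check which package a function comes from, the sketch below shows two options: the package::function notation, and asking R for the environment in which the function is defined.

```r
# The :: notation makes explicit which package a function comes from
stats::median(c(1, 5, 9))                  # 5

# Name of the package (namespace) where a function is defined
environmentName(environment(median))       # "stats"
environmentName(environment(read.table))   # "utils"
environmentName(environment(write))        # "base"
```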

R contrib

All other packages:

- CRAN: Comprehensive R Archive Network
  - 15356* packages available
  - find packages at https://cran.r-project.org/web/packages/
- Bioconductor:
  - 1823* packages available
  - find packages at https://bioconductor.org/packages

*As of February 2020

Install a CRAN package using install.packages:

install.packages('BiocManager', repos = 'http://cran.us.r-project.org', dependencies = TRUE)

Install a Bioconductor package using BiocManager::install:

library('BiocManager')
BiocManager::install('GOstats')
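
To avoid reinstalling a package that is already present, you can test for it first with requireNamespace. The helper install_if_missing below is a hypothetical function written for this sketch, not part of R or Bioconductor.

```r
# Hypothetical helper: install a CRAN package only when it is missing
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is installed here
install_if_missing("stats")
```

For example, install_if_missing("BiocManager") would only download BiocManager the first time it is called.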

Exercise: warming up!

Ex1.
- Create a numeric vector y which contains the numbers from 2 to 11, both included.
- How many elements are in y? I.e., what is the length of vector y?
- Show the 3rd and the 6th elements of y.
- Show all elements of y that have a value less than 7.
Ex2.
- Create the vector x of 1000 random numbers from the normal distribution (with rnorm).
- What are the mean, median, minimum and maximum values of x?
Ex3.
- Create vector y2 as: y2 <- c(1, 11, 5, 62, NA, 18, 2, 8, NA)
- What is the sum of all elements in y2?
- Which elements of y2 are also present in y?
- Remove NA values from y2.
Ex4.
- Create the following data frame (call it df):

43	181	M
34	172	F
22	189	M
27	167	F

with row names: John, Jessica, Steve, Rachel and column names: Age, Height, Sex.

- Check the structure of df with str().
- Calculate the average age and height in df.
- Change the row names of df so the data becomes anonymous (use for example Patient1, Patient2, etc.).
- Write df to the file mydf.txt with write.table(). Explore parameters sep, row.names, col.names, quote.

CORRECTION


# Ex1.
# Create a numeric vector y which contains the numbers from 2 to 11, both included.
y <- 2:11
# same values as y <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
# How many elements are in y? I.e., what is the length of vector y?
length(y)
# Show the 3rd and the 6th elements of y.
y[c(3, 6)]
# Show all elements of y that have a value less than 7.
y[y < 7]

# Ex2.
# Create the vector x of 1000 random numbers from the normal distribution (with rnorm).
x <- rnorm(1000)
# What are the mean, median, minimum and maximum values of x?
mean(x); median(x); min(x); max(x)
# more straightforward:
summary(x)

# Ex3.
# Create vector y2 as: y2 <- c(1, 11, 5, 62, NA, 18, 2, 8, NA)
y2 <- c(1, 11, 5, 62, NA, 18, 2, 8, NA)
# What is the sum of all elements in y2?
sum(y2, na.rm = TRUE)
# same as sum(na.omit(y2))
# Which elements of y2 are also present in y?
y2[y2 %in% y]
# Remove NA values from y2.
y2 <- na.omit(y2)
# or: y2 <- y2[!is.na(y2)]

# Ex4.
# Create the following data frame (I will call it df):
# with row names: John, Jessica, Steve, Rachel
# and column names: Age, Height, Sex.
df <- data.frame(Age=c(43, 34, 22, 27),
                 Height=c(181, 172, 189, 167),
                 Sex=c("M", "F", "M", "F"),
                 row.names = c("John", "Jessica", "Steve", "Rachel"))
# Check the structure of df with str().
str(df)
# Calculate the average age and height in df.
mean(df$Age) # same as mean(df[,"Age"])
mean(df$Height) # same as mean(df[,"Height"])
# or
colMeans(df[,c("Age", "Height")])
# Change the row names of df so the data becomes anonymous
# (use for example Patient1, Patient2, etc.)
rownames(df) <- c("Patient1", "Patient2", "Patient3", "Patient4")
# Write df to the file mydf.txt with write.table().
# Explore parameters sep, row.names, col.names, quote.
write.table(df,
            "mydf.txt",
            sep="\t",
            row.names = TRUE,
            col.names = NA,
            quote = FALSE)