## 9.9 Exercise 5. Data frame manipulation

Create the script “exercise5.R” and save it to the “Rcourse/Module1” directory: you will save all the commands of exercise 5 in that script.
Remember you can comment the code using #.

correction

``````getwd()
setwd("Rcourse/Module1")
setwd("~/Rcourse/Module1")``````

### 9.9.1 Exercise 5a

1- Create the following data frame:

 43 181 M 34 172 F 22 189 M 27 167 F

With Row names: John, Jessica, Steve, Rachel.
And Column names: Age, Height, Sex.

correction

``````mydf <- data.frame(Age=c(43, 34, 22, 27),
Height=c(181, 172, 189, 167),
Sex=c("M", "F", "M", "F"),
row.names = c("John", "Jessica", "Steve", "Rachel"),
stringsAsFactors=FALSE)``````

2- Check the structure of mydf with str().

correction

``str(mydf)``

3- Calculate the average age and height in mydf

Try different approaches:

• Calculate the average for each column separately.

correction

``````mean(mydf\$Age)
mean(mydf\$Height)``````
• Calculate the average of both columns simultaneously using the apply() function.

correction

``````# we have to remove the Sex column: we can calculate the average only with numbers
apply(mydf[,-3], 2, mean)
apply(mydf[,1:2], 2, mean)
apply(mydf[,-grep("Sex", colnames(mydf))], 2, mean)``````

4- Add one row to mydf: Georges who is 53 years old and 168cm tall.

correction

``````# Georges= allows us to enter the row name at the same time as we add a row
mydf <- rbind(mydf, Georges=c(53, 168, "M"))``````

5- Change the row names of mydf so the data becomes anonymous: Use Patient1, Patient2, etc. instead of actual names.

correction

``````rownames(mydf) <- c("Patient1", "Patient2", "Patient3", "Patient4", "Patient5")
# try also the paste function!
rownames(mydf) <- paste("Patient", 1:5, sep="")``````

6- Create the data frame mydf2 that is a subset of mydf containing only the female entries.

correction

``````# which elements are female ("F" in the "Sex" colum)
mydf\$Sex=="F"
# retrieve rows that contain the female entries, and save in mydf2
mydf2 <- mydf[mydf\$Sex=="F",]``````

7- Create the data frame mydf3 that is a subset of mydf containing only entries of males taller than 170.

correction

``````# which entries are males
mydf\$Sex=="M"
# which entries are greater than 170 in column "Height"
mydf\$Sex=="M" & mydf\$Height > 170
# retrieve rows that contain the males that are taller than 170, and save in mydf3
mydf3 <- mydf[mydf\$Sex=="M" & mydf\$Height > 170,]``````

### 9.9.2 Exercise 5b

1. Create two data frames mydf1 and mydf2 as:

mydf1:

 1 14 2 12 3 15 4 10

mydf2:

 1 paul 2 helen 3 emily 4 john 5 mark

With column names: “id”, “age” for mydf1, and “id”, “name” for mydf2.

correction

``````mydf1 <- data.frame(id=1:4, age=c(14,12,15,10))
mydf2 <- data.frame(id=1:5, name=c("paul", "helen", "emily", "john", "mark"))``````

2- Merge mydf1 and mydf2 by their “id” column. Look for the help page of merge and/or Google it!

correction

``````# input 2 data frames
# "by" columns indicate by which column you want to merge the data
merge(x=mydf1, y=mydf2, by.x="id", by.y="id")
mydf3 <- merge(x=mydf1, y=mydf2, by="id")``````

3- Order mydf3 by decreasing age. Look for the help page of order.

correction

``````# order the age column (default is increasing order)
order(mydf3\$age)
# order the age column by decreasing order
order(mydf3\$age, decreasing = TRUE)
# order the whole data frame by the column age in decreasing order
mydf3[order(mydf3\$age, decreasing = TRUE), ]``````

### 9.9.3 Exercise 5c

correction

``````# failing: download.file("https://github.com/sbcrg/CRG_RIntroduction/blob/master/genes_dataframe.RData", "genes_dataframe.RData")

2- The function dir() lists the files and directories present in the current directory: check if genes_dataframe.RData was copied.

correction

``dir()``

correction

``load("genes_dataframe.RData")``

4- genes_dataframe.RData contains the mydf_genes object: is it now present in your environment?

correction

``ls()``

5- Explore mydf_genes and see what it contains You can use a variety of functions: str, head, tail, dim, colnames, rownames, class…

correction

``````str(mydf_genes)
tail(mydf_genes)
dim(mydf_genes)
colnames(mydf_genes)
rownames(mydf_genes)
class(mydf_genes)``````

6- Select rows for which pvalue_KOvsWT < 0.05 AND log2FoldChange_KOvsWT > 0.5. Store in the up object.

correction

``````# rows where pvalue_KOvsWT < 0.05
mydf_genes\$pvalue_KOvsWT < 0.05
# rows where log2FoldChange_KOvsWT > 0.5
mydf_genes\$log2FoldChange_KOvsWT > 0.5
# rows that comply both of the above conditions
mydf_genes\$pvalue_KOvsWT < 0.05 & mydf_genes\$log2FoldChange_KOvsWT > 0.5
# select rows for which pvalue_KOvsWT < 0.05 AND log2FoldChange_KOvsWT > 0.5
up <- mydf_genes[mydf_genes\$pvalue_KOvsWT < 0.05 &
mydf_genes\$log2FoldChange_KOvsWT > 0.5,]``````

How many rows (genes) were selected?

7- Select from the up object the Zinc finger protein coding genes (i.e. the gene symbol starts with Zfp). Use the grep() function.

correction

``````# extract gene symbol column
up\$gene_symbol
# use grep to get the genes matching the pattern "Zfp"
up[grep("Zf", up\$gene_symbol), ]``````

8- Select rows for which pvalue_KOvsWT < 0.05 AND log2FoldChange_KOvsWT is > 0.5 OR < -0.5. For the selection of log2FoldChange: give the abs function a try!
Store in the diff_genes object.

correction

``````# rows where pvalue_KOvsWT < 0.05
mydf_genes\$pvalue_KOvsWT < 0.05
# rows where log2FoldChange_KOvsWT > 0.5
mydf_genes\$log2FoldChange_KOvsWT > 0.5
# rows where log2FoldChange_KOvsWT < -0.5
mydf_genes\$log2FoldChange_KOvsWT > -0.5
# rows where log2FoldChange_KOvsWT < -0.5 OR log2FoldChange_KOvsWT > 0.5
mydf_genes\$log2FoldChange_KOvsWT > 0.5 | mydf_genes\$log2FoldChange_KOvsWT > -0.5
# same as above but using the abs function
abs(mydf_genes\$log2FoldChange_KOvsWT) > 0.5
# combine all required criteria
mydf_genes\$pvalue_KOvsWT < 0.05 & abs(mydf_genes\$log2FoldChange_KOvsWT) > 0.5
# extract corresponding entries
diff_genes <- mydf_genes[mydf_genes\$pvalue_KOvsWT < 0.05 &
abs(mydf_genes\$log2FoldChange_KOvsWT) > 0.5,]``````

How many rows (genes) were selected?