9.9 Exercise 5. Data frame manipulation
Create the script “exercise5.R” and save it to the “Rcourse/Module1” directory: you will save all the commands of exercise 5 in that script.
Remember you can comment the code using #.
9.9.1 Exercise 5a
1- Create the following data frame and store it in an object named mydf:
43 | 181 | M |
34 | 172 | F |
22 | 189 | M |
27 | 167 | F |
With Row names: John, Jessica, Steve, Rachel.
And Column names: Age, Height, Sex.
correction
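One possible way to build this data frame is sketched below. It assumes the object is called mydf, the name used in the following questions, and sets the row names directly in the data.frame() call.

```r
# Build the data frame column by column; row.names sets the row labels
mydf <- data.frame(
  Age = c(43, 34, 22, 27),
  Height = c(181, 172, 189, 167),
  Sex = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Print the result to check it against the table above
mydf
```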
2- Check the structure of mydf with str().
correction
3- Calculate the average age and height in mydf.
Try different approaches:
- Calculate the average for each column separately.
correction
- Calculate the average of both columns simultaneously using the apply() function.
correction
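Both approaches could look like the sketch below. The variable names mean_age, mean_height and col_means are only illustrative. Note that apply() is given the two numeric columns only: passing the whole data frame would convert it to a character matrix because of the Sex column, and mean() would then return NA.

```r
# Recreate mydf so that the example runs on its own
mydf <- data.frame(
  Age = c(43, 34, 22, 27),
  Height = c(181, 172, 189, 167),
  Sex = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Approach 1: one column at a time
mean_age <- mean(mydf$Age)
mean_height <- mean(mydf$Height)

# Approach 2: both numeric columns at once; MARGIN = 2 means "per column"
col_means <- apply(mydf[, c("Age", "Height")], 2, mean)
col_means
```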
4- Add one row to mydf: Georges who is 53 years old and 168cm tall.
correction
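A minimal sketch of adding the row with rbind() follows. The exercise does not state a Sex value for Georges, so it is left as NA here; that choice is an assumption of this example. Building the new row as a one-row data frame keeps Age and Height numeric, whereas binding a plain vector that mixes numbers and text would turn every column into character.

```r
# Recreate mydf so that the example runs on its own
mydf <- data.frame(
  Age = c(43, 34, 22, 27),
  Height = c(181, 172, 189, 167),
  Sex = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# One-row data frame with the same column names; Sex is unknown (assumption: NA)
new_row <- data.frame(Age = 53, Height = 168, Sex = NA_character_,
                      row.names = "Georges")

# Append the new row at the bottom of mydf
mydf <- rbind(mydf, new_row)
mydf
```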
5- Change the row names of mydf so the data becomes anonymous: Use Patient1, Patient2, etc. instead of actual names.
correction
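One way to generate the anonymous labels without typing them by hand is to paste a prefix onto the row index, as sketched below. The example uses the four original rows, but the same line works for any number of rows.

```r
# Recreate mydf so that the example runs on its own
mydf <- data.frame(
  Age = c(43, 34, 22, 27),
  Height = c(181, 172, 189, 167),
  Sex = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Replace the names with Patient1, Patient2, ... (one label per row)
rownames(mydf) <- paste0("Patient", seq_len(nrow(mydf)))
mydf
```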
6- Create the data frame mydf2 that is a subset of mydf containing only the female entries.
correction
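The subset can be obtained with a logical test on the Sex column, as in this sketch (shown on the four original rows):

```r
# Recreate mydf so that the example runs on its own
mydf <- data.frame(
  Age = c(43, 34, 22, 27),
  Height = c(181, 172, 189, 167),
  Sex = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Keep the rows where Sex is "F"; the empty slot after the comma keeps all columns
mydf2 <- mydf[mydf$Sex == "F", ]
mydf2
```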
7- Create the data frame mydf3 that is a subset of mydf containing only entries of males taller than 170.
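Two conditions can be combined with the & operator, as in the following sketch (again on the four original rows):

```r
# Recreate mydf so that the example runs on its own
mydf <- data.frame(
  Age = c(43, 34, 22, 27),
  Height = c(181, 172, 189, 167),
  Sex = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Keep the rows that are male AND taller than 170
mydf3 <- mydf[mydf$Sex == "M" & mydf$Height > 170, ]
mydf3
```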
9.9.2 Exercise 5b
1- Create two data frames mydf1 and mydf2 as:
mydf1:
1 | 14 |
2 | 12 |
3 | 15 |
4 | 10 |
mydf2:
1 | paul |
2 | helen |
3 | emily |
4 | john |
5 | mark |
With column names: “id”, “age” for mydf1, and “id”, “name” for mydf2.
correction
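The two data frames could be created as sketched below; stringsAsFactors = FALSE is added only so that the names stay plain character strings on older R versions.

```r
# mydf1: four ids with their ages
mydf1 <- data.frame(id = 1:4, age = c(14, 12, 15, 10))

# mydf2: five ids with their names
mydf2 <- data.frame(
  id = 1:5,
  name = c("paul", "helen", "emily", "john", "mark"),
  stringsAsFactors = FALSE
)

mydf1
mydf2
```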
2- Merge mydf1 and mydf2 by their “id” column and store the result in mydf3. Look for the help page of merge and/or Google it!
correction
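A short sketch of the merge is given below, assuming the result is stored in mydf3 (the name used in the next question).

```r
# Recreate the two input data frames so that the example runs on its own
mydf1 <- data.frame(id = 1:4, age = c(14, 12, 15, 10))
mydf2 <- data.frame(
  id = 1:5,
  name = c("paul", "helen", "emily", "john", "mark"),
  stringsAsFactors = FALSE
)

# Merge on the shared "id" column
mydf3 <- merge(mydf1, mydf2, by = "id")
mydf3
```

By default merge() keeps only the ids found in both data frames, so id 5 (mark) is dropped. Adding all = TRUE would keep that row, with NA in the age column.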
3- Order mydf3 by decreasing age. Look for the help page of order.
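order() returns the row indices in sorted order, which can then be used to rearrange the rows. The sketch below assumes mydf3 is the merged data frame from the previous question; mydf3_sorted is an illustrative name.

```r
# Recreate the merged data frame so that the example runs on its own
mydf1 <- data.frame(id = 1:4, age = c(14, 12, 15, 10))
mydf2 <- data.frame(
  id = 1:5,
  name = c("paul", "helen", "emily", "john", "mark"),
  stringsAsFactors = FALSE
)
mydf3 <- merge(mydf1, mydf2, by = "id")

# Row indices from oldest to youngest, then reorder the rows accordingly
mydf3_sorted <- mydf3[order(mydf3$age, decreasing = TRUE), ]
mydf3_sorted
```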
9.9.3 Exercise 5c
1- Using the download.file function, download this file to your current directory. (Right click on “this file” -> Copy link location to get the full path).
correction
2- The function dir() lists the files and directories present in the current directory: check if genes_dataframe.RData was copied.
correction
3- Load genes_dataframe.RData in your environment. Use the load function.
correction
4- genes_dataframe.RData contains the mydf_genes object: is it now present in your environment?
correction
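The list/load/check sequence of questions 2 to 4 can be tried out on a stand-in file, as sketched below. The file toy_dataframe.RData and the object toy_df are hypothetical and only serve the illustration; with the real file you would call load("genes_dataframe.RData") and look for mydf_genes instead.

```r
# Create a small stand-in .RData file in a temporary directory (hypothetical data)
toy_path <- file.path(tempdir(), "toy_dataframe.RData")
toy_df <- data.frame(gene = c("a", "b"), value = c(1, 2))
save(toy_df, file = toy_path)
rm(toy_df)

# Check that the file is listed in the directory
basename(toy_path) %in% dir(tempdir())

# load() restores the saved objects and returns their names invisibly
loaded_names <- load(toy_path)
loaded_names

# Check that the object is now present in the environment
exists("toy_df")
ls()
```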
5- Explore mydf_genes and see what it contains. You can use a variety of functions: str, head, tail, dim, colnames, rownames, class…
correction
6- Select rows for which pvalue_KOvsWT < 0.05 AND log2FoldChange_KOvsWT > 0.5. Store in the up object.
correction
# rows where pvalue_KOvsWT < 0.05
mydf_genes$pvalue_KOvsWT < 0.05
# rows where log2FoldChange_KOvsWT > 0.5
mydf_genes$log2FoldChange_KOvsWT > 0.5
# rows that satisfy both of the above conditions
mydf_genes$pvalue_KOvsWT < 0.05 & mydf_genes$log2FoldChange_KOvsWT > 0.5
# select rows for which pvalue_KOvsWT < 0.05 AND log2FoldChange_KOvsWT > 0.5
up <- mydf_genes[mydf_genes$pvalue_KOvsWT < 0.05 &
                 mydf_genes$log2FoldChange_KOvsWT > 0.5, ]
How many rows (genes) were selected?
7- Select from the up object the Zinc finger protein coding genes (i.e. the gene symbol starts with Zfp). Use the grep() function.
correction
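The pattern "^Zfp" matches strings that start with Zfp, because ^ anchors the match at the beginning. The sketch below uses a small made-up data frame, toy_up, with a hypothetical gene_symbol column; in the real up object, replace it with whichever column holds the gene symbols.

```r
# Toy stand-in for the up object (column name and values are hypothetical)
toy_up <- data.frame(
  gene_symbol = c("Zfp36", "Actb", "Zfp106", "Gapdh"),
  log2FoldChange_KOvsWT = c(1.1, 0.9, 2.3, 0.7),
  stringsAsFactors = FALSE
)

# grep() returns the positions of the elements that match the pattern
zfp_rows <- grep("^Zfp", toy_up$gene_symbol)
zfp_rows

# Use those positions to extract the matching rows
toy_zfp <- toy_up[zfp_rows, ]
toy_zfp
```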
8- Select rows for which pvalue_KOvsWT < 0.05 AND log2FoldChange_KOvsWT is > 0.5 OR < -0.5.
For the selection of log2FoldChange: give the abs function a try!
Store in the diff_genes object.
correction
# rows where pvalue_KOvsWT < 0.05
mydf_genes$pvalue_KOvsWT < 0.05
# rows where log2FoldChange_KOvsWT > 0.5
mydf_genes$log2FoldChange_KOvsWT > 0.5
# rows where log2FoldChange_KOvsWT < -0.5
mydf_genes$log2FoldChange_KOvsWT < -0.5
# rows where log2FoldChange_KOvsWT < -0.5 OR log2FoldChange_KOvsWT > 0.5
mydf_genes$log2FoldChange_KOvsWT > 0.5 | mydf_genes$log2FoldChange_KOvsWT < -0.5
# same as above but using the abs function
abs(mydf_genes$log2FoldChange_KOvsWT) > 0.5
# combine all required criteria
mydf_genes$pvalue_KOvsWT < 0.05 & abs(mydf_genes$log2FoldChange_KOvsWT) > 0.5
# extract corresponding entries
diff_genes <- mydf_genes[mydf_genes$pvalue_KOvsWT < 0.05 &
                         abs(mydf_genes$log2FoldChange_KOvsWT) > 0.5, ]
How many rows (genes) were selected?
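The number of selected rows can be obtained with nrow(). The sketch below checks the two selections on a small made-up data frame, toy_genes, whose values are invented for illustration; the same nrow() calls apply to the real up and diff_genes objects.

```r
# Toy stand-in for mydf_genes (values are invented for illustration)
toy_genes <- data.frame(
  pvalue_KOvsWT = c(0.01, 0.20, 0.03, 0.04, 0.001),
  log2FoldChange_KOvsWT = c(1.2, 2.0, -0.8, 0.1, -1.5)
)

# Significant and up-regulated
toy_up <- toy_genes[toy_genes$pvalue_KOvsWT < 0.05 &
                    toy_genes$log2FoldChange_KOvsWT > 0.5, ]

# Significant and changed in either direction
toy_diff <- toy_genes[toy_genes$pvalue_KOvsWT < 0.05 &
                      abs(toy_genes$log2FoldChange_KOvsWT) > 0.5, ]

# Count the selected rows
nrow(toy_up)
nrow(toy_diff)
```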