Module 2

The plan

Tabular files

A general tabular file (or tab separated text) is a table which columns are separated by the character \t.
An easy way to obtain this format is to export a spreadsheet (like the excel file) in TSV.

You can download that file here Download File or via command line:

wget https://biocorecrg.github.io/advanced_linux_2019/my_expression.txt.zip

unzip my_expression.txt.zip 

more my_expression.txt

ids	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	gene.name	gene.type
ENSMUSG00000027009.18	6367.702449	-5.403229286	0.164511219	-32.84413864	1.3811E-236	3.1294E-232	Itga4	protein_coding
ENSMUSG00000030730.12	6085.911781	-6.880702974	0.224575305	-30.63873367	3.7337E-206	4.2301E-202	Atp2a1	protein_coding
ENSMUSG00000020393.16	3811.600859	-3.392964274	0.113074912	-30.00634011	8.1118E-198	6.1268E-194	Kremen1	protein_coding
ENSMUSG00000031962.6	5177.660551	-6.5722255	0.224972968	-29.21340091	1.3105E-187	7.4238E-184	Cdh15	protein_coding
ENSMUSG00000034648.9	7607.79254	-5.034267672	0.17328645	-29.05170986	1.4641E-185	6.6352E-182	Lrrn1	protein_coding
ENSMUSG00000049641.14	901.8440812	-7.022597773	0.246253844	-28.51771841	7.0643E-179	2.6678E-175	Vgll2	protein_coding
ENSMUSG00000050315.13	877.4673602	-4.465899309	0.157998266	-28.26549573	9.1813E-176	2.972E-172	Synpo2	protein_coding
ENSMUSG00000063659.11	2505.765754	-3.799372346	0.134497081	-28.2487347	1.4752E-175	4.1783E-172	Zbtb18	protein_coding

Exercise

Based on your knowledge can you tell how many genes this table has?

Let’s try now to select the genes with a log2FoldChange that are up-regulated.

awk '{if ($3>=2) print $0}' my_expression.txt |head -n 5

ids	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	gene.name	gene.type
ENSMUSG00000041272.11	692.0809888	2.853351422	0.173751262	16.4220472	1.33003E-60	1.57786E-58	Tox	protein_coding
ENSMUSG00000051177.16	336.7060978	2.671611253	0.176307092	15.15316949	7.21991E-52	6.76016E-50	Plcb1	protein_coding
ENSMUSG00000063415.12	2250.241249	3.41532392	0.22558106	15.14011824	8.80552E-52	8.21088E-50	Cyp26b1	protein_coding
ENSMUSG00000106795.1	170.0185428	2.898178587	0.19589793	14.79432982	1.59364E-49	1.40507E-47	RP24-150D8.2	lincRNA

Exercise

Can you tell how many genes are up-regulated?

Now let’s try to extract the up and down-regulated.
I’m using shuf just to show you that both the up and down regulated are both there.
We will use the operators || that means OR

awk '{if ($3>=2 || $3<=-2) print $0}' my_expression.txt | shuf | head -n 6 
ENSMUSG00000057777.4	2264.035736	2.81870418	0.30180407	9.339516784	9.67748E-21	2.38091E-19	Mab21l2	protein_coding
ENSMUSG00000028838.11	186.4738697	3.633200922	0.647669945	5.609648789	2.02738E-08	1.52771E-07	Extl1	protein_coding
ENSMUSG00000064115.13	383.3757202	2.471663969	0.264982674	9.327643683	1.08251E-20	2.6546E-19	Cadm2	protein_coding
ENSMUSG00000025278.9	13010.44773	2.314262167	0.306197486	7.55807045	4.09092E-14	5.91175E-13	Flnb	protein_coding
ENSMUSG00000029384.5	3.863476129	-2.243196229	0.684465108	-3.277298146	0.001048056	0.003710611	AC134827.3	protein_coding
ENSMUSG00000071392.6	37.23725709	-4.98355982	0.541679915	-9.200193106	3.57307E-20	8.4866E-19	Ect2l	protein_coding

Sometimes you want to add more filters to be more stringent. In that case you can use the operator && that indicates AND. In this case we ask for both log2FoldChange >= 2 and padj <= 0.0001

awk '{if ($3>=2 && $6<=0.0001) print $0}' my_expression.txt | head -n 3

ENSMUSG00000041272.11	692.0809888	2.853351422	0.173751262	16.4220472	1.33003E-60	1.57786E-58	Tox	protein_coding
ENSMUSG00000051177.16	336.7060978	2.671611253	0.176307092	15.15316949	7.21991E-52	6.76016E-50	Plcb1	protein_coding
ENSMUSG00000063415.12	2250.241249	3.41532392	0.22558106	15.14011824	8.80552E-52	8.21088E-50	Cyp26b1	protein_coding

As you can see this combination removes the header. If you want to have it back you can add another IF before the current one.

awk '{if ($1=="ids") {print $0} else if ($3>=2 && $6<=0.0001) print $0}' my_expression.txt | head -n 5

ids	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	gene.name	gene.type
ENSMUSG00000041272.11	692.0809888	2.853351422	0.173751262	16.4220472	1.33003E-60	1.57786E-58	Tox	protein_coding
ENSMUSG00000051177.16	336.7060978	2.671611253	0.176307092	15.15316949	7.21991E-52	6.76016E-50	Plcb1	protein_coding
ENSMUSG00000063415.12	2250.241249	3.41532392	0.22558106	15.14011824	8.80552E-52	8.21088E-50	Cyp26b1	protein_coding
ENSMUSG00000106795.1	170.0185428	2.898178587	0.19589793	14.79432982	1.59364E-49	1.40507E-47	RP24-150D8.2	lincRNA

You can redirect the output of AWK to one or more files. So in one row you can extract the genes up and down regulated and send them to two different files

awk '{if ($3>=2) print $0 > "up_reg.txt"; if ($3<=-2) print $0 > "down_reg.txt" }' my_expression.txt

Careful that in this way the up_reg.txt will contain the header and the down_reg.txt not.

Next Session

Gtf file format