Data used in this course

For this course, we will use a public data set from the Gene Expression Omnibus (GEO): GSE76647.

Exercise:

Go to the GEO page corresponding to entry GSE76647
What is the organism the samples come from? Which types of cells? What is the goal of the experiment?
What information do you get about the sequencing protocol?
Retrieve the SRA codes corresponding to all samples (from the SRA Run Selector).

Data details

Data set GSE76647 has 10 Homo sapiens samples from differentiated (5-day differentiation) and undifferentiated primary keratinocytes.
Some of the samples underwent a knock-down of the FOXC1 gene:

GEO ID	SRA ID	Sample name	Differentiation	Condition
GSM2031982	SRR3091420	5p4_25c	undiff	WT
GSM2031983	SRR3091421	5p4_27c	undiff	WT
GSM2031984	SRR3091422	5p4_28c	diff 5 days	WT
GSM2031985	SRR3091423	5p4_29c	diff 5 days	WT
GSM2031986	SRR3091424	5p4_30c	diff 5 days	WT
GSM2031987	SRR3091425	5p4_31cfoxc1	undiff	KD
GSM2031988	SRR3091426	5p4_32cfoxc1	undiff	KD
GSM2031989	SRR3091427	5p4_33cfoxc1	undiff	KD
GSM2031990	SRR3091428	5p4_34cfoxc1	diff 5 days	KD
GSM2031991	SRR3091429	5p4_35cfoxc1	diff 5 days	KD

The raw data can be downloaded with fastq-dump (part of the SRA Toolkit), as follows:

# Go to the "raw_data" folder
cd ~/rnaseq_course/raw_data

# Download raw data files
for sra in SRR309142{0..9}
do
  echo "$sra"
  $RUN fastq-dump --gzip --origfmt --skip-technical --split-files "$sra"
done
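
After a long download, it can be worth checking that every run produced at least one FASTQ file. The sketch below is one possible way to do that; `missing_runs` is a hypothetical helper (not part of the SRA Toolkit), and it assumes the output files are named `<run>_<n>.fastq.gz`, as produced with `--split-files --gzip`. It is demonstrated on a temporary folder of empty placeholder files rather than on real data.

```shell
# Hypothetical helper: print every run from SRR3091420 to SRR3091429
# that has no matching FASTQ file in the given folder.
missing_runs() {
  dir=$1
  for i in 0 1 2 3 4 5 6 7 8 9; do
    sra="SRR309142$i"
    # ls fails when the pattern matches no file
    ls "$dir/$sra"_*.fastq.gz > /dev/null 2>&1 || echo "$sra"
  done
}

# Demo on placeholder files: the last run is deliberately left out
demo=$(mktemp -d)
for i in 0 1 2 3 4 5 6 7 8; do
  : > "$demo/SRR309142${i}_1.fastq.gz"
done
missing=$(missing_runs "$demo")
echo "missing: $missing"
rm -r "$demo"
```

On the course data, the helper would be called as `missing_runs ~/rnaseq_course/raw_data`; an empty result means that all ten runs are present.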

BACK UP

In case downloading those files is too slow, we have prepared FASTQ files restricted to chromosome 6 only. Download them as follows:

# go to raw data folder
cd ~/rnaseq_course/raw_data

# archive containing the 6 fastq files for chromosome 6 only
wget https://public-docs.crg.es/biocore/projects/training/PHINDaccess2020/fastq_chr6.tar.gz

# extract archive
tar -xvzf fastq_chr6.tar.gz

# remove .tar.gz file
rm fastq_chr6.tar.gz

How many reads are there per sample?
What is the read length?

Number of reads:

cd ~/rnaseq_course/raw_data/fastq_chr6

# With grep (may overcount: a quality line can also start with "@")
zcat SRR3091420_1_chr6.fastq.gz | grep "^@" | wc -l

# With paste
zcat SRR3091420_1_chr6.fastq.gz | paste - - - - | wc -l

# With awk
zcat SRR3091420_1_chr6.fastq.gz | awk 'BEGIN{i=0}{i++;}END{print i/4}'
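
The three approaches do not always agree: "@" is also a valid quality character, so a quality line can start with it and be counted by grep as if it were a header. The sketch below illustrates this on a toy two-read file built on the fly (the read names and sequences are made up for the illustration).

```shell
# Build a tiny two-read FASTQ whose second quality string starts with "@"
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n@III\n' | gzip > "$tmpdir/toy.fastq.gz"

# grep counts every line starting with "@", including the quality line
grep_count=$(zcat "$tmpdir/toy.fastq.gz" | grep -c "^@")

# Dividing the number of lines by 4 does not depend on the line contents
awk_count=$(zcat "$tmpdir/toy.fastq.gz" | awk 'END{print NR/4}')

echo "grep: $grep_count  awk: $awk_count"
rm -r "$tmpdir"
```

Here grep reports 3 while the file holds only 2 reads, which is why the paste and awk variants are the safer choice for counting.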

For all samples:

for fastq in *fastq.gz
do
  echo "$fastq"
  zcat "$fastq" | awk 'BEGIN{i=0}{i++;}END{print i/4}'
done

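Read length:

# Print the length of every read
# (tab is set as the field separator because read headers may contain spaces)
zcat SRR3091420_1_chr6.fastq.gz | paste - - - - | awk -F'\t' '{print length($2)}'

# Summarize: number of reads for each read length
zcat SRR3091420_1_chr6.fastq.gz | paste - - - - | awk -F'\t' '{print length($2)}' | sort -n | uniq -c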
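
When reads have many different lengths (for example after trimming), a compact summary can be easier to read than the full table. The following is a minimal sketch of a single-pass alternative; `read_length_stats` is a hypothetical helper name, and it assumes a standard four-line-per-record FASTQ stream on standard input. It is shown here on a toy input with made-up reads.

```shell
# Hypothetical helper: print the number of reads and the minimum, maximum
# and mean read length of a FASTQ stream read from standard input.
read_length_stats() {
  awk 'NR % 4 == 2 {                 # the sequence is line 2 of each record
         len = length($0)
         if (n == 0 || len < min) min = len
         if (len > max) max = len
         sum += len
         n++
       }
       END {
         if (n > 0) printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
       }'
}

# Toy input: three reads of length 4, 6 and 8
stats=$(printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nACGTACGT\n+\nIIIIIIII\n' | read_length_stats)
echo "$stats"
```

On the course data, it would be used as `zcat SRR3091420_1_chr6.fastq.gz | read_length_stats`.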